Sunday, March 11, 2012

Auditing Data Changes

I am using the SCD Wizard and it is working nicely. Can someone point be to an article/tuorial that would explain how you could create an "audit trail" on the items that may have been changed (type I and II)?

Basically, what I want to be able to do is run a query that tells me what data may have changed. I figured I would have to create an auditkey field in my table which would then link the key to the change detail?

The idea behind type 1 changes is that the changes AREN'T tracked. That's the point of type-1 SCDs. Type 2s ARE tracked by their very nature, so reporting on them should be easy. Type 2 records that have changed have "expired" records either indicated by some current flag or by some end date field.|||

Hi Jrp210,

I know of no articles or white papers on this, although some probably exist.

In my mind, this is ETL metadata. In larger ETL, I currently build a database that tracks row counts for the source data and then for different types of changes (including no change). It provides a the means for a quick "sanity-check" at the end of each load: I add up the rows coming out of the SCD Wizard outputs and see if this total matches the rows going in.

If I'm understanding you, you would like to take this to another level and store enough information to identify the rows that have changed, along with which type of change was made.

To tackle this scenario, I would probably use a small field appended to each destination dimension and fact table - tinyint would probably do the trick. If you make this field Null-able, I would start by updating this field - setting it to Null - before each load. For instance, let's say you have a table called dbo.DimStuff that looks like this:

StuffSK int identity(1,1),

StuffBK varchar(50),

StuffName varchar(50)

You could add another field:

StuffChangeType tinyint NULL

When your SSIS package starts, you could include an ExecuteSQL Task with the following statement:

UPDATE dbo.DimStuff

SET StuffChangeType = NULL

This would set all the StffChangeType values to Null prior to executing the load. Then you could add a Derived Column to each output of the Slowly Changing Dimension Wizard that adds a column named StuffChangeType and assigns an appropriate value (maybe 1 for updates, 2 for historical, 3 for new, etc.)

This would add change type data for each row changed - and let you know which rows were not changed.

You could then execute a query like:

SELECT StuffBK, StuffChangeType

FROM dbo.DimStuff

This would give you a picture of the business key and change types. You could qualify with a where clause:

WHERE StuffChangeType IsNotNULL

to only get changes (and change types) since the last load.

You could also do a nice summary with a query like:

SELECT StuffChangeType,Count(*)AS'ChangeTypeCount'

FROM dbo.DimStuff

WHERE StuffChangeType IsNotNULL

GROUPBY StuffChangeType


Hope This Helps,

Andy

|||But Andy, if you want to track Type 1 changes, make them Type 2. When using Type 2 SCDs, use begin and end dates to identify current and historical rows. The auditing is done for you by the nature of SCDs. That's the point of them in the first place. If you want to keep track of which LOAD processed the change, add another column, SessionKey, or something like that.|||

Hi Phil,

I think he's asking for something different from SessionID, and for something more than just Type 2 tracking. To me, it sounds as if he has a business requirement to update certain data - or perhaps lacks a business requirement to track changes on that data - and still wants to track the fact that it changed. I don't think SCDs cover that, and I think the solution I offered does. It's not the only way, but I think it will accomplish what he's after.

If not, perhaps it will help someone else in the future! Smile

Thanks,

Andy

|||Along the path you're thinking Andy, the Kimball Webcast on the first page of this forum (towards the top) talks about auditing. In there, an audit entry is recorded and an identity value is returned that is used within the applicable pieces within SSIS. This value would correspond to the "session" of this load, for this package, etc... Talks about row counts, and other things. Very valuable.|||

Hi Phil,

Yep - that's good stuff.

I don't think that's what the original question was about, but it's definitely good stuff.

Thanks!

Andy

|||

Thanks for the responses. The SCD does do what it is intended to do but I probably should clarify a bit.

Perhaps an example will be the best way to explain.

For the sake of the example I have the following fields in my table:

CompanyKey

CompanyName

CompanyTicker

StartDate

EndDate

The SCD is setup to create historical changes for the name and ticker. This is done with StartDate and EndDate fields. Phil, you are right that I will have a date in the EndDate field if a change comes through for a CompanyName or CompanyTicker. But what I don't know unless I compare the two rows is what exactly changed. Now this is fairly easy to eyeball given this simple scenario. But in my real life scenario I have many fields that could have changed and its not easy to tell which one(s) did. I would like to create a way to determine which of the fields did change.

Hope this makes sense.

No comments:

Post a Comment