A common problem when building a data warehouse is to track and audit which records have had data quality updates or cleaning applied to them. A simple method is to add a bitmap that reflects the data quality updates have been applied.
If we have a staging table called Customers that has only three columns as follows:
We add an additional column called DQAUDIT NUMBER(38) and set it to 0 (zero)
The data we have (in csv format for ease) is:
1, Walker, David,0
The cleaning that is required is simple, it is to remove the leading spaces from both the forename and surname field. The surname update will affect record 1, whilst the forename update will affect records 1 and 3.
We can now perform our data cleansing as follows:
WHERE SURNAME LIKE ‘ %’
WHERE FORNAME LIKE ‘ %’
Note that with each query we increment the power by 1, this effectively sets a bitmask on the data quality
Our output now looks like this:
Where 3 represents the both queries being run and 2 represents only the second query being run.
Since this is a bit mask we can work out which queries have updated any given row in the database.
- The example used here is done in Oracle format but can be implemented in other database technologies
- The example is simplistic to show the power of the method and not as a full discussion of data cleaning
- Readers are expected to understand bitmaps and masks
- Each table should have its own power sequence i.e. start each table from zero when calculating the power