Auditing Data Cleaning Updates

How to track what has been updated by data cleaning processes

The Problem

A common problem when building a data warehouse is to track and audit which records have had data quality updates or cleaning applied to them. A simple method is to add a bitmap that reflects the data quality updates have been applied.

Example

If we have a staging table called Customers that has only three columns as follows:

CUST_ID NUMBER(38)
SURNAME VARCHAR(20)
FORNAME VARCHAR(20)

We add an additional column called DQAUDIT NUMBER(38) and set it to 0 (zero)

The data we have (in csv format for ease) is:

1, Walker, David,0
2,Fred,Bloggs,0
3,John, Smith,0

The cleaning that is required is simple, it is to remove the leading spaces from both the forename and surname field. The surname update will affect record 1, whilst the forename update will affect records 1 and 3.

We can now perform our data cleansing as follows:

UPDATE CUSTOMER
SET SURNAME=LTRIM(SURNAME),
DQAUDIT=DQAUDIT+POWER(2,0)
WHERE SURNAME LIKE ‘ %’
/

UPDATE CUSTOMER
SET FORENAME=LTRIM(FORENAME),
DQAUDIT=DQAUDIT+POWER(2,1)
WHERE FORNAME LIKE ‘ %’
/

Note that with each query we increment the power by 1, this effectively sets a bitmask on the data quality

Our output now looks like this:

1,Walker,David,3
2,Fred,Bloggs,0
3,John, Smith,2

Where 3 represents the both queries being run and 2 represents only the second query being run.

Since this is a bit mask we can work out which queries have updated any given row in the database.

Notes

  1. The example used here is done in Oracle format but can be implemented in other database technologies
  2. The example is simplistic to show the power of the method and not as a full discussion of data cleaning
  3. Readers are expected to understand bitmaps and masks
  4. Each table should have its own power sequence i.e. start each table from zero when calculating the power

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.