Detecting Changed Data

Introduction

When loading data warehouses, it is usually possible to decreases the load time very significantly by processing only changes since the last load, rather than completely refreshing all the data every time. This article describes one approach for detecting changes, which has been used successfully in a number of data warehouse projects.

Background

There are many different ways of capturing changes from an operational source system. This system may itself deliver change information, but this is quite rare. If the source system runs on a modern database, it may be possible to add triggers which capture the changed data. Usually concerns about performance or reliability prevent this approach. In most cases, the data warehouse load process receives a set of extract files which contain data from all relevant records, whether they have changed since the last load or not.

Data Load Approach

Our favoured approach, when dealing with extract files, is to load them into relational database tables with no change to the file structure and minimal validation. All the leading databases have fast and efficient utilities for loading data from files, and once the data is in a database, it is much easier to detect changes, as well as validating, cleaning and transforming the data into a structure suitable for loading into the data warehouse.

Figure 1

The Method

Figure 1 illustrates the technique for detecting changed data, once an extract file has been loaded into a database. The explanation of the diagram refers to Oracle, but the technique is applicable to any modern relational database.

Each box represents a database table, and summarises the process used to produce it.

The PREVIOUS_DATA and CURRENT_DATA tables represent two extracts of the same source data, which needs to be analysed to find new records, modifications and deletions.

Typically, PREVIOUS_DATA and CURRENT_DATA may be of any size, since they are copies of operational data. Normally, all the other tables shown in the diagram are small, since the numbers of new records, modifications and deletions are quite small during the time between extracting the PREVIOUS_DATA and CURRENT_DATA versions.

Each of the steps in the analysis uses set-level operations which work very efficiently in most situations.

The first step is to create the NEW_AND_MOD table, which will contain all the new and modified records. This can be done with the following SQL statement.

create table new_and_mod as
select * from current_data
minus
select * from previous_data;

If the table contains some control columns (such as date loaded, source file record sequence number, etc.) there should be a column list which excludes these, rather than the “*” which means compare all columns.

DEL_AND_MOD will contain all the deleted and modified records and is created as follows:

create table del_and_mod as
select * from previous_data
minus
select * from current_data;

These first two steps access all the data in the CURRENT_DATA and PREVIOUS_DATA tables twice. All further steps use only the changed data which is usually very much smaller.

The next three steps separate the keys of the new, modified and deleted records, as follows:

create table new_key as
select [key_list] from new_and_mod
minus
select [key_list] from del_and_mod;

create table mod_key as
select [key_list] from new_and_mod
intersect
select [key_list] from del_and_mod;

create table del_key as
select [key_list] from del_and_mod
minus
select [key_list] from new_and_mod;

[key_list] is specific for each table. Normally we generate this from metadata which records the definition of each table.

The final stage is to create the whole records for each of the cases: new, modified and deleted. This can be done as follows:

create table new_row as
select a.*
from
new_and_mod a,
new_key b
where
a.[key_list] = b.[key_list];

create table mod_row as
select a.*
from
new_and_mod a,
mod_key b
where
a.[key_list] = b.[key_list];

create table del_row as
select a.*
from
del_and_mod a,
del_key b
where
a.[key_list] = b.[key_list];

There may be no need to create the DEL_ROW table if the content of deleted rows is not interesting. When there is a need to keep full audit information for deleted rows, the DEL_ROW table would be required.

Pre-requisites

For this technique to work, every record must have a unique key. If there are two records with the same key then this process will report that each is a modification of the other. It is very important to identify unique keys in every source extract file. Without unique keys it is not meaningful to update existing records.

In a relational database, primary keys must be unique, and no part of the primary key may be null. The technique described here, does work even if part of the key is null, so long as the entire key is unique. This is because the set arithmetic (intersect and minus) does consider null to be equal to null, whereas the comparison operators used in the SQL where clause exclude null values – one null value is neither equal nor unequal to another value – they are both unknown.

Performance

This technique seems unlikely to be efficient for those of us familiar with online transaction processing systems, where the full table scan is to be avoided at all costs. However, when processing bulk loads, full table scans are often the most efficient method of processing. Running this process immediately after extracting the data, means that from this point onwards we are dealing only with the new, modified and deleted records. This can improve overall extract, transform and load times by several orders of magnitude.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.