Building Data Marts

This article descibes building data marts from a data warehouse or transaction repository. It discusses the merits of full and incremental loads for different data marts.

 

The types of data marts to load

Having built our transaction repository or data warehouse the next job is to build our data marts. The design of data marts is discussed elsewhere however it is important to note that there are two distinct types of data mart that will affect the loading strategy we are deploying:

  • Temporal Data MartsThese are data marts where all the history is displayed, that is each dimension reflects a number of historical hierarchies and the facts are stored against the appropriate hierarchy. An example of this would be an organisational hierarchy that has changed over time and facts are reported against hierarchy at the time the event took place rather than the hierarchies current shape. These data marts are used for in depth trend analysis but tend to be too slow for users who want to view the current position.
  • Current or Non-Temporal Data MartsThese are data marts where all the history is not retained and the hierarchies in the dimensions reflect the current hierarchy only. For example the organisation hierarchy would reflect current management over all historical facts and not allow analysis of how organisational change has affected the facts. This has the advantages for current analysis that it reduces the size and simplifies the data in the dimensions and if the facts contain status information will also reduce the size of the fact table.

Loading a Temporal Data Mart

The loading of a temporal data mart can take one of two routes

 

  • Complete RefreshGiven the fact that the transaction repository is holding all the data in nearly the right format it is often easy to write the data mart load in such a way that it reloads completely each time. This is done by truncating the tables and then just loading the data again. The big advantage is that it captures every thing that is in the transaction repository and reduces the backup requirements, however if the data mart is going to get very large over time then the performance of the load will deteriorate and therefore a cumulative refresh (see below) may be more appropriate. Complete refreshes never have to worry about whether they have captured all the data if more history is added retrospectively.
  • Cumulative RefreshGiven that this is a temporal data mart and that the design will have retained the last data warehouse key loaded it is possible to just append to the data mart all facts from transaction repository that have appeared since the last load. This is particularly useful where there is some issue in the production system and the data mart is not loaded for a day or two as it can detect and load all data since the last load. This speeds up the load process considerably for large data marts (although it is worth noting that performance will still deteriorate over time as the target table is still getting bigger). Cumulative refreshes have to be backed up more reliably and can not load history, although manually truncating the table will in effect force a complete refresh.

Loading a Non-Temporal Data Mart

Loading of non-temporal data marts is normally done as a complete refresh. This is because the load will become difficult due to the updates required on facts and particularly on dimensions. However there are two possible sources of the data:

  • Loading from the Transaction RepositoryIf we load from the transaction repository we are going directly to the source and we can build the current data mart without relying on the temporal data mart having been loaded first. This gives us the distinct advantage where time to make the new data available is critical however it may be a longer load than using the temporal data mart as it will have to repeat denormalisations already carried out in the temporal data mart.
  • Loading from the Temporal Data MartThis method is attractive because not only does it take advantage of the work already done by the load of the other data mart but with careful design may be able to carry keys through that would allow drill across to the temporal data mart. As long as the dependency on the load time for the temporal data mart is not an issue then this would be the preferred route.

Performance considerations

Often when loading data marts designers will build complex queries to try and build the complete dimension or the complete fact. It is important to remember that for a data mart build all constraints can be disabled and then a number of simple queries constructed. This first query (or queries where multiple sources are required) should create all the rows, however other attributes can be done as a number of update statements onto the table afterwards. This will often be considerably faster and less prone to falling over than the giant query that will use lots of memory and temporary space get the result set required. It also makes the process restartable.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.