Much of data warehousing is dependent on running regular jobs to get collect data and load it into the system. In order to demonstrate both fixed and relative timing methods we use an example as outlined below:
A warehouse has two source systems. The first system produces an extract file once a night at a given time (for this example at 8pm). The second system provides a set of transactions for a period of time (for this example every half an hour) then sends a file to the data warehouse server. Once the extract from the first system and all the available files from the second system are loaded a data mart can be built from the data warehouse
Note that since this is an example of scheduling the complexity has been removed, however more discussion on the complexity of the load process and the use of directed graphs (or digraphs) can be found in our knowledge base.
Fixed time job scheduling
Fixed time job scheduling (also known as absolute time job scheduling) assumes a number of known times for actions to occur. In our example the first system provides a file at 8pm and we might therefore reasonably schedule our job at 8.30. It will need to check if the file exists, and if not it may choose to sleep for a period of time before checking again. At some point, after say three tries it will decide to fail the job for that night and try again tomorrow.
The files from the second system can either be loaded by a second cron job, or as either a dependency or precursor to the first job. In either case the elapsed time will contain some element of contingency against the planned file delivery time.
Furthermore the loading of the second system will all occur during the critical batch window for the day and (assuming our half hour pushes) will have forty-eight files to load.
Relative time job scheduling
Unix and many commercial schedulers offer an alternate to fixed time. In unix this is implemented via the ‘at’ command. This simple mechanism allows a much more granular approach to scheduling data warehouse loads.
Once again we can assume that the file exists and is ready to load. If it is then the file will be loaded. However whether a file is loaded or not the script executes the following:
at -f script 2>>script.log now +5 minutes
where script is replaced with the path and name to the script and script.log is the path and name to the log file you want.
What now happens is the script will run again 5 minutes after the completion of the previous script. The ‘at’ interval can be tuned from minutes to years. It may appear that we could achieve the same thing by setting the cron interval to 5 minutes but this is not so.
Comparison of methods
If we use the fixed method and schedule every five minutes ‘cron’ the we have no problem with the first extract, however the second extract has an issue around controlling the number of files being processed. The first job that starts will grab the available file. It will still be processing this file when the second job starts and therefore two things may happen. Firstly without careful programming the file may be re-processed. Secondly the system resource load is rising as I now have two jobs serving the same purpose running.
With the relative method the first extract again has no issues although the delay will now be much lower than the scheduled version as no contingency is required. The second system extract will however pick up the first file. Only once it has finished processing the file will it start waiting for another five minutes. This avoids the risk of re-processing files and manages system resource load becuase there is only ever one job running.
If we use relative scheduling we also find that our batch window has shrunk because having loaded files thoughout the day as they become available they are now no longer part of the major load during the critical batch window.
In practice both forms of scheduling have a place in building a data warehouse however the use of relative timing verses fixed timingis often over looked and therefore many data warehouse batch schedules are longer than they need to be.
With some simple analysis and the use of relative timing batch window time usage is greatly reduced.