ETL tool transformations, data models, query tools & report writers, analysis documents and data quality reports are all part of the collection of information gathered during the build of a data warehouse. Once in production the tracking of issues that arise from day to day operations and the relationship to the ETL process also need tracking.
As a consultancy we normally steer clear of recommending specific products as generic solutions and only do so for particular clients who need tools to meet specific situations. However we have generally noted most organisations do not do the configuration management and issue tracking well and that commercial solutions, unless part of an existing corporate infrastructure, are costly to deploy.
If this is normally so expensive what should we do ?
The premise of this article is using open source software. Don’t stop reading – like the software this advice is free so it is worth a look.
Whatever is wrong with individual pieces of open source software the one thing that has worked well is the issue tracking and source code control. Sites such as http://sourceforge.net have over 82,000 projects and 858,000 users (as at 1st June 2004).
There are two pieces of software that stand out and are widely used for configuration management and issue tracking. These are:
- CVS: CVS is a version control system and important component of Source Configuration Management (SCM) that records the history of sources files and documents.
- Bugzilla: Bugzilla is one example of a class of programs called “Defect Tracking Systems”, or, more commonly, “Bug-Tracking Systems” that groups of developers to keep track of outstanding bugs in their product effectively.
What does CVS offer me ?
CVS provides a repository on a server (Windows or Linux – see later) that can store and version files (including binary files and special format files such as Microsoft Word). When you create a file you ‘check it in’ and when you want to read it again or deploy it you ‘check it out’. If you want to edit the file then you check it out in a locked mode to prevent others editing it whist you are. It is also possible mutliple developers to check out the code, work on it and check it back in and ask for conflict resolution etc. Finally you can tag ‘releases’ and have code branhes off the main development that allow new concepts to be tried without losing any of the code already developed
By default it comes with a faily standard Unix style command line interface but there are many useful client interfaces such as WinCVS that provides a graphical user interface for most platforms (Windows/Unix/Linux/Mac) and CVSWeb that provided a browser based interface to view the repository. The WinCVS site also has links to other clients and other sites about CVS. A fully downloadable book can also be found at http://cvsbook.red-bean.com/
What does Bugzilla offer me ?
Bugzilla provides a simple web and e-mail based issue tracking system that allows mutiple products (or projects) and their components to have issues raised and prioritised and tracked through to resolution.
If features :
- integrated, product-based granular security schema
- inter-bug dependencies and dependency graphing
- advanced reporting capabilities
- a robust, stable RDBMS back-end
- extensive configurability
- a very well-understood and well-thought-out natural bug resolution protocol
- email, XML, console, and HTTP APIs
- available integration with automated software configuration management systems, including Perforce and CVS (through the Bugzilla email interface and checkin/checkout scripts)
So where is the catch ?
There isn’t one, all you need is a small Windows or Linux server that runs a web server such as IIS or Apache with enough disk to hold your repositories. Our 10,000 issue bugzilla database takes 5Mb and we have about 1Gb or source code from mutliple projects. A machine of entry level standard with a modern but not superfast processor, 0.5Gb memory and 100Gb hard disk and a backup device (recordable DVD or tape drive) with an OS of your choice will easily support most organisations.
What should we put into CVS ?
The CVS repository should hold everything you need to (re-)build your data warehouse this will include:
- Requirement Specifications
- Analysis and Design Documents
- Data Models (if you use a tool such as ErWin or Oracle Designer take an export of the repository and check it in after you have made any changes
- Universes and End User Layers (if you use a tool such as Business Objects or Oracle Discoverer export the metadata for the business layer and report definitions)
- System Configuration scripts (how the systems and databases that hold your data warehouse are built)
- Manuals and user/system docuemntation
- ETL Metadata exports
- Any other metadata
- Any other documents
What should we put into Bugzilla ?
The Bugzilla repository should hold everything you need to track issues, changes and enhancements to your data warehouse this will include:
- All issues (prioritised via the tool)
- All enhancement requests
- All dependencies
- Links to the CVS repository
- Comments and observations about the issue
- Issue status
- Anything else that relates to a bugzilla feature
If your orgainsation has well organised procedures and development practices with existing tools, has a large budget to buy tools and external consultants to deploy them, or is so completely adverse to the fact that something available for free via open source could be useful then our plea in the second paragraph to ‘not stop reading … this advice is free’ was misguided.
If, however, your project:
- has no budget for this type of facility but you’ve realised you desparately need it
- has lost a day/week/month’s worth of work and don’t want to make the same mistake after you have recovered
- has an enthusiastic techie who is willing to set it up and have a go
- needs to prove it is in control of the development cycle to the project office/auditors
- just wants to do the right thing
then hopefully this article has given you some pointers, and of course if you want some further help you can also contact us