When designing a data warehouse, the convention over configuration debate often leads to heated arguments among technical people over the ‘best’ way to do things.
Convention can be used to define the standard way to design a data model (see Process Neutral Data Modelling) or to implement an ETL transformation (for example a reducing sets approach), amongst other things, but inevitably someone will claim to be able to make an individual element ‘better’ by configuration.
For example, if we always handle data change in an ETL mapping in a certain way, then unless we are really sure of the benefits we should stick to the convention. Suppose, in our example, a DBA offers to make a single ETL mapping faster and can halve its run time – so what is the harm?
Firstly, if the new algorithm is just as complete but faster, then there is no harm. In fact I would encourage you to update your conventions and re-factor your code so that all of the mappings exploit the new algorithm.
But what if the faster algorithm works because it omits some element? Maybe it doesn’t need to check all the columns, perhaps there is an assumption of referential integrity, or it joins two steps together where other tables need the intermediate results.
This is where the cost creeps back in. In the future the omitted column is needed, referential integrity no longer holds, the intermediate step is required, etc. At best this causes unnecessary rework. At worst it causes silent corruption of the data that triggers a crash or takes months to unravel. It also means that, because the code does not conform to the standard, the person fixing it has to work through it to understand it first; only then, assuming they understand it perfectly, can they fix it – all of which takes more resource time and cost.
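The column-omission case above can be made concrete. The sketch below (hypothetical column names and rows, not from the original article) contrasts a conventional change-detection routine that compares every column with a ‘faster’ variant that only compares the columns assumed to change:

```python
# Convention: detect change by comparing every column of the old and new row.
def changed_all_columns(old_row: dict, new_row: dict) -> bool:
    return any(old_row[col] != new_row[col] for col in old_row)

# Shortcut: compare only a tracked subset of columns, assumed to be the
# only ones that ever change. Faster, but the assumption is silent.
def changed_subset(old_row: dict, new_row: dict, tracked: set) -> bool:
    return any(old_row[col] != new_row[col] for col in tracked)

old = {"customer_id": 1, "name": "Ann", "status": "active"}
new = {"customer_id": 1, "name": "Ann", "status": "closed"}

print(changed_all_columns(old, new))       # True  – the status change is seen
print(changed_subset(old, new, {"name"}))  # False – the change is silently missed
```

The shortcut is genuinely faster, and it works until the day the untracked column starts changing – at which point the warehouse quietly stops reflecting reality, which is exactly the silent corruption described above.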
The convention over configuration approach does not mean that there is only one way to do something, nor that the chosen way is optimal for every individual case – it means that there is a defined set of algorithms for a given process and that these are collectively optimal for the solution: a case of looking at the whole picture rather than at individual elements. The process neutral data modelling technique will often create a data model with over 1,000 tables, yet there are only ten named approaches to the data modelling issue. How many ETL algorithms will you need? My guess is that it is in the same order of magnitude, about 100:1 against the total number of transformations. What is more, the number of ETL transformation algorithms required drops dramatically if the data model is uniform, because it is easy to describe a standard ETL algorithm for a standard data model.
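To illustrate why a uniform data model shrinks the number of ETL algorithms needed, here is a minimal sketch (the table and column naming conventions are invented for illustration, not taken from the article): if every table follows the same conventions – say, a surrogate key named `<table>_id` and a matching staging table – then a single generic routine can emit the load statement for any table, so one algorithm serves hundreds of mappings.

```python
def standard_upsert_sql(table: str, columns: list) -> str:
    """Generate a standard MERGE (upsert) for any table that follows
    the conventions: surrogate key <table>_id, staging table <table>_stage."""
    key = f"{table}_id"  # convention: surrogate key name
    updates = ", ".join(f"{c} = s.{c}" for c in columns if c != key)
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"s.{c}" for c in columns)
    return (
        f"MERGE INTO {table} t USING {table}_stage s ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET {updates} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )

sql = standard_upsert_sql("customer", ["customer_id", "name", "status"])
print(sql)
```

One convention-driven generator like this replaces a hand-written mapping per table; a bespoke mapping that breaks the convention falls outside it and must be written, understood, and maintained separately.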
Returning to the start of this article – your DBA can do this mapping better, but in what sense? Is it faster, more maintainable, easier to understand, or cheaper to run? The decision to go the bespoke route for this one case should balance these factors and their costs against the necessity of the performance gain.
Experience tells me that in 99.9% of cases bespoke is not worthwhile in the long term.
This article was originally published on BIonRails, another Data Management & Warehousing website