Reprint of an article published by SAP about a recent piece of work
I have recently been asked to build an analytical platform for a project – but what is an analytical platform? The client (a retailer) described it as a database where they could store data, a front end where they could do statistical work ranging from simple means, standard deviations, etc. through to more complex predictive analytics which could be used to (for example) analyse past performance of a customer to assess how likely it is that the customer will exhibit a future behaviour, or using models to classify customers into groups and ultimately to bring these two together into an area known as decision models. The customer had also come up with an innovative way to resource the statistical skills needed by offering work placements to Masters students studying statistics at the local university and getting them to work with the customer insight team to describe and develop the advanced models; all they needed was a platform to work with.
From a systems architecture and development perspective we could describe the requirements in three relatively simple statements
- Build a database with a (very) simple data model that could be easily loaded, was capable of supporting high performance queries and did not consume a massive amount of disk space. It would also ideally be capable of being placed in the cloud.
- Create a web-based interface that would allow users to securely log on, write statistical programs that could use the database as a source of data and output reports and graphics and well as populating other tables (e.g. target lists) as a result of statistical models.
- Provide a way in which to automate the running of the statistical data models, once developed, such that they can be run without engaging the statistical development resources.
Of course time was of the essence and costs had to be as low as possible too – but we’ve come to expect that with every project!
Step 1: The database
Our chosen solution for the database was a Sybase IQ database, a technology our client was already familiar with. Sybase IQ is a ‘column store’ database. This means that instead of storing all the data in its rows as many other databases do the data is organised on disk by the columns. For example if a column contains a field for country it will have the text of each country (for example) ‘United Kingdom’ stored many times. In a column store database the text is only stored once and given a unique ID. This is repeated for each column and therefore the ‘row’ of data consists of a list of IDs linked to the data held for each column.
This approach is very efficient for reporting and analytical databases. Firstly by replacing text strings with an identifier significantly less space is used. In our example ‘United Kingdom’ would occupy 14 bytes, whist the ID might only occupy 1 byte – consequently reducing the storage for that one value in that one column by a ratio of 14:1 – and this is compression effect is repeated for all the data. Furthermore because there is less data on the disk the time taken to read the data from disk and process it for queries is significantly reduced which consequently massively speeds up the queries too. Finally each column is already indexed which again helps the overall query speed.
Another incidental but equally useful consequence of using a column store database such as Sybase IQ is that there is no advantage in creating a star schema as a data model – instead holding all the data in one large ‘wide’ table is at least as efficient. This is because by storing each column with a key it means that the underlying storage of data is a star schema. Creating a star schema in a column store database rather than a large single table would mean incurring unnecessary additional join and processing overhead.
As a result of choosing Sybase IQ as a column store database we are able to have a data model that consists of a number of simple single table data sets (one table for each different type of data to be analysed) that is both quick and easy to load and to query.
It should be noted that this type of database solution is less efficient for OLTP (Online Transaction Processing) type applications because of the cost of doing small inserts and updates. However this is not relevant for this particular use case.
The solution can only be deployed on a Linux platform. We use Linux for three reasons; firstly RStudio Server Edition is not yet available for Windows, secondly the availability of pre-compiled packages for all elements of the solution on Linux which reduces the install effort and finally hosted Linux environments are normally cheaper than Windows environments due to the cost of the operating system licence. We chose CentOS because it is a RedHat derivative that is free.
One additional advantage of this solution for some organisations is the ability to deploy it in the cloud. Since the solution only requires files to be remotely delivered and all querying is done via a web interface it is possible to use any co-location or cloud based hosting provider. Co-Location or cloud deployment offers a low start-up cost, reduced systems management overhead and easy access for both data delivery and data access. The system simply requires SSH access for management, FTP, SFTP or SCP for file delivery and the RStudio Web Service port open. RStudio Server uses the server login accounts for security but can also be tied to existing LDAP infrastructure.
Step 2: Statistical Tools and Web Interface
There are a number of statistical tools in the market, however most are very expensive, prohibitively so in this case and in addition the associated skills are hard to come by and expensive. However since 1993 an open source programming language called R for statistical computing and graphics has been under development. It is now widely used among statisticians for developing statistical software and data analysis and used by many universities and is predicted to be the most widely used statistical package by 2015. The R Project provides a command line and graphical interface as well as a large open source library of useful routines and is available as packaged software for most platforms including Linux.
In addition there is a second open source project called RStudio which provides a single integrated development environment for R and can be deployed on a local machine or as a web based service using the servers security model. In this case we have implemented the server edition in order to make the entire environment web based.
Therefore in two simple steps (download and install R, followed by download and install RStudio) we can install a full web-based statistical environment. Note that some configuring and pre-requisite packages may be required depending on your environment however these are well documented on the source websites and in general automatically download if using tools such as ‘yum’
The next step was to get access to the data held in our Sybase IQ. This proved to also be very easy. There is a Sybase white paper that describes the process that can be simply stated as:
- Install the R JDBC package
- Setup a JDBC connection
- Establish your connection
- Query the table
We now have an R object that contains data sourced from Sybase IQ that we can work with! And what is amazing is that I have built the platform from scratch in less than half a day’s work.
At this point data has to be loaded and the statisticians can get to work – obviously this is more time consuming that the build and over the days and weeks the analysts created their models and produced the results.
For this exercise we used our in-house ETL tool to create a repeatable data extraction and load process, however it would have been possible to use any of a wide range of tools that are available for this process.
Step 3: Automatically running the statistical models
Eventually a number of models for analysing the data had been created and we were ready to move into a production environment. We automated the load of the data into the agreed single table structure and wanted to also run the data models.
Sybase IQ has the ability to create User Defined Functions (UDF). These C++ programs talk to a process known as Rserve that in turn executes the R program and returns the results back to Sybase IQ. This allows R functions to be embedded directly into Sybase IQ SQL commands. Whilst requiring a little more programming experience to set up it does mean that all processing can be done within Sybase IQ.
Conversely it is possible to run R from the command line and call the program that in turn uses the RJDBC connection to read and write data to the database.
Having a choice of methods is very helpful as it means that it can be integrated with the ETL environment in the most appropriate way. If the ETL tool requires SQL only then the UDF route is the most attractive, however if the ETL tool supports host callouts (as ours does) then running R programmes from a command line callout is quicker than developing the UDF.
Business Intelligence requirements are changing and business users are moving more and more from historical reporting into predictive analytics in an attempt to get both a better and deeper understanding of their data.
Traditionally building an analytical platform has required an expensive infrastructure and a considerable amount of time to setup and deploy.
By combining the high performance, low footprint of Sybase IQ with the open source R & RStudio statistical packages it is possible to quickly and easily deploy an analytical platform in the cloud for which there are readily available skills.
This infrastructure can be used both for rapid prototyping on analytical models and in running completed models on new data sets to deliver greater insight into the data