This paper looks at a somewhat awkward data transformation, and at solutions written in SQL and in a procedural language. It describes some techniques which can be used to develop the solution in both languages. It also compares the solutions in terms of ease of development, performance and cost of maintenance.
Are we building transformations in the most effective way?
When working with other ETL developers, using the Oracle toolset to develop transformations, I often find that there is an assumption that the solution will be written in PL/SQL. PL/SQL is a very powerful language, and I have not found a transformation problem yet which could not be solved using PL/SQL. However, I believe that SQL – without the procedural framework provided by PL/SQL – is usually a better tool to use.
This paper looks at a somewhat awkward transformation, and at solutions written in SQL and PL/SQL. It describes some techniques which can be used to develop the solution in both languages. It also compares the solutions in terms of ease of understanding (so, how easy they are to develop and maintain) and performance. The underlying difference between the two approaches is that SQL encourages us to think in terms of sets of data, and PL/SQL is more amenable to record-at-a-time processing.
An example transformation for comparing two approaches
Figure 1 shows a typical transformation which an ETL developer might have to implement. This looks quite simple, but the business rules defining the transformation make the job more complicated.
Figures 2, 3 and 4 show some sample data for one incident, in the source tables INCIDENT and EVENT and in the resulting target table INCIDENT_EVENT.
Source and Target Incident-Event Models
There are a series of events for each incident. The join between the source INCIDENT and EVENT tables is simply on incident_number. Then to produce the target INCIDENT_EVENT it is necessary to think of the events being ordered by event_date within each incident (as shown in Figure 3). In the description which follows, the “current” event is whichever event we are currently looking at; the “previous” event is the event for the same incident which immediately precedes it in time, and the “next” event is the one which immediately follows it in time.
To create a row in INCIDENT_EVENT, we take INCIDENT_NUMBER and PRIORITY from INCIDENT. The START_DATE is the EVENT_DATE of the “current” event. END_DATE is the EVENT_DATE of the “next” event. PREVIOUS_STATUS is the STATUS of the “previous” event and CURRENT_STATUS is the STATUS of the current event.
The first event for an INCIDENT has no “previous” event, so PREVIOUS_STATUS is null for this row in INCIDENT_EVENT.
The last event for an INCIDENT has no “next” event, so END_DATE is null for this row in INCIDENT_EVENT.
Procedural or non-procedural?
The description of this transformation in terms of “an ordered list of events for each incident” might lead us to a procedural, record-at-a-time, solution. While such a solution will work, I would argue that a non-procedural, set-level solution is better.
To explore this argument, let’s take a look at both solutions.
The Non-Procedural, Set-Level, SQL Solution
Figure 5 shows a non-procedural, set-level solution, written in Oracle SQL. This looks quite complicated, so let’s break it down into more manageable pieces.
This solution makes extensive use of Oracle “inline views”, in which a sub-query takes the place of a table in the “from” clause. Other databases also support this functionality; in SQL Server, for example, they are called “derived tables”. Inline views help to partition the rules which are implemented by a complex query. Inline views can also provide solutions with good performance, especially for the bulk operations which are common in data transformation.
The section labelled with the comment ‘– A’ in Figure 5 finds the previous event for the current incident and the current (“this”) event. The previous event is the one with the maximum date which is less than the date of the current event. This sub-query returns an “inline view” called “link_previous”. If the current event is the earliest for the current incident the link_previous inline view will return no rows.
The sections labelled ‘– B’ in Figure 5 show the “previous” inline view, which gets the status of the previous event which we found in the “link_previous” inline view. We need to use nested views because the status is not one of the “group by” columns in the link_previous view.
The section labelled ‘– C’ in Figure 5 finds the next event for the current incident and the current (“this”) event. The next event is the one with the minimum date which is greater than the date of the current event. This sub-query returns an “inline view” called “link_next”. If the current event is the latest for the current incident the link_next inline view will return no rows.
The section labelled ‘– D’ in Figure 5 joins INCIDENT, the current event (EVENT), the previous event (previous) and the next event (link_next). There are outer joins to “previous” and “link_next” so that we get the first event (which does not have a previous event), and the last event (which does not have a next event) for each incident.
Figure 5: A Non-Procedural, Set-Level Solution
insert into incident_event ( incident_number, priority, start_date, end_date, previous_status, current_status ) select incident.incident_number incident_number, incident.priority priority, this.event_date start_date, link_next.next_date end_date, previous.previous_status previous_status, this.status current_status from incident incident, event this, ( -- C select -- C a.incident_number, -- C a.event_date current_date, -- C min(b.event_date) next_date -- C from -- C event a, -- C event b -- C where -- C a.incident_number = b.incident_number and -- C a.event_date < b.event_date -- C group by -- C a.incident_number, -- C a.event_date -- C ) link_next, -- C ( select -- B previous_event.incident_number, -- B link_previous.current_date current_date, -- B previous_event.status previous_status -- B from -- B event previous_event, -- B ( -- A select -- A a.incident_number, -- A a.event_date current_date, -- A max(b.event_date) previous_date -- A from -- A event a, -- A event b -- A where -- A a.incident_number = b.incident_number and -- A a.event_date > b.event_date -- A group by -- A a.incident_number, -- A a.event_date -- A ) link_previous -- A where -- B previous_event.incident_number = link_previous.incident_number and -- B previous_event.event_date = link_previous.previous_date -- B ) previous -- B where -- D incident.incident_number = this.incident_number and -- D this.incident_number = link_next.incident_number(+) and -- D this.event_date = link_next.current_date(+) and -- D this.incident_number = previous.incident_number(+) and -- D this.event_date = previous.current_date(+); -- D
A Procedural, Record-At-A-Time, PL/SQL Solution
Figure 6 shows a solution written in PL/SQL.
For the sake of keeping the example short, to make it easier to study, most of the comment lines and all of the exception handling have been removed. This would obviously not be acceptable for a real solution.
Even with the comments and exception handling removed, this solution is longer than the set-level solution. In itself, this is not necessarily a problem.
This procedure works by holding the previous, current and next events in memory and then generating the target INCIDENT_EVENT rows from this stored data.
We read an incident and clear the previous, current and next events in memory. Then we read events for this incident, in event_date order. Each time we read an event, we copy the current event to the previous, copy the next event to the current, and copy the new event into next. We can then write an INCIDENT_EVENT record.
We do not write an INCIDENT_EVENT record after reading the first event because we then have something in next, but nothing in current or previous.
On reading the second event we have nothing in previous, and something in both current and next, so we can write the first INCIDENT_EVENT record for the incident.
For the third and subsequent events we have data in previous, current and next.
Finally, after reading all the events for an incident, we still have to write the last INCIDENT_EVENT record, which has data for previous and current, but not next.
The “logically current” and “logically previous” comments in the section which writes out the last event for each incident, show that we have optimised this section by not copying current to previous and next to current.
Figure 6: A Procedural, Record-At-A-Time Approach
create or replace procedure pop_incident_event as -- cursor c_incident is select incident_number, priority from incident; -- cursor c_event (cp_incident_number incident.incident_number%type) is select event_date, status from event where incident_number = cp_incident_number order by event_date; -- r_current c_event%rowtype; r_previous c_event%rowtype; r_next c_event%rowtype; v_incident_number incident.incident_number%type; v_priority incident.priority%type; -- begin -- for ci in c_incident loop -- r_current.event_date := null; r_current.status := null; r_previous.event_date := null; r_previous.status := null; r_next.event_date := null; r_next.status := null; v_incident_number := ci.incident_number; v_priority := ci.priority; -- for ce in c_event(v_incident_number) loop -- r_previous.event_date := r_current.event_date; r_previous.status := r_current.status; r_current.event_date := r_next.event_date; r_current.status := r_next.status; r_next.event_date := ce.event_date; r_next.status := ce.status; -- if r_current.event_date is not null then insert into incident_event ( incident_number, priority, start_date, end_date, previous_status, current_status ) values ( v_incident_number, v_priority, r_current.event_date, r_next.event_date, r_previous.status, r_current.status ); end if; end loop; -- event -- if r_next.event_date is not null then -- logically current.event_date insert into incident_event ( incident_number, priority, start_date, end_date, previous_status, current_status ) values ( v_incident_number, v_priority, r_next.event_date, -- logically current null, r_current.status, -- logically previous r_next.status -- logically current ); end if; end loop; -- incident end;
Which Method is Better?
The important questions are:
- which method performs best
- which method is easier to understand
- which method is easier to maintain
Performance
In terms of performance, there is no contest; the SQL version is three times faster (see Figure 7. Also, to get the PL/SQL version to run in any acceptable time it requires an index on EVENT; the SQL version requires no indexes (it uses HASH_JOINS if the tables are analyzed).
Ease of Understanding
I believe that the question of which method is easier to understand depends on training and experience. If you have been brought up to think in procedural algorithms (as I was in my early career) then the procedural approach makes more sense. However, I find now, having spent many years thinking in set-level terms when working with relational databases, that the set manipulation concepts become easier to grasp and eventually are second nature.
Ease of maintenance
Ease of maintenance depends largely on ease of understanding. There are some other factors, though. Maintenance consists of functionality changes and performance tuning. SQL can be tuned by collecting statistics on the underlying tables (using Oracle’s ANALYZE TABLE … command), creating indexes and adding optimiser hints. None of these require code to be restructured. Tuning PL/SQL usually does require code to be restructured and is therefore more costly.
Summary
For bulk transformation functions, non-procedural SQL usually runs faster than procedural PL/SQL code.
Non-procedural solutions are usually easier to write, contain less code and are less prone to errors
When using SQL, all join, restriction, aggregation and navigation logic is performed by the Oracle server, which is highly optimised for these functions and nearly always performs them more efficiently than a PL/SQL program can.