Data Transformation – Procedural & Non Procedural Solutions

January 1, 2004 datamgmt No Comments

This paper looks at a somewhat awkward data transformation, and at solutions written in SQL and in a procedural language. It describes some techniques which can be used to develop the solution in both languages. It also compares the solutions in terms of ease of development, performance and cost of maintenance.

Are we building transformations in the most effective way?

When working with other ETL developers, using the Oracle toolset to develop transformations, I often find that there is an assumption that the solution will be written in PL/SQL. PL/SQL is a very powerful language, and I have not found a transformation problem yet which could not be solved using PL/SQL. However, I believe that SQL – without the procedural framework provided by PL/SQL – is usually a better tool to use.

This paper looks at a somewhat awkward transformation, and at solutions written in SQL and PL/SQL. It describes some techniques which can be used to develop the solution in both languages. It also compares the solutions in terms of ease of understanding (so, how easy they are to develop and maintain) and performance. The underlying difference between the two approaches is that SQL encourages us to think in terms of sets of data, and PL/SQL is more amenable to record-at-a-time processing.

Figure 1: A typical Transformation

An example transformation for comparing two approaches

Figure 1 shows a typical transformation which an ETL developer might have to implement. This looks quite simple, but the business rules defining the transformation make the job more complicated.

Figures 2, 3 and 4 show some sample data for one incident, in the source tables INCIDENT and EVENT and in the resulting target table INCIDENT_EVENT.

Figure 2: Source Table INCIDENT sample data

Figure 3: Source Table EVENT sample data

Figure 4: The Required Target Table INCIDENT_EVENT

Source and Target Incident-Event Models

There are a series of events for each incident. The join between the source INCIDENT and EVENT tables is simply on incident_number. Then to produce the target INCIDENT_EVENT it is necessary to think of the events being ordered by event_date within each incident (as shown in Figure 3). In the description which follows, the “current” event is whichever event we are currently looking at; the “previous” event is the event for the same incident which immediately precedes it in time, and the “next” event is the one which immediately follows it in time.

To create a row in INCIDENT_EVENT, we take INCIDENT_NUMBER and PRIORITY from INCIDENT. The START_DATE is the EVENT_DATE of the “current” event. END_DATE is the EVENT_DATE of the “next” event. PREVIOUS_STATUS is the STATUS of the “previous” event and CURRENT_STATUS is the STATUS of the current event.

The first event for an INCIDENT has no “previous” event, so PREVIOUS_STATUS is null for this row in INCIDENT_EVENT.

The last event for an INCIDENT has no “next” event, so END_DATE is null for this row in INCIDENT_EVENT.

Procedural or non-procedural?

The description of this transformation in terms of “an ordered list of events for each incident” might lead us to a procedural, record-at-a-time, solution. While such a solution will work, I would argue that a non-procedural, set-level solution is better.

To explore this argument, let’s take a look at both solutions.

The Non-Procedural, Set-Level, SQL Solution

Figure 5 shows a non-procedural, set-level solution, written in Oracle SQL. This looks quite complicated, so let’s break it down into more manageable pieces.

This solution makes extensive use of Oracle “inline views”, in which a sub-query takes the place of a table in the “from” clause. Other databases also support this functionality; in SQL Server, for example, they are called “derived tables”. Inline views help to partition the rules which are implemented by a complex query. Inline views can also provide solutions with good performance, especially for the bulk operations which are common in data transformation.

The section labelled with the comment ‘– A’ in Figure 5 finds the previous event for the current incident and the current (“this”) event. The previous event is the one with the maximum date which is less than the date of the current event. This sub-query returns an “inline view” called “link_previous”. If the current event is the earliest for the current incident the link_previous inline view will return no rows.

The sections labelled ‘– B’ in Figure 5 show the “previous” inline view, which gets the status of the previous event which we found in the “link_previous” inline view. We need to use nested views because the status is not one of the “group by” columns in the link_previous view.

The section labelled ‘– C’ in Figure 5 finds the next event for the current incident and the current (“this”) event. The next event is the one with the minimum date which is greater than the date of the current event. This sub-query returns an “inline view” called “link_next”. If the current event is the latest for the current incident the link_next inline view will return no rows.

The section labelled ‘– D’ in Figure 5 joins INCIDENT, the current event (EVENT), the previous event (previous) and the next event (link_next). There are outer joins to “previous” and “link_next” so that we get the first event (which does not have a previous event), and the last event (which does not have a next event) for each incident.

Figure 5: A Non-Procedural, Set-Level Solution

insert into incident_event 
(
    incident_number,
    priority,
    start_date,
    end_date,
    previous_status,
    current_status
)
select
    incident.incident_number   incident_number,
    incident.priority          priority,
    this.event_date            start_date,
    link_next.next_date        end_date,
    previous.previous_status   previous_status,
    this.status                current_status
from
    incident incident,
    event    this,
    (                                                  -- C
        select                                         -- C
            a.incident_number,                         -- C
            a.event_date       current_date,           -- C
            min(b.event_date)  next_date               -- C
        from                                           -- C
            event a,                                   -- C
            event b                                    -- C
        where                                          -- C
            a.incident_number = b.incident_number and  -- C
            a.event_date < b.event_date                -- C
        group by                                       -- C
            a.incident_number,                         -- C
            a.event_date                               -- C
    ) link_next,                                       -- C
    (
        select                                                                 -- B
            previous_event.incident_number,                                    -- B
            link_previous.current_date  current_date,                          -- B
            previous_event.status       previous_status                        -- B
        from                                                                   -- B
            event previous_event,                                              -- B
            (                                                  -- A
                select                                         -- A
                    a.incident_number,                         -- A
                    a.event_date       current_date,           -- A
                    max(b.event_date)  previous_date           -- A
                from                                           -- A
                    event a,                                   -- A
                    event b                                    -- A
                where                                          -- A
                    a.incident_number = b.incident_number and  -- A
                    a.event_date > b.event_date                -- A
                group by                                       -- A
                    a.incident_number,                         -- A
                    a.event_date                               -- A
            ) link_previous                                    -- A
        where                                                                   -- B
            previous_event.incident_number = link_previous.incident_number and  -- B
            previous_event.event_date      = link_previous.previous_date        -- B
    ) previous                                                                  -- B
where                                                                   -- D
    incident.incident_number   = this.incident_number and               -- D
    this.incident_number       = link_next.incident_number(+) and       -- D
    this.event_date            = link_next.current_date(+) and          -- D
    this.incident_number       = previous.incident_number(+) and        -- D
    this.event_date            = previous.current_date(+);              -- D

A Procedural, Record-At-A-Time, PL/SQL Solution

Figure 6 shows a solution written in PL/SQL.

For the sake of keeping the example short, to make it easier to study, most of the comment lines and all of the exception handling have been removed. This would obviously not be acceptable for a real solution.

Even with the comments and exception handling removed, this solution is longer than the set-level solution. In itself, this is not necessarily a problem.

This procedure works by holding the previous, current and next events in memory and then generating the target INCIDENT_EVENT rows from this stored data.

We read an incident and clear the previous, current and next events in memory. Then we read events for this incident, in event_date order. Each time we read an event, we copy the current event to the previous, copy the next event to the current, and copy the new event into next. We can then write an INCIDENT_EVENT record.

We do not write an INCIDENT_EVENT record after reading the first event because we then have something in next, but nothing in current or previous.

On reading the second event we have nothing in previous, and something in both current and next, so we can write the first INCIDENT_EVENT record for the incident.

For the third and subsequent events we have data in previous, current and next.

Finally, after reading all the events for an incident, we still have to write the last INCIDENT_EVENT record, which has data for previous and current, but not next.

The “logically current” and “logically previous” comments in the section which writes out the last event for each incident, show that we have optimised this section by not copying current to previous and next to current.

Figure 6: A Procedural, Record-At-A-Time Approach

create or replace procedure pop_incident_event as
--
cursor c_incident is
select
    incident_number,
    priority
from
    incident;
--
cursor c_event (cp_incident_number incident.incident_number%type) is
select
    event_date,
    status
from
    event
where
    incident_number = cp_incident_number
order by
    event_date;
--
r_current           c_event%rowtype;
r_previous          c_event%rowtype;
r_next              c_event%rowtype;
v_incident_number   incident.incident_number%type;
v_priority          incident.priority%type;
--
begin
--
    for ci in c_incident loop
--
        r_current.event_date := null;
        r_current.status := null;
        r_previous.event_date := null;
        r_previous.status := null;
        r_next.event_date := null;
        r_next.status := null;
        v_incident_number := ci.incident_number;
        v_priority := ci.priority;
--
        for ce in c_event(v_incident_number) loop
--
            r_previous.event_date := r_current.event_date;
            r_previous.status := r_current.status;
            r_current.event_date := r_next.event_date;
            r_current.status := r_next.status;
            r_next.event_date := ce.event_date;
            r_next.status := ce.status;
--
            if r_current.event_date is not null then
                insert into incident_event
                (
                    incident_number,
                    priority,
                    start_date,
                    end_date,
                    previous_status,
                    current_status
                ) 
                values
                ( 
                    v_incident_number,
                    v_priority,
                    r_current.event_date,
                    r_next.event_date,
                    r_previous.status,
                    r_current.status
                );
            end if;
        end loop;      -- event
--
        if r_next.event_date is not null then   -- logically current.event_date
            insert into incident_event
            (
                incident_number,
                priority,
                start_date,
                end_date,
                previous_status,
                current_status
            ) 
            values
            ( 
                v_incident_number,
                v_priority,
                r_next.event_date,              -- logically current
                null,
                r_current.status,               -- logically previous
                r_next.status                   -- logically current
            );
        end if;
    end loop;    -- incident
end;

Which Method is Better?

The important questions are:

which method performs best
which method is easier to understand
which method is easier to maintain

Performance

In terms of performance, there is no contest; the SQL version is three times faster (see Figure 7. Also, to get the PL/SQL version to run in any acceptable time it requires an index on EVENT; the SQL version requires no indexes (it uses HASH_JOINS if the tables are analyzed).

Figure 7: Performance

Ease of Understanding

I believe that the question of which method is easier to understand depends on training and experience. If you have been brought up to think in procedural algorithms (as I was in my early career) then the procedural approach makes more sense. However, I find now, having spent many years thinking in set-level terms when working with relational databases, that the set manipulation concepts become easier to grasp and eventually are second nature.

Ease of maintenance

Ease of maintenance depends largely on ease of understanding. There are some other factors, though. Maintenance consists of functionality changes and performance tuning. SQL can be tuned by collecting statistics on the underlying tables (using Oracle’s ANALYZE TABLE … command), creating indexes and adding optimiser hints. None of these require code to be restructured. Tuning PL/SQL usually does require code to be restructured and is therefore more costly.

Summary

For bulk transformation functions, non-procedural SQL usually runs faster than procedural PL/SQL code.

Non-procedural solutions are usually easier to write, contain less code and are less prone to errors

When using SQL, all join, restriction, aggregation and navigation logic is performed by the Oracle server, which is highly optimised for these functions and nearly always performs them more efficiently than a PL/SQL program can.

Data Management & Warehousing