Data warehousing with Data Transformation Services
(LeRoy Tuttle)
SQL Server 2000 has all of the tools necessary to build a data warehouse and
the most powerful of these is the DTS (Data Transformation Services). The
DTS exists to pull data from the transactional systems and put it into the
data warehouse...or rather into the staging area. And that qualification
sums up neatly what this talk was all about.
Creating maintainable data warehouses is a
complex process and there are many pitfalls for the unwary. This talk was
more about good practices in the ETL (Extract Transform Load) process than a
description of the DTS itself.
For example, always use a staging area, don't
import data directly into the data warehouse. The staging area (which can be
held in memory, on disk, or even on a separate machine) is used for
consolidating data as it arrives - the staging area should be used as an
area to perform all of the data transformations and aggregations. The schema
is typically a mirror of that in the warehouse.
Slowly changing dimensions need to be handled
carefully. There are two main ways of dealing with changing dimensions:
-
Type 1 - Overwrite the previous values. This is simple, but the history is
lost.
-
Type 2 - Create a new row. This is more complex, and needs a
current_Indicator to point to the new row, but keeps the history.
Identifying type 2 changes can be very slow because it can involve a
pair-wise comparison of every type 2 column on every row. So instead use the
checksum feature in SQL 2000. Compute a checksum value based on all columns
for each row as it comes into the staging area. Then any row with a changed
value can be identified by its altered checksum value. For example:
INSERT INTO
STAGING_DIM_PRODUCT_TYPE2_CHANGES
SELECT * FROM
STAGING_DIM_PRODUCT STAGE, DIM_PRODUCT DIM
WHERE STAGE.RECORD_EXISTS = 'Y'
AND STAGE.PRODUCT_ID = DIM.PRODUCT_ID
AND DIM.CHECKSUM_VALUE <>
CHECKSUM (STAGE.CATEGORY,STAGE.UNIT_PRICE,STAGE.QUANTITY)
DTS logging isn't on by default, but it is seriously useful and a very
common warehousing requirement. It will show you all the detail you are
likely to need (total number of error rows encountered, current
transformation phase, and so on). This is all covered in books on line.
Other tips included:
-
Never, ever, put a pop-up box in a transformation. When it pops up it will
wait for you to click on 'OK'...and wait and wait...
-
Always use redundant connections to a data source - if you have three
transformation processes, make three separate connections to the same data
source. Otherwise the processes will serialise.
-
DTS data pump: this can process row by row or it can used for set
operations. Bear in mind that set operations are much faster.
The take home message was that the DTS doesn't just transform data, it can
(and should be) used to manage the entire ETL workflow.
Mark Whitehorn
|
|