Click to return Home

  Newsreel
  Products & Services
  Web Watch
  Software Update
  Resource Directory
  Events Diary
  Articles
  The Magazine
  Subscribe
  Contact Us

  Advertisers

TechEd 2000

  TechEd reports
  Exhibitors
  Fun Stuff

 

  Search DNJ Online


 

Data warehousing with Data Transformation Services
(LeRoy Tuttle)

SQL Server 2000 has all of the tools necessary to build a data warehouse and the most powerful of these is the DTS (Data Transformation Services). The DTS exists to pull data from the transactional systems and put it into the data warehouse...or rather into the staging area. And that qualification sums up neatly what this talk was all about.
      Creating maintainable data warehouses is a complex process and there are many pitfalls for the unwary. This talk was more about good practices in the ETL (Extract Transform Load) process than a description of the DTS itself.
      For example, always use a staging area, don't import data directly into the data warehouse. The staging area (which can be held in memory, on disk, or even on a separate machine) is used for consolidating data as it arrives - the staging area should be used as an area to perform all of the data transformations and aggregations. The schema is typically a mirror of that in the warehouse.
      Slowly changing dimensions need to be handled carefully. There are two main ways of dealing with changing dimensions:

  • Type 1 - Overwrite the previous values. This is simple, but the history is lost.

  • Type 2 - Create a new row. This is more complex, and needs a current_Indicator to point to the new row, but keeps the history.

Identifying type 2 changes can be very slow because it can involve a pair-wise comparison of every type 2 column on every row. So instead use the checksum feature in SQL 2000. Compute a checksum value based on all columns for each row as it comes into the staging area. Then any row with a changed value can be identified by its altered checksum value. For example:

INSERT INTO STAGING_DIM_PRODUCT_TYPE2_CHANGES
SELECT * FROM
STAGING_DIM_PRODUCT STAGE, DIM_PRODUCT DIM
WHERE STAGE.RECORD_EXISTS = 'Y'
AND STAGE.PRODUCT_ID = DIM.PRODUCT_ID
AND DIM.CHECKSUM_VALUE <>
CHECKSUM (STAGE.CATEGORY,STAGE.UNIT_PRICE,STAGE.QUANTITY)



DTS logging isn't on by default, but it is seriously useful and a very common warehousing requirement. It will show you all the detail you are likely to need (total number of error rows encountered, current transformation phase, and so on). This is all covered in books on line.
      Other tips included:

  • Never, ever, put a pop-up box in a transformation. When it pops up it will wait for you to click on 'OK'...and wait and wait...

  • Always use redundant connections to a data source - if you have three transformation processes, make three separate connections to the same data source. Otherwise the processes will serialise.

  • DTS data pump: this can process row by row or it can used for set operations. Bear in mind that set operations are much faster.

The take home message was that the DTS doesn't just transform data, it can (and should be) used to manage the entire ETL workflow.

Mark Whitehorn
 

 











Related Links