Data Extraction, Transformation, and Loading Techniques

Introduction

During the ETL process, data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database. Many data warehouses also incorporate data from non-OLTP systems, such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading.

In its simplest form, ETL is the process of copying data from one database to another. This simplicity is rarely, if ever, found in data warehouse implementations; in reality, ETL is often a complex combination of process and technology that consumes a significant portion of the data warehouse development effort and requires the skills of business analysts, database designers, and application developers.

When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation. ETL systems vary from data warehouse to data warehouse and even between department data marts within a data warehouse. A monolithic application, regardless of whether it is implemented in Transact-SQL or a traditional programming language, does not provide the flexibility for change necessary in ETL systems. A mixture of tools and technologies should be used to develop applications that each perform a specific ETL task.

The ETL process is not a one-time event; new data is added to a data warehouse periodically. Typical periodicity may be monthly, weekly, daily, or even hourly, depending on the purpose of the data warehouse and the type of business it serves. Because ETL is an integral, ongoing, and recurring part of a data warehouse, ETL processes must be automated and operational procedures documented. ETL also changes and evolves as the data warehouse evolves, so ETL processes must be designed for ease of modification. A solid, well-designed, and documented ETL system is necessary for the success of a data warehouse project.

Data warehouses evolve to improve their service to the business and to adapt to changes in business processes and requirements. Business rules change as the business reacts to market influences—the data warehouse must respond in order to maintain its value as a tool for decision makers. The ETL implementation must adapt as the data warehouse evolves.

Microsoft® SQL Server™ 2000 enhances existing performance and capabilities and introduces new features that make ETL processes simpler to develop, deploy, and maintain, and faster to execute.

ETL Functional Elements

Regardless of how they are implemented, all ETL systems have a common purpose: they move data from one database to another. Generally, ETL systems move data from OLTP systems to a data warehouse, but they can also be used to move data from one data warehouse to another. An ETL system consists of four distinct functional elements:

• Extraction
• Transformation
• Loading
• Meta data

Extraction

The ETL extraction element is responsible for extracting data from the source system. During extraction, data may be removed from the source system or a copy made and the original data retained in the source system. It is common to move historical data that accumulates in an operational OLTP system to a data warehouse to maintain OLTP performance and efficiency. Legacy systems may require too much effort to implement such offload processes, so legacy data is often copied into the data warehouse, leaving the original data in place. Extracted data is loaded into the data warehouse staging area (a relational database usually separate from the data warehouse database) for manipulation by the remaining ETL processes.

Data extraction is generally performed within the source system itself, especially if it is a relational database to which extraction procedures can easily be added. It is also possible for the extraction logic to exist in the data warehouse staging area and query the source system for data using ODBC, OLE DB, or other APIs. For legacy systems, the most common method of data extraction is for the legacy system to produce text files, although many newer systems offer direct query APIs or accommodate access through ODBC or OLE DB.

Data extraction processes can be implemented using Transact-SQL stored procedures, Data Transformation Services (DTS) tasks, or custom applications developed in programming or scripting languages.
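
For example, a minimal Transact-SQL extraction procedure might copy rows added since the last extraction run from a source table into a staging table. This is only a sketch: the database, table, and column names (SourceDB.dbo.Orders, Staging.dbo.OrdersStage, the ExtractControl high-water-mark table) are hypothetical, and it assumes the source database is on the same server or reachable through a linked server.

    -- Hypothetical sketch: extract rows added since the last extraction run
    -- into a staging table, using a control table to track the high-water mark.
    CREATE PROCEDURE dbo.usp_ExtractOrders
    AS
    BEGIN
        DECLARE @LastExtract datetime

        SELECT @LastExtract = LastExtractDate
        FROM   Staging.dbo.ExtractControl
        WHERE  SourceName = 'Orders'

        INSERT INTO Staging.dbo.OrdersStage (OrderID, CustomerID, OrderDate, Amount)
        SELECT OrderID, CustomerID, OrderDate, Amount
        FROM   SourceDB.dbo.Orders
        WHERE  OrderDate > @LastExtract

        UPDATE Staging.dbo.ExtractControl
        SET    LastExtractDate = GETDATE()
        WHERE  SourceName = 'Orders'
    END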

Transformation

The ETL transformation element is responsible for data validation, data accuracy, data type conversion, and business rule application. It is the most complicated of the ETL elements. It may appear to be more efficient to perform some transformations as the data is being extracted (inline transformation); however, an ETL system that uses inline transformations during extraction is less robust and flexible than one that confines transformations to the transformation element. Transformations performed in the OLTP system impose a performance burden on the OLTP database. They also split the transformation logic between two ETL elements and add maintenance complexity when the ETL logic changes.

Tools used in the transformation element vary. Some data validation and data accuracy checking can be accomplished with straightforward Transact-SQL code. More complicated transformations can be implemented using DTS packages. The application of complex business rules often requires the development of sophisticated custom applications in various programming languages. You can use DTS packages to encapsulate multi-step transformations into a single task.

Listed below are some basic examples that illustrate the types of transformations performed by this element:

Data Validation

Check that all rows in the fact table match rows in dimension tables to enforce data integrity.
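
A referential check of this kind can be written as a short Transact-SQL query; the fact and dimension table names below (SalesFact, CustomerDim) are illustrative only.

    -- Hypothetical sketch: report staged fact rows whose customer key has no
    -- matching row in the customer dimension table.
    SELECT f.CustomerKey, COUNT(*) AS OrphanRows
    FROM   Staging.dbo.SalesFact AS f
    LEFT JOIN Warehouse.dbo.CustomerDim AS d
           ON f.CustomerKey = d.CustomerKey
    WHERE  d.CustomerKey IS NULL
    GROUP BY f.CustomerKey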

Data Accuracy

Ensure that fields contain appropriate values, such as only "off" or "on" in a status field.
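
A check like this can also be expressed as a simple query over the staging table; DeviceStage and StatusFlag are hypothetical names used only for illustration.

    -- Hypothetical sketch: list staged rows whose status field holds anything
    -- other than the two permitted values (NULL is also treated as invalid).
    SELECT *
    FROM   Staging.dbo.DeviceStage
    WHERE  StatusFlag NOT IN ('off', 'on')
       OR  StatusFlag IS NULL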

Data Type Conversion

Ensure that all values for a specified field are stored the same way in the data warehouse regardless of how they were stored in the source system. For example, if one source system stores "off" or "on" in its status field and another source system stores "0" or "1" in its status field, then a data type conversion transformation converts the content of one or both of the fields to a specified common value such as "off" or "on".
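
One way to standardize the two encodings during transformation is a CASE expression applied to the staged rows; the table and column names in this sketch are assumptions.

    -- Hypothetical sketch: convert the numeric encoding used by one source
    -- system to the common 'off'/'on' values used in the data warehouse.
    UPDATE Staging.dbo.DeviceStage
    SET    StatusFlag = CASE StatusFlag
                            WHEN '0' THEN 'off'
                            WHEN '1' THEN 'on'
                            ELSE StatusFlag
                        END
    WHERE  SourceSystem = 'SystemB'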

Business Rule Application

Ensure that the rules of the business are enforced on the data stored in the warehouse. For example, check that all customer records contain values for both FirstName and LastName fields.
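
Such a rule can be enforced with a query that flags violating rows for review (or rejects them) before they are loaded; CustomerStage is a hypothetical staging table.

    -- Hypothetical sketch: flag customer rows that violate the rule that
    -- both FirstName and LastName must be present.
    SELECT CustomerID, FirstName, LastName
    FROM   Staging.dbo.CustomerStage
    WHERE  FirstName IS NULL OR LTRIM(FirstName) = ''
       OR  LastName  IS NULL OR LTRIM(LastName)  = ''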

Loading

The ETL loading element is responsible for loading transformed data into the data warehouse database. Data warehouses are usually updated periodically rather than continuously, and large numbers of records are often loaded to multiple tables in a single data load. The data warehouse is often taken offline during update operations so that data can be loaded faster and SQL Server 2000 Analysis Services can update OLAP cubes to incorporate the new data. BULK INSERT, bcp, and the Bulk Copy API are the best tools for data loading operations. The design of the loading element should focus on efficiency and performance to minimize the data warehouse offline time. For more information and details about performance tuning, see Chapter 20, "RDBMS Performance Tuning Guide for Data Warehousing."
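
As a simple illustration, a periodic load of a transformed text file into a fact table might use BULK INSERT; the file path, delimiters, batch size, and table name shown here are assumptions, not recommended values.

    -- Hypothetical sketch: bulk load a transformed, tab-delimited text file
    -- into the sales fact table, taking a table lock and loading in batches.
    BULK INSERT Warehouse.dbo.SalesFact
    FROM 'D:\ETL\Output\salesfact_20001031.txt'
    WITH (
        FIELDTERMINATOR = '\t',
        ROWTERMINATOR   = '\n',
        TABLOCK,
        BATCHSIZE = 100000
    )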

Meta Data

The ETL meta data functional element is responsible for maintaining information (meta data) about the movement and transformation of data, and the operation of the data warehouse. It also documents the data mappings used during the transformations. Meta data logging provides possibilities for automated administration, trend prediction, and code reuse.
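
As one simple example of meta data logging, transformation statistics can be written to a meta data table by each ETL step. The table definition and column names below are assumptions used for illustration, not a prescribed schema.

    -- Hypothetical sketch: a meta data table that records basic statistics
    -- for each execution of a transformation step.
    CREATE TABLE Metadata.dbo.TransformationLog (
        LogID          int IDENTITY(1,1) PRIMARY KEY,
        StepName       varchar(100) NOT NULL,
        StartTime      datetime     NOT NULL,
        EndTime        datetime     NULL,
        RowsProcessed  int          NULL,
        Succeeded      bit          NULL
    )

    -- Each ETL step inserts one row when it starts and updates it on completion.
    INSERT INTO Metadata.dbo.TransformationLog (StepName, StartTime)
    VALUES ('ConvertStatusFlags', GETDATE())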

Examples of data warehouse meta data that can be recorded and used to analyze the activity and performance of a data warehouse include:

• Data Lineage, such as the time that a particular set of records was loaded into the data warehouse.
• Schema Changes, such as changes to table definitions.
• Data Type Usage, such as identifying all tables that use the "Birthdate" user-defined data type.
• Transformation Statistics, such as the execution time of each stage of a transformation, the number of rows processed by the transformation, the last time the transformation was executed, and so on.
• DTS Package Versioning, which can be used to view, branch, or retrieve any historical version of a particular DTS package.
• Data Warehouse Usage Statistics, such as query times for reports.

ETL Design Considerations

Regardless of their implementation, a number of design considerations are common to all ETL systems:

Modularity

ETL systems should contain modular elements that perform discrete tasks. Modularity encourages reuse and makes the system easy to modify when implementing changes in response to business and data warehouse changes. Monolithic systems should be avoided.

Consistency

ETL systems should guarantee the consistency of data when it is loaded into the data warehouse. An entire data load should be treated as a single logical transaction: either the entire load succeeds or the entire load is rolled back. In some systems the load is a single physical transaction; in others it is a series of transactions. Regardless of the physical implementation, the load must behave as a single logical unit.
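
When the load is small enough to run as one physical transaction, this behavior can be sketched directly in Transact-SQL. Because SQL Server 2000 has no structured error handling, @@ERROR is checked after each statement; the table names in this sketch are hypothetical.

    -- Hypothetical sketch: load dimension and fact rows as one transaction,
    -- rolling everything back if any statement fails.
    BEGIN TRANSACTION

    INSERT INTO Warehouse.dbo.CustomerDim (CustomerKey, FirstName, LastName)
    SELECT CustomerKey, FirstName, LastName
    FROM   Staging.dbo.CustomerStage

    IF @@ERROR <> 0
    BEGIN
        ROLLBACK TRANSACTION
        RETURN
    END

    INSERT INTO Warehouse.dbo.SalesFact (CustomerKey, DateKey, Amount)
    SELECT CustomerKey, DateKey, Amount
    FROM   Staging.dbo.SalesFactStage

    IF @@ERROR <> 0
    BEGIN
        ROLLBACK TRANSACTION
        RETURN
    END

    COMMIT TRANSACTION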

Flexibility

ETL systems should be developed to meet the needs of the data warehouse and to accommodate the source data environments. It may be appropriate to accomplish some transformations in text files and some on the source data system; others may require the development of custom applications. A variety of technologies and techniques can be applied, using the tool most appropriate to the individual task of each ETL functional element.

Speed

ETL systems should be as fast as possible. Ultimately, the time window available for ETL processing is governed by data warehouse and source system schedules. Some data warehouse elements may have a huge processing window (days), while others may have a very limited processing window (hours). Regardless of the time available, it is important that the ETL system execute as rapidly as possible.

Heterogeneity

ETL systems should be able to work with a wide variety of data in different formats. An ETL system that only works with a single type of source data is useless.

Meta Data Management

ETL systems are arguably the single most important source of meta data about both the data in the data warehouse and the data in the source systems. In addition, the ETL process itself generates useful meta data that should be retained and analyzed regularly. Meta data is discussed in greater detail later in this chapter.

ETL Architectures

Before discussing the physical implementation of ETL systems, it is important to understand the different ETL architectures and how they relate to each other. Essentially, ETL systems can be classified into two architectures: the homogenous architecture and the heterogeneous architecture.

Homogenous Architecture

A homogenous architecture for an ETL system is one that involves only a single source system and a single target system. Data flows from the single source of data through the ETL processes and is loaded into the data warehouse, as shown in the following diagram.

Most homogenous ETL architectures have the following characteristics:

• Single data source: Data is extracted from a single source system, such as an OLTP system.
• Rapid development: The development effort required to extract the data is straightforward because there is only one data format for each record type.
• Light data transformation: No data transformations are required to achieve consistency among disparate data formats, and the incoming data is often in a format usable in the data warehouse. Transformations in this architecture typically involve replacing NULLs and other formatting transformations.
• Light structural transformation: Because the data comes from a single source, the amount of structural changes such as table alteration is also very light. The structural changes typically involve denormalization efforts to meet data warehouse schema requirements.
• Simple research requirements: The research efforts to locate data are generally simple: if the data is in the source system, it can be used. If it is not, it cannot.

The homogeneous ETL architecture is generally applicable to data marts, especially those focused on a single subject matter.

Heterogeneous Architecture

A heterogeneous architecture for an ETL system is one that extracts data from multiple sources, as shown in the following diagram. The complexity of this architecture arises from the fact that data from more than one source must be merged, rather than from the fact that data may be formatted differently in the different sources. However, significantly different storage formats and database schemas do provide additional complications.

Most heterogeneous ETL architectures have the following characteristics:

• Multiple data sources.
• More complex development: The development effort required to extract the data is increased because there are multiple source data formats for each record type.
• Significant data transformation: Data transformations are required to achieve consistency among disparate data formats, and the incoming data is often not in a format usable in the data warehouse. Transformations in this architecture typically involve replacing NULLs, additional data formatting, data conversions, lookups, computations, and referential integrity verification. Precomputed calculations may require combining data from multiple sources, or data that has multiple degrees of granularity, such as allocating shipping costs to individual line items.
• Significant structural transformation: Because the data comes from multiple sources, the amount of structural changes, such as table alteration, is significant.
• Substantial research requirements to identify and match data elements.

Heterogeneous ETL architectures are found more often in data warehouses than in data marts.

ETL Development

ETL development consists of two general phases: identifying and mapping data, and developing functional element implementations. Both phases should be carefully documented and stored in a central, easily accessible location, preferably in electronic form.

Identify and Map Data

This phase of the development process identifies sources of data elements, the targets for those data elements in the data warehouse, and the transformations that must be applied to each data element as it is migrated from its source to its destination. High-level data maps should be developed during the requirements gathering and data modeling phases of the data warehouse project. During the ETL system design and development process, these high-level data maps are extended to thoroughly specify system details.

Identify Source Data

For some systems, identifying the source data may be as simple as identifying the server where the data is stored in an OLTP database and the storage type (SQL Server database, Microsoft Excel spreadsheet, or text file, among others). In other systems, identifying the source may mean preparing a detailed definition of the meaning of the data, such as a business rule, a definition of the data itself, such as decoding rules (O = On, for example), or even detailed documentation of a source system for which the system documentation has been lost or is not current.

Identify Target Data

Each data element is destined for a target in the data warehouse. A target for a data element may be an attribute in a dimension table, a numeric measure in a fact table, or a summarized total in an aggregation table. There may not be a one-to-one correspondence between a source data element and a data element in the data warehouse because the destination system may not contain the data at the same granularity as the source system. For example, a retail client may decide to roll data up to the SKU level by day rather than track individual line item data. The level of item detail that is stored in the fact table of the data warehouse is called the grain of the data. If the grain of the target does not match the grain of the source, the data must be summarized as it moves from the source to the target.
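
Summarizing to the target grain is typically a GROUP BY over the staged detail rows. In this sketch, the line-item staging table, the daily fact table, and their columns are illustrative assumptions.

    -- Hypothetical sketch: summarize individual line items to one row per
    -- SKU per day to match the grain of the fact table.
    INSERT INTO Warehouse.dbo.DailySalesFact (SKU, SalesDate, UnitsSold, SalesAmount)
    SELECT SKU,
           CONVERT(char(8), SaleDateTime, 112) AS SalesDate,
           SUM(Quantity),
           SUM(ExtendedPrice)
    FROM   Staging.dbo.LineItemStage
    GROUP BY SKU, CONVERT(char(8), SaleDateTime, 112)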

Map Source Data to Target Data

A data map defines the source fields of the data, the destination fields in the data warehouse, and any data modifications that must be applied to transform the data into the desired format for the data warehouse. Some transformations require aggregating the source data to a coarser granularity, such as summarizing individual item sales into daily sales by SKU. Other transformations involve altering the source data itself as it moves from the source to the target. Some transformations decode data into human-readable form, such as replacing "1" with "on" and "0" with "off" in a status field. If two source systems encode data destined for the same target differently (for example, a second source system uses Yes and No for status), a separate transformation must be defined for each source system. Transformations must be documented and maintained in the data maps. The relationship between the source and target systems is maintained in a map that is referenced to execute the transformation of the data before it is loaded into the data warehouse.
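
One way to keep per-source decodings in the data map itself is a small lookup table that the transformation joins against, so that adding a new source system requires only new mapping rows. The StatusCodeMap table and the staging table below are assumptions used for illustration.

    -- Hypothetical sketch: a decoding table derived from the data map, with
    -- one row per source system and source value.
    CREATE TABLE Staging.dbo.StatusCodeMap (
        SourceSystem varchar(30) NOT NULL,
        SourceValue  varchar(10) NOT NULL,
        TargetValue  varchar(10) NOT NULL,
        CONSTRAINT PK_StatusCodeMap PRIMARY KEY (SourceSystem, SourceValue)
    )

    -- Apply the mapping to the staged rows for every source system at once.
    UPDATE s
    SET    s.StatusFlag = m.TargetValue
    FROM   Staging.dbo.DeviceStage AS s
    JOIN   Staging.dbo.StatusCodeMap AS m
           ON m.SourceSystem = s.SourceSystem
          AND m.SourceValue  = s.StatusFlag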

Develop Functional Elements

The design and implementation of the four ETL functional elements (extraction, transformation, loading, and meta data logging) vary from system to system, and there will often be multiple versions of each functional element.