The ETL Process
Of all the effort expended before the data warehouse is ready to launch, the average warehouse development team will spend less than 25 percent of its time on designing the warehouse, configuring hardware, installing software products, and the like. Easily 75 percent of the work involved in setting up a data warehouse is devoted to locating, extracting, filtering, scrubbing, and finally loading data. Of these tasks, loading is by far the easiest. Resolving the various problems that arise when we bring together data from multiple source systems is invariably more difficult than we anticipate.
The ETL (Extract, Transform, and Load) process is the foundation of the data warehouse.
A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format so that it can be loaded into the data warehouse and used to support end-user decision making.
The ETL system adds significant value to the data. It is far more than just a pipeline for moving data from the source systems to the target data warehouse. Specifically, the ETL system:
· Extracts source data from multiple disparate systems
· Loads the extracted data into a staging area
· Removes mistakes and corrects missing data
· Adjusts and integrates data from multiple sources so they can be used together
· Builds normalized tables for any data that is not already normalized
· Extracts from the staging area into denormalized tables
· Loads the data into the dimension and fact tables of the data warehouse for use by business users
An ETL system is both simple and complex. Everyone understands the basic mission of the ETL process: extract the data, transform it, and load it into the target database. The complexity, however, comes when we try to validate, cleanse, and integrate the data from the various disparate source systems. We need to understand the business rules behind the data. We need to understand the different source applications, their business purpose, and their distinct underlying software and hardware platforms. There are a thousand little subtasks in the ETL process. All of these tasks are considered back-room operations. The ETL process is the part of the iceberg that you do not see.
A mature ETL environment will be a mixture of flat files, independent relational tables called staging tables, full-blown normalized table models, metadata and rules, and perhaps other data structures such as XML files. In all these cases, the purpose of the ETL process is to build data structures designed for ease of use by the business user. And of course, the recommendation is to build dimensional models consisting of fact and dimension tables to be loaded into the data warehouse, or a data warehouse delivered as an OLAP cube.
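To make these target structures concrete, the following is a minimal sketch of a staging table feeding a dimension table and a fact table. All table and column names (stg_sales, dim_customer, fact_sales, and so on) are hypothetical and stand in for the names of your own dimensional design.

-- Hypothetical staging table: a raw copy of extracted source rows,
-- kept as text until the data has been validated and converted
CREATE TABLE stg_sales (
    source_order_id   VARCHAR(20),
    customer_code     VARCHAR(20),
    order_date_raw    VARCHAR(10),
    order_amount_raw  VARCHAR(15)
);

-- Hypothetical dimension table keyed by a surrogate key
CREATE TABLE dim_customer (
    customer_key   INTEGER PRIMARY KEY,
    customer_code  VARCHAR(20),
    customer_name  VARCHAR(60)
);

-- Hypothetical fact table referencing the dimension
CREATE TABLE fact_sales (
    customer_key  INTEGER REFERENCES dim_customer (customer_key),
    date_key      INTEGER,
    order_amount  DECIMAL(12,2)
);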
A Diagram of the ETL Process
ETL Tools
Most data warehouse projects will require the use of specialized tools for extracting, transforming, validating, integrating, moving and loading the data from the various source applications to the data warehouse. These tools are referred to generically as ETL tools. The responsibilities of ETL tools go beyond extraction, transformation and load as their acronym implies. These tools also have the ability to validate the data and perform integration.
Manual ETL build - Frequently, initial implementations of data warehouse projects do not have the funding for such tools. Admittedly, these tools tend to be expensive, and organizations often decide to save money by not purchasing such software. These organizations opt to perform the ETL process manually through the coding of many data extract scripts, transformation logic and validation programs, and the eventual transfer and load scripts. It has been shown that the opportunity cost of such a decision (using in-house programmers and DBAs, or outsourcing to contractors) is greater than the cost of purchasing an ETL tool.
Examples of ETL tools include Informatica PowerCenter, Oracle Warehouse Builder and Oracle Data Integrator (ODI), SAP Data Integrator, IBM InfoSphere Information Server, Microsoft SQL Server Integration Services (SSIS), SAS Data Integration Studio, and others.
ETL tools, when properly and effectively used, can greatly simplify the tasks necessary to create a data warehouse. ETL tools provide the following benefits:
· Enhanced ease of use, commonly with an advanced graphical user interface for design
· Enforcement of consistency between iterative phases of design and build
· Change control mechanisms for implementing schema design changes
· Consistent user interface for extracting data, even from multiple environments
· Support for coordinated usage of multiple members of the data warehouse build team
· Documentation of all warehouse metadata
· Metadata coordination with reporting, query, and analysis tools
· Generation of scripts to perform tasks such as creation of tables, data validation, and load
· Scheduling, execution, and control of the load and periodic refresh processing, so that errors in one step can be addressed before subsequent steps are affected
· Support for data extraction methods such as full extracts as well as incremental extracts
· Extensible support for performing simple as well as more complex transformation of data
Unfortunately, no single tool is best at all of these responsibilities. You need to investigate, research, and evaluate several hands-on before deciding on the one that properly supports your needs.
Although there is an advantage to choosing an ETL tool over manually performing the extract, transform, and load process, if you are a DBA you will still need to be aware of the steps and the code involved: the code that creates the warehouse, its tables, and its indexes; the code that extracts the data from the source applications; the code that transforms, validates, and reconciles the data; the code that transfers and loads the data; and the code that subsequently refreshes the warehouse contents. In addition, monitoring and fine-tuning the warehouse to ensure optimal performance is also a responsibility of the DBA.
1. Extracting Data
As enterprises evolve, they acquire, build, or inherit various computer systems to help them run the business. These systems are typically acquired, built, or inherited at different times and by different business units, each making its own decisions about software and hardware platforms.
The ETL process needs to effectively integrate data that comes from different:
· Hardware
· Operating systems
· Communication protocols
· Database or file management systems
Before we begin the extraction process, we need to develop a logical data map (covered in a previous session) that documents the relationship between the original source fields and the final destination fields.
An important element of this map is the data transformation. The data transformation is the guts of the ETL system. It can be expressed in English using medium- to high-level definitions, in detailed pseudocode, or in the SQL language, as in the sketch below.
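As a minimal sketch of a transformation expressed in SQL, the statement below trims a customer name and standardizes a gender code while moving rows from a raw staging table into a cleansed staging table. All table and column names are hypothetical.

-- Hypothetical transformation from raw to cleansed staging
INSERT INTO stg_customer_clean (customer_code, customer_name, gender)
SELECT customer_code,
       TRIM(customer_name),
       CASE UPPER(gender_raw)
            WHEN 'M' THEN 'Male'
            WHEN 'F' THEN 'Female'
            ELSE 'Unknown'
       END
FROM stg_customer_raw;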
Determining the System-of-Record
You should always attempt to obtain data from the source system. This is the system-of-record.
In some cases it is more convenient to obtain the data from a second or even a third system.
This may be the case if the second or third system improved on the data by cleansing it, integrating it, or simply moving it from an external or legacy platform into the same platform or database.
If you decide to obtain the data from a secondary system, it is your responsibility to ensure that the data transformation between the system-of-record and the secondary system is valid and that it can be depended on.
Derived Data
What about derived data? Should you accept derived data from the source application, or should you re-derive it yourself? You have a choice. It is easier to depend on the source application to derive the data. However, if the source application derives the data at a level that is too high for the fact grain of your data warehouse, then you are forced to recreate the derivation yourself.
If you choose to derive the data yourself, then you need to understand the business rules that drive that derivation. It is also important to ensure that your derivation formula stays in sync with the source application. If the source application changes the formula without your knowledge, then the two derivation formulas will be out of sync.
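As an illustration, suppose the source application stores only a pre-summarized monthly revenue figure but the fact grain of the warehouse is the individual order line. A minimal sketch of re-deriving the measure yourself at the finer grain, using hypothetical table and column names and an assumed business rule (revenue = quantity times unit price, less discount), might look like this:

-- Hypothetical re-derivation of extended revenue at the order-line grain,
-- applying the assumed business rule rather than accepting the summarized figure
INSERT INTO stg_sales_derived (order_id, order_line, extended_revenue)
SELECT order_id,
       order_line,
       quantity * unit_price - discount_amount
FROM stg_order_lines;

If the source application later changes its derivation rule, this formula must be changed in step with it.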
Extracting from Disparate Systems
Each data source can be on a different platform (hardware, operating system, database, etc.). Extracting data from these multiple sources requires knowledge of each of those platforms.
1) If the source system uses a modern open platform such as an RDBMS, and the data is accessible to the data warehouse team, then the data warehouse team, along with the IT staff responsible for the source system, can either write the extract manually using SQL commands or scripts (PL/SQL if Oracle, or Transact-SQL if Sybase or SQL Server), or, if an ETL tool is available, use the ETL tool to perform the extract (a SQL sketch of such an extract follows these two approaches). The extracted data can be loaded directly into staging tables (a staging table is a table that holds a copy of the extracted data), or it can be loaded into flat files for further manipulation, sorting/merging, cleansing, and filtering before loading into the staging tables.
2) If, on the other hand, the source system runs on a legacy platform such as a mainframe or midrange system, then the best approach is to find the IT specialist responsible for the maintenance of the source system and give that specialist the responsibility for extracting the required data. In this case the extracted data should be in the form of flat files that can be transferred (perhaps using FTP) to the data warehouse computer and either loaded directly into staging tables, or further manipulated, sorted/merged, cleansed, and filtered before being loaded into the staging tables.
At the end of either process above, the required data is now in specially designed staging tables, ready to be further cleansed and manipulated.
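Assuming approach 1), the following is a minimal sketch of an incremental extract into a staging table. The table names (source_orders, stg_orders, etl_control) and the last_modified column are hypothetical; the exact connection mechanism (database link, ETL tool connection, or export file) will vary by environment.

-- Hypothetical incremental extract: copy only the rows changed since the last run.
-- The etl_control table is assumed to record the timestamp of the previous extract.
INSERT INTO stg_orders (order_id, customer_code, order_date, order_amount, last_modified)
SELECT order_id, customer_code, order_date, order_amount, last_modified
FROM   source_orders
WHERE  last_modified > (SELECT last_extract_time
                        FROM   etl_control
                        WHERE  table_name = 'source_orders');

A full extract simply omits the WHERE clause and truncates the staging table first.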
Things to be aware of when dealing with legacy mainframe data
· Data resides in flat files, VSAM files, or relational or hierarchical databases. Data is encoded using the EBCDIC (Extended Binary Coded Decimal Interchange Code) character set as opposed to ASCII (American Standard Code for Information Interchange). When you transfer the data from a mainframe system to a Unix, PC, or Macintosh platform, the FTP tool will convert the characters from one encoding to the other, provided a text-mode transfer is used.
· Numeric fields use string format, packed decimal format, or binary format. In all of these cases it is best to convert those formats to a text string format prior to transferring the data.
· Decimal points are implied. It is best to convert decimal fields into a string format with a visible decimal point (see the SQL sketch after this list).
· Some data is redefined. That is, the content of a field can differ based on the context of the record. It is best to convert this information into two separate data fields; these will always be mutually exclusive.
· Multiple record layouts. Within the same file structure there can be multiple types of records. It is best to create separate output files for each record type.
· Multiple occurs. Multiple data elements (similar to an array) are stored horizontally in a single record. If the occurring data elements mean different things (for example, order date, ship date, invoice date), then simply create multiple data fields and name them appropriately. If, on the other hand, they mean the same thing, then normalize the occurrences into a row-oriented structure with one row per occurrence.
· There are no date data types. As such, date values may need to be validated to ensure they are correct dates.
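If implied decimals or unvalidated dates do reach the staging area unconverted, the cleanup can still be done in SQL after the load. The following is a minimal sketch using hypothetical staging table and column names, an assumed implied scale of two decimal places, and a deliberately simplified date check:

-- Hypothetical cleanup of mainframe-style fields after loading into staging:
-- apply the implied two-decimal scale to the raw amount field
UPDATE stg_orders
SET    order_amount = CAST(order_amount_raw AS DECIMAL(12,2)) / 100;

-- Flag rows whose date field does not match the expected YYYY-MM-DD pattern
-- or whose month portion is out of range (a simplified validity check)
UPDATE stg_orders
SET    date_is_valid = 'N'
WHERE  order_date_raw NOT LIKE '____-__-__'
   OR  SUBSTRING(order_date_raw FROM 6 FOR 2) NOT BETWEEN '01' AND '12';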
Types of Flat Files
Data in the exported flat files can be stored using fixed positions. Fixed positions identify the start and the end of each data element in the flat file. When viewing the file, all fields line up in a column-oriented fashion, as the same data field always starts at the same fixed position. When loading this information into a staging table, one must instruct the load script as to the start and end position of each data element.
Flat files can also be field delimited, for example comma or tab delimited. In this case each field is separated from the next by a delimiter character such as a comma or tab. CSV files are comma-delimited files. These files may have been extracted from critical data that is still maintained in Excel format.
It is important to ensure that the delimiter character does not appear within the field content itself. For example, make sure that there are no commas in the address field; otherwise, enclose the data in quotes.
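As a minimal sketch of handling a fixed-position file, one portable approach is to load each record into a single wide column of a raw staging table and then split it with SUBSTRING. The table names, column names, and positions below are hypothetical; database-specific loaders (SQL*Loader, for example) can also map fixed positions directly during the load.

-- Hypothetical split of fixed-position records previously loaded as whole lines
INSERT INTO stg_customer (customer_id, customer_name, postal_code)
SELECT TRIM(SUBSTRING(line_text FROM 1  FOR 10)),   -- positions 1-10
       TRIM(SUBSTRING(line_text FROM 11 FOR 40)),   -- positions 11-50
       TRIM(SUBSTRING(line_text FROM 51 FOR 10))    -- positions 51-60
FROM   stg_customer_raw_lines;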
XML files
Recently, data exchanges have increasingly used the XML data structure to define and store data content.
XML (eXtensible Markup Language) has emerged as a universal language for describing and sharing data between organizations, and it is surely becoming the standard for doing so.
If you exchange data with an external source, odds are you are receiving or sending the data using XML. If that is not the case, then at least you or your counterpart has probably discussed that option.
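Where the database supports the SQL/XML standard (XMLTABLE, available for example in Oracle, DB2, and PostgreSQL), incoming XML documents can be shredded into staging rows directly in SQL. The document structure, element names, and table names below are hypothetical, and the exact data type syntax varies by database.

-- Hypothetical shredding of an incoming XML document into staging rows
INSERT INTO stg_orders (order_id, customer_code, order_amount)
SELECT x.order_id, x.customer_code, x.order_amount
FROM   incoming_xml i,
       XMLTABLE('/Orders/Order'
                PASSING i.xml_doc
                COLUMNS order_id      VARCHAR(20)   PATH 'OrderId',
                        customer_code VARCHAR(20)   PATH 'CustomerCode',
                        order_amount  DECIMAL(12,2) PATH 'OrderAmount') AS x;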