6
ETL – DataTransformations
Microsoft Confidential. © 2006 Microsoft Corporation. All rights reserved. These materials are confidential to and maintained as a trade secret by Microsoft Corporation. Information in these materials is restricted to Microsoft authorized recipients only. Any use, distribution or public discussion of, and any feedback to, these materials is subject to the terms of the attached license. By providing any feedback on these materials to Microsoft, you agree to the terms of that license.
Microsoft Corporation Technical Documentation License Agreement (Standard)
READ THIS! THIS IS A LEGAL AGREEMENT BETWEEN MICROSOFT CORPORATION ("MICROSOFT") AND THE RECIPIENT OF THESE MATERIALS, WHETHER AN INDIVIDUAL OR AN ENTITY ("YOU"). IF YOU HAVE ACCESSED THIS AGREEMENT IN THE PROCESS OF DOWNLOADING MATERIALS ("MATERIALS") FROM A MICROSOFT WEB SITE, BY CLICKING "I ACCEPT", DOWNLOADING, USING OR PROVIDING FEEDBACK ON THE MATERIALS, YOU AGREE TO THESE TERMS. IF THIS AGREEMENT IS ATTACHED TO MATERIALS, BY ACCESSING, USING OR PROVIDING FEEDBACK ON THE ATTACHED MATERIALS, YOU AGREE TO THESE TERMS.
1. For good and valuable consideration, the receipt and sufficiency of which are acknowledged, You and Microsoft agree as follows:
(a) If You are an authorized representative of the corporation or other entity designated below ("Company"), and such Company has executed a Microsoft Corporation Non-Disclosure Agreement that is not limited to a specific subject matter or event ("Microsoft NDA"), You represent that You have authority to act on behalf of Company and agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that Company will treat the Confidential Information accordingly;
(b) If You are an individual, and have executed a Microsoft NDA, You agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that You will treat the Confidential Information accordingly; or
(c)If a Microsoft NDA has not been executed, You (if You are an individual), or Company (if You are an authorized representative of Company), as applicable, agrees: (a) to refrain from disclosing or distributing the Confidential Information to any third party for five (5) years from the date of disclosure of the Confidential Information by Microsoft to Company/You; (b) to refrain from reproducing or summarizing the Confidential Information; and (c) to take reasonable security precautions, at least as great as the precautions it takes to protect its own confidential information, but no less than reasonable care, to keep confidential the Confidential Information. You/Company, however, may disclose Confidential Information in accordance with a judicial or other governmental order, provided You/Company either (i) gives Microsoft reasonable notice prior to such disclosure and to allow Microsoft a reasonable opportunity to seek a protective order or equivalent, or (ii) obtains written assurance from the applicable judicial or governmental entity that it will afford the Confidential Information the highest level of protection afforded under applicable law or regulation. Confidential Information shall not include any information, however designated, that: (i) is or subsequently becomes publicly available without Your/Company’s breach of any obligation owed to Microsoft; (ii) became known to You/Company prior to Microsoft’s disclosure of such information to You/Company pursuant to the terms of this Agreement; (iii) became known to You/Company from a source other than Microsoft other than by the breach of an obligation of confidentiality owed to Microsoft; or (iv) is independently developed by You/Company. For purposes of this paragraph, "Confidential Information" means nonpublic information that Microsoft designates as being confidential or which, under the circumstances surrounding disclosure ought to be treated as confidential by Recipient. "Confidential Information" includes, without limitation, information in tangible or intangible form relating to and/or including released or unreleased Microsoft software or hardware products, the marketing or promotion of any Microsoft product, Microsoft's business policies or practices, and information received from others that Microsoft is obligated to treat as confidential.
2. You may review these Materials only (a) as a reference to assist You in planning and designing Your product, service or technology ("Product") to interface with a Microsoft Product as described in these Materials; and (b) to provide feedback on these Materials to Microsoft. All other rights are retained by Microsoft; this agreement does not give You rights under any Microsoft patents. You may not (i) duplicate any part of these Materials, (ii) remove this agreement or any notices from these Materials, or (iii) give any part of these Materials, or assign or otherwise provide Your rights under this agreement, to anyone else.
3. These Materials may contain preliminary information or inaccuracies, and may not correctly represent any associated Microsoft Product as commercially released. All Materials are provided entirely "AS IS." To the extent permitted by law, MICROSOFT MAKES NO WARRANTY OF ANY KIND, DISCLAIMS ALL EXPRESS, IMPLIED AND STATUTORY WARRANTIES, AND ASSUMES NO LIABILITY TO YOU FOR ANY DAMAGES OF ANY TYPE IN CONNECTION WITH THESE MATERIALS OR ANY INTELLECTUAL PROPERTY IN THEM.
4. If You are an entity and (a) merge into another entity or (b) a controlling ownership interest in You changes, Your right to use these Materials automatically terminates and You must destroy them.
5. You have no obligation to give Microsoft any suggestions, comments or other feedback ("Feedback") relating to these Materials. However, any Feedback you voluntarily provide may be used in Microsoft Products and related specifications or other documentation (collectively, "Microsoft Offerings") which in turn may be relied upon by other third parties to develop their own Products. Accordingly, if You do give Microsoft Feedback on any version of these Materials or the Microsoft Offerings to which they apply, You agree: (a) Microsoft may freely use, reproduce, license, distribute, and otherwise commercialize Your Feedback in any Microsoft Offering; (b) You also grant third parties, without charge, only those patent rights necessary to enable other Products to use or interface with any specific parts of a Microsoft Product that incorporate Your Feedback; and (c) You will not give Microsoft any Feedback (i) that You have reason to believe is subject to any patent, copyright or other intellectual property claim or right of any third party; or (ii) subject to license terms which seek to require any Microsoft Offering incorporating or derived from such Feedback, or other Microsoft intellectual property, to be licensed to or otherwise shared with any third party.
6. Microsoft has no obligation to maintain confidentiality of any Microsoft Offering, but otherwise the confidentiality of Your Feedback, including Your identity as the source of such Feedback, is governed by Your NDA.
7. This agreement is governed by the laws of the State of Washington. Any dispute involving it must be brought in the federal or state superior courts located in King County, Washington, and You waive any defenses allowing the dispute to be litigated elsewhere. If there is litigation, the losing party must pay the other party’s reasonable attorneys’ fees, costs and other expenses. If any part of this agreement is unenforceable, it will be considered modified to the extent necessary to make it enforceable, and the remainder shall continue in effect. This agreement is the entire agreement between You and Microsoft concerning these Materials; it may be changed only by a written document signed by both You and Microsoft.
In this Chapter
●General Transformation Problems and Solutions
●Guidelines for Transformation process
●Data Quality Issus and Solutions
Overview
Data transformation is key process in ETL cycle. Transformation is the phase where in we transform the raw data into a more meaningful form meeting business analysis objectives. The following diagram shows the life cycle for transformation.
In brief of various phases of Transformation process is explained below.
Validate Extract: This is the first step after extraction to validate the data for the recount count, exceptions, and partially extracted records.
Map: Mapping is the activity to map the extracted data to the target, which is warehouse or data mart. This mapping can be 1 – 1, many to 1 or 1 to many columns.
Reformat:Reformat is performed based on the mapping. The staging schema is checked for the data type, data content, and business requirement as required for the target column. This phase related to only formatting of the staging data to fit exactly into the structure of target warehouse.
Cleanse: In this phase, the incoming data is cleaned for data quality problems such as redundancy, constraints and other data consistency problems
Integrate: This phase is for integrating data from multiple systems, from archive and current data stores.
Integrate: It is the integration phase where the cleansed data is aggregated (if warehouse is higher granular), integrated from multiple tables based on mapping for creating loading readydata set.
Apply Business rules: Business rules are applied for certain calculations and deriving meaningful and required data from extracted data
Need of Transformation
First you need to decide whether you need to perform transformations at all. The following factors are major reasons for creating transformations:
Transformation Process is required
The following section lists high level factors that indicate the need for data transformation.
- Data Quality problems. Extracted data from source data stores may have data quality issues such as repeated Primary Key values, Nulls that are not suitable in a data warehouse, incorrect date formats, and referential integrity failures, cryptic codes values for some fields, inaccurate data, and data values that are inconsistent and redundant between multiple tables. The problems of data quality, guidelines and solutions are explained in Data Quality section. The major sources data quality problems are data entry errors, source system errors, normalization problems, and source database design flaws.
- Multiple Data Sources. When data is extracted from multiple data sources, more than one system can update a single logical set of data that is commonly shared among the data sources. Managing discrepancies that result from using individual data islands as the source requires active transformations.
- Data type inconsistency. Data type inconsistencies come from staging tables that are not compatible with the source systems. This is particularly problematic if one staging table is populated from heterogeneous systems such as Excel and SQL Server, where the data sources have different data types.
- Business Rules. When there is a business rules such as the requirement of a minimum balance for a savings account, and the rule is not enforced at the source, then it should be handled as part of the transformation process.
- Heterogeneous data stores. When the data is extracted from heterogeneous data stores, the source systems may have different data types, different column names, different column orders, and different data formats, particularly for date time values. These inconsistencies can be resolved during the transformation process.
- Timely. Incase of transaction records are coming ahead of its corresponding look up table values. Example: Employee basic information record is not yet available for extracted but admission information is available in transaction tables. This is a typical transactions problem at source systems.
- Custom Calculations. Custom calculations may be needed in the warehouse. The custom calculations can be performed and directly stored in the warehouse tables for query performance OR these calculations are computed while executing the query but later is a time consuming operation.
- Data mapping. If source systems cannot be mapped to the corresponding objects in staging of staging schema design, then required transformation is to be done as part of extraction process before moving the data to staging. For example: if the source contains a customer’s birth date, but you want the age stored in a staging table, the transformation process would take care of the conversion as it loads the staging database.
- Hierarchical sources. It is about the entity identification at different levels of the source. For example, suppose that the central server keeps track of all transaction and look-up tables needed by the warehouse, and that the data stored in that central server is not a real time. Then when you extract data from both the central server and real-time child sources, you may have duplicated data that needs to be managed in a transformation process.
- Record tracking. If there is no record tracking at source, then identifying the Insert, Update, and Delete changes would be part of the data transformation process.
Transformation process can be avoided
Transformation process is not mandatory in the data warehouse loading process. It can be eliminated in the following scenarios. Eliminating transformation layer can reduce performance overhead of loading data to warehouse, so deciding if the transformation is really required in the system would benefit total processing time.
Transformation involves simple data modifications: If there is no need of
- Record tracking. Typically if there is no record tracking mechanism and no transaction identifying mechanism at source then the same is accomplished using transformation layer. In no record tracking scenario at source, all the record are extracted in each extraction process and compared against the warehouse or accumulated staging to determine the kind of transactions happened from last extraction. So, if the extracted records do have the transaction code as well as the process can effectively bring only the delta records (inserted/deleted/updated records from last extraction) then the transformation can be eliminated as the transaction can directly be applied to warehouse without any transformation.
- Data cleansing. If there are no data quality issues and the source already has constraints to keep data consistent and accurate (PK, FK, Check constraints or procedural constraints) then there is no need for separate transformations before loading the data to warehouse.
- Merging source systems. If there is no need of merging data from multiple source systems and multiple sources are not involved to accumulate data at a common place then and the source eliminate duplicates, referential integrity
- Complex decoding. Source system would be embedded with codes that is interpreted or data represented by source application. To have these codes be loaded to warehouse, data representation code them convert to user understandable values. If the decoding process is not complex to have a separate layer of transformation then the layer can be eliminated.
- Complex interpretation of default values. If the default values are just codes from the look up tables or simple codes (such as 1 for male and 2 for female) that can be converted to meaningful words while extracting data. There is no need of separate transformation process after extracting data unless there is a complex process involved to interpret default values. The general default values in the system are typically assume if the column is not allowed for null values, the value is not appropriate for the column, incorrect entry, not available yet.
- Custom calculations. If there are no complex calculations involved before loading the data to warehouse, the transformation layer can be eliminated. Simple calculations such as simple arithmetic calculations can be incorporated in extraction procedures.
Transformation Process
Transformation process involves the following steps at a conceptual level:
- Study the source system
- Location of Transformation Process
- Validate Data Extract (Selection)
- Delta Detection
- Continuity of Delta
- Count Check
- Capture bad records
- Mapping
- Data mapping document preparation
- Reformat
- Conversion
- Data Cleansing
- Data Quality checks
- Splitting/Joining
- Integrate
- Summarization
- Enrichment
- Applying Business Rules
Study the source system
Source System, which is a data source for the extraction system is to be studied from business domain point of view as well as the structural and integrity of the source system. Source system is not assumed to be a relational data store, it can be flat files or another database apart from SQL Server. In this section, a list of questions (can be treated as check lit) that need to discussed with source system (OLTP) database architect.
Studying the source system is not a one time activity, it is to be performed whenever
- Extraction process is enhanced
- Source business process is changed
- Source schema is changed
- New business functionality is to be extracted
- Existing data checks/constraints on the source system are modified
The check list which should be discussed is as given below
- Check how the data is fed into data source from which extraction is performed, any external applications or data is fed using the flat files or heterogeneous systems
- This helps the understand the complexity to deal with while interacting with the source system
- Typically structural complexity increases when you lack details of the source system or have less control over it
- Incase of multiple data sources involved in extraction, it would result in more than one extraction process is needed to extract data and consolidation complexity
- Extracting information if the data is fed to the source system by application would enable to check if data input screens are free text in nature where the data pollution may occur
- Discuss the data consistency or data quality checks performed on the source system
- Mapping your warehouse requirement against the checks done at the source to see if the data checks will completely stop the polluted data to get into source system
- As part of learning the source system – you need understand at what level of details and how much time you are going to invest in leaning the source system. You may or may not have the luxury to learn end-to-end source system (all the layers from UI to database schemes and all the dependencies). Sometimes, it might be as simple as evaluating the schema or sometimes understand how the user enters data
- Check the integrity constraints managed in source system
- Check the constraints enforcement on source system is using declarations or procedures. If the checks are enforced using procedures then check the possibility of data slipped through to cause data pollution
- If the data checks are enforced properly then they do not need to be checked again as part of transformation process
- Verify if the data will be in consistent at the time of data extraction, or at point of time
- In some systems, the processing is completed at the end of the day, extracting data in the middle of the day would make the data inconsistent would lead to data quality problems
- Transfer of check , if the check is process at the end of the day….transaction is completed, but available at the source; at 5 ‘O, Clock….pending transaction – Completeness of the transaction
- Verify the accuracy of the calculations performed on the source system
- Though it is evident that any problem in source system calculations would have been identified and fixed, the question is about the unused calculations whose problem is not identified yet but need to be extracted for analysis requirements
Example, source applications might not be using employee leave balance so calculation problems, if any, are not identified. Having these calculations extracted to warehouse system would induce incorrect calculations. All the calculations that are extracted from source system are to be checked for accuracy.
- Discuss the columns that would allow NULL on source system
Nullability of a column is to be decided by the source systems, when there is no expectation of NULL values from source. Having nullability of a column which never gets null values is an overhead for the table