OWB Best Practices
Author: Oracle SSI BI&DW team
Creation Date: November 8, 2002
Last Updated: December 26, 2002
Version: 1.0
Document Control
Change Record
Date / Author / Version / Change Reference /12-Nov-02 / Dileep Vasu / 1.0 / No Previous Document
Reviewers
Name / Position /Distribution
Copy No. / Name / Location /1 / Library Master / Project Library
2 / Project Manager
3
4
Contents
Document Control ii
Introduction 4
ETL Techniques, Hints and Tips 5
Implementing Slowly Changing Dimension (SCD) using OWB 5
Error Messages and Resolutions 12
Mapping doesn’t open, “java.lang.NullPointerException” error displayed 12
“Unable to generate stable set of records” error occurs in a MERGE load 12
ORA-0600 error occurs in a MERGE load 12
Appendix A 13
The RESTORE procedure 13
6
ETL Techniques, Hints and Tips
OWB Best Practices
Introduction
Oracle9i Warehouse Builder (OWB) is a tool to enable the design and deployment of enterprise data warehouses, data marts and e-Business Intelligence applications. The tool is a component of the Oracle9i Development Suite and is part of the overall Oracle tools offering. Warehouse Builder allows users to design their own Business Intelligence application from start to finish. Dimensional design, ETL process design, extraction from disparate source systems, extensive metadata reporting and integration with Oracle 9iAS Discoverer, Oracle Workflow and Oracle Enterprise Manager enable an integrated Business Intelligence solution with Warehouse Builder at the core. (From the OWB FAQ)
This document draws on the experience of BI&DW consultants at Oracle Solution Services, India in using OWB as the primary ETL tool in various projects, which range from integrating SAP with Oracle Apps EDW to building packaged business intelligence solutions.
ETL Techniques, Hints and Tips
This section discusses techniques for using OWB to address certain requirements, and hints and tips to reduce or speed up development effort.
Implementing Slowly Changing Dimension (SCD) using OWB
About Slowly Changing Dimensions
When implementing data warehouses, there may be a requirement for maintaining a history of changes that have occurred to a dimension over the course of time. For example, consider a star schema with a sales fact and product, time and geography dimensions. The details that are available about the products of the organization are Product SKU Number, Product Name and Product Description. The end users of the organization would like to compare the sales of their products before and after their names are changed during promotions.
In a typical warehouse dimension load, when new details of a record already existing in the dimension are loaded, they overwrite whatever information exists previously. In our example, only the final name of the product would be available for reporting. In data warehousing terminology, such a dimension, where only the last values are maintained, is called a Slowly Changing Dimension Type I.
If history is to be maintained, the major issue is that when a product’s name is changed in the source (OLTP) system, the SKU number, which is the unique identifier of the record, remains the same. So, in addition to the SKU code, we add another unique identifier to the record in the data warehouse. Such a unique identifier or key that is specifically maintained by the data warehouse is called a surrogate key[1]. The unique identifier in the source OLTP system (in our example, the SKU code) is called a natural key. We define an SCD2 attribute to be an attribute of a dimension for which we want to capture history (Product Name in the example).
When transactions are to be loaded into fact tables from the OLTP system (i.e. sales transactions in our example), they must look up the dimension table to find their corresponding surrogate key values before they are loaded into the fact table.
With surrogate keys, it becomes possible to capture history for dimensions as follows. Such an implementation is called a Slowly Changing Dimension Type II.
The first time the dimension is loaded, the product’s details are inserted into the dimension table with a generated surrogate key.
If the product’s name changes in a subsequent load, a new record is inserted in the dimension table, with a new surrogate key value.
In the method described here, we will make use of the LAST_UPDATE_DATE timestamp of the source record to determine which dimension record must be selected during the fact load.
Source and Target
In this example, the source table contains the SKU number, the name and description of all products. It also has the OLTP timestamp for when the record was created or last updated, the LAST_UPDATE_DATE column. This data is to be loaded into the Product Dimension. While the SKU number is the natural key, the PR_ID column in the dimension table is the surrogate key.
The Product Name column is an SCD2 attribute; that is, if the name of the product changes, a new record for the product is inserted into the dimension table with a new surrogate key. The Product Description is not an SCD2 attribute, so if the product’s description changes, it simply updates the latest record with the new product description.
The Start Date and End Date columns in the dimension hold the dates for which that particular record is valid. Whenever a product record is inserted into the dimension, the Start Date timestamp is the last_update_date from the source table. When the SCD2 attribute (Product Name) changes, the earlier record’s End Date should be updated with the last_update_date of the new record.
Mapping
The mapping implementing SCD is shown below:
First, we outer-join the dimension table to the source table to select all the records that are going to be newly inserted and those that require SCD processing. The join condition is based on the natural key. In this example, the join condition is:
SRC.PRODUCT_SKU_NUMBER = DIM.PR_SKU_NUMBER (+)
AND DIM.PR_END_DATE IS NULL
The result OUTGRP of this join contains columns from both the dimension as well as the source table.
Next, we split the records into two groups:
INSERT_UPDATE: The records in this group are those that would be inserted (new records) or would update the existing records (the source records that have changes only in the non-SCD2 attributes). The split condition here is:
INGRP1.DIM_PR_SKU_NUMBER IS NULL
OR (INGRP1.DIM_PR_SKU_NUMBER IS NOT NULL
AND INGRP1.SRC_PRODUCT_NAME = INGRP1.DIM_PR_NAME)
SCD: The records in this group are the ones that satisfy the SCD condition (i.e. a record with the same SKU number already exists in the dimension, but with a different product name). Thus, the split condition for this group is:
INGRP1.SRC_PRODUCT_SKU_NUMBER = INGRP1.DIM_PR_SKU_NUMBER
AND INGRP1.SRC_PRODUCT_NAME != INGRP1.DIM_PR_NAME
These are then brought into three groups in a UNION ALL set operation:
The first group is mapped directly from the source table columns in the INSERT_UPDATE group in the Splitter. Now if the record is to be an update, we must retain the existing Start Date in the dimension. If the record is a new record and has to be inserted, it must be inserted with the last update date in the source table as the Start Date. To take care of both the cases, we introduce an NVL expression such that if the dimension table’s Start Date column is null (because of the outer-join), the Last Update Date from the source column should be selected. In effect, the expression is:
NVL(DIM_PR_START_DATE, SRC_LAST_UPDATE_DATE)
This is mapped to the PR_START_DATE column in the UNION ALL input group.
The second group contains the fields of the SCD records that are to be inserted into the dimension. These fields are mapped from the columns that result from the join (in the first step) that are from the source table.
The third group contains the details of the old record in the dimension that is to be updated with the End Date. So, all the fields are mapped from the fields in the dimension (that result from the join in the first step). The End Date field is mapped from the LAST_UPDATE_DATE in the source table column.
The output group of the UNION ALL is mapped to the dimension table. The output group contains all the columns of the dimension, except the surrogate key column (PR_ID). The surrogate key is generated by a sequence (only) when a new record is loaded.
The load type of this mapping is INSERT/UPDATE. The update matching/loading criteria are defined as follows:
Start date column (PR_START_DATE) / SCD2 columns (PR_NAME) / Other columns [SCD1]
Insert: Use for loading / Yes / Yes / Yes / Yes
Update: Use for matching / No / Yes / Yes / No
Update: Use for loading / No / No / No / Yes
This ensures that the MERGE statement is generated for the load.
Fact loads would make use of their transaction date for selecting the appropriate surrogate key. In this example, if transactions were being loaded into the Sales fact table from a Sales Staging table (SALES_STG), the lookup join would be:
SALES_STG.SKU_NUMBER = PRODUCT.PR_SKU_NUMBER
AND SALES_STG.TRANSACTION_DATE >= PRODUCT.PR_START_DATE
AND SALES_STG. TRANSACTION_DATE < PRODUCT.PR_END_DATE
Notes:
See also Ralph Kimball’s article on Slowly Changing Dimensions: http://www.dbmsmag.com/9604d05.html
In this example, we have a predefined set of columns for which we require history. If a change in any column of the dimension is to be captured, the simplest approach is to use the last_update_date along with the natural key for the update matching criteria.
Instead of updating the end date of the existing dimension record with the last_update_date of the new record, we could update it with a date one day behind, i.e. last_update_date – 1. This allows the fact loads to use the BETWEEN clause instead of the inequality.
If the source system records do not have a last_update_date timestamp, the load date can be used to populate the start date and end date. What this will mean for the fact load is that all transactions subsequently loaded after the SCD change would use the latest dimension record ID, irrespective of transaction date.
Error Messages and Resolutions
This section covers some common (and not so common) error messages that are encountered in using OWB, with probable causes and resolutions.
Mapping doesn’t open, “java.lang.NullPointerException” error displayed
Version: 9.0.2.56 +
This error commonly occurs in an abnormal situation where the repository has duplicate names for the same object. The solution for this problem is to deploy and execute the RESTORE procedure (see Appendix A) in the OWB repository. Deploy the procedure in the repository schema, execute RESTORE and commit to solve the problem.
“Unable to generate stable set of records” error occurs in a MERGE load
Version: 9.0.2.56 +
This condition occurs when more than one row being loaded satisfies the update matching criteria. The solution to this problem is to verify the update matching criteria for the load to ensure that each record is unique, and also possibly enhance the mapping so that the data being loaded is “clean” and does not contain duplicate records.
ORA-0600 error occurs in a MERGE load
Version: 9.0.2.56 +
This internal error may occur because the version of the Oracle database has some bugs with MERGE. A solution is to upgrade to a later version that does not have the problem. In one instance upgrading the database from version 9.0.1.0 to version 9.0.1.2 resolved the problem.
See the following note on Metalink for more information on ORA-0600
http://metalink.oracle.com/metalink/plsql/ml2_documents.showDocument?p_database_id=NOT&p_id=144652.1
Appendix A
The RESTORE procedure
CREATE OR REPLACE
PROCEDURE RESTORE IS
CONT NUMBER := 0;
CURSOR C1 IS
SELECT FIRSTCLASSOBJECT,
PARAMETERGROUP,
NAME
FROM CMPMAPPINGPARAMETER_V
GROUP BY FIRSTCLASSOBJECT,PARAMETERGROUP,NAME
HAVING COUNT(*) > 1;
CURSOR C2(P_FIRSTCLASSOBJECT NUMBER,P_PARAMETERGROUP NUMBER,P_NAME VARCHAR2)
IS
SELECT FIRSTCLASSOBJECT,
PARAMETERGROUP,
ELEMENTID
FROM CMPMAPPINGPARAMETER_V
WHERE FIRSTCLASSOBJECT=P_FIRSTCLASSOBJECT
AND PARAMETERGROUP=P_PARAMETERGROUP
AND NAME = P_NAME;
BEGIN
FOR REC IN C1 LOOP
--DBMS_OUTPUT.PUT_LINE(REC.FIRSTCLASSOBJECT||'_'||REC.PARAMETERGROUP||'_'||REC.NAME);
CONT := 0;
FOR REC1 IN C2(REC.FIRSTCLASSOBJECT ,REC.PARAMETERGROUP,REC.NAME)LOOP
--DBMS_OUTPUT.PUT_LINE(REC1.FIRSTCLASSOBJECT||'_'||REC1.PARAMETERGROUP||'_'||REC1.ELEMENTID);
UPDATE CMPMAPPINGPARAMETER_V
SET NAME = NAME||CONT,
LOGICALNAME = LOGICALNAME||CONT
WHERE FIRSTCLASSOBJECT = REC1.FIRSTCLASSOBJECT AND
PARAMETERGROUP = REC1.PARAMETERGROUP AND
ELEMENTID = REC1.ELEMENTID;
DBMS_OUTPUT.PUT_LINE('1');
CONT := CONT +1;
END LOOP;
END LOOP;
END;
6
ETL Techniques, Hints and Tips
OWB Best Practices
[1] This is not the only reason for maintaining surrogate keys. For a more complete discussion about the use of surrogate keys, see Ralph Kimball’s article at http://www.dbmsmag.com/9805d05.html