Dr. Bjarne Berg DRAFT
Data Warehouse Approaches
Data warehousing really came of age in the 1990s. In the beginning much of the data warehousing was done through loading the transactional data into separate tables on an external system. These tables were often “flat”, meaning that they were highly denormalized and had the data repeated for each transaction. This optimized the performance of the reporting system, removed the need for data models and simplified the development effort of the data warehouse. Unfortunately, it also caused the data warehouses to be huge even when few transactions were loaded (i.e. under this approach, all relevant customer data is replicated for each transaction record).
This was a very common approach and is still in use in many data warehouse organizations. However in 1990-95 some though leaders started to propose new ways of manage the data volume. William H. Immon proposed in his earlier books and articles, that an operational data store (ODS) was needed to be a close representation of the transaction system data. He proposed that this ODS should be modeled close to a 3rd normal form (reducing data redundancies), but that some exceptions could be made for single value fields (i.e. code descriptions) which could be denormalized into the ODS model. The benefits of the ODS was that the organization could store integrated, cleansed and standardized data in one location for all decision support systems to leverage within the organization. It would also mean that only one interface was needed from each transactional system instead of dozens feeding various DSSs. For the analytical community the organization would build analytical applications faster by reducing the number of programs needed to source the data. Under Immon’s approach the data could be sourced from a single location.
Immon also suggested that the data should only be conditioned (i.e. only extracting certain sales records), but that the data into the ODS should never be cleansed or transformed without retaining the original values as well. This would allow for data reconciliation between the system or records (SOR) and the data warehouse. Any transformation specific to an analysis process he argued, should occur when the data was moved to an analytical application. In its purest form, the ODS would never be queried, but would be used for feeding other applications, support data mining and generate batch reports only. Allowing for managed query environments (MQE) and OLAP on the ODS would significantly degrade performance of the ODS and also not provide satisfactory end user results since the table joins would significantly slow the query performance.
At the same time Ralph Kimball, the founder of the ETL company Prism, sold his assets in the company and started focusing on the data architecture of the data warehouse. Kimball initially argued that there were no real need for a ODS, especially one that used 3rd normal for as the modeling technique. Instead, he suggested that a Persistent Storage Area (PSA) or staging area of inbound data was created. This, he argued should be denormalized to the highest degree technically feasible. The staging area would then feed a set of star schemas (his newly proposeddimensional modeling technique).
The Kimball approach argued that the data within the star schema should contain a set of fact tables organizedaround a subject area such as credit management, instead of transactions such as payment records and account receivables records. These fact tables would then be associated with a set of dimensions that either was unique to that start schema or dimensions that was shared across multiple start schemas. The latter became known as conformed dimensions or “master data”. The performance of the model would come from the reduced table joins that resulted form denormalizing the dimensions that contained descriptive characteristics of an object. Common dimensions included organizations, customers, suppliers, stores, products and shippers. The only dimensions that was required for each star schema was the time-dimensions, which described the day/week/month/quarter/year of the transaction date.
This use of the dimensional data models created a comprehensive set of transactions records that shared some dimensions and which also was able to capture historical truths and changes over time through some simple modeling techniques known as “slowly changing dimensions type I, II and III”. The core benefit of the Kimball approach was that the data warehouse was optimized for query performance and could easily be used to support Managed Query Environment such as Brio, Business Objects, Cognos Impromptu and Oracle Discoverer. The modeling technique also supported the use of Relational On-Line Analytical Processing (ROLAP) tools such as Information Advantage and MicroStrategy as well. Therefore there were no need for further processing of the data to other analytical applications optimized for reporting and the owners could both reduce the data latency as well as the total cost of ownership since data would not have to be replicated many times. However, a major drawback of the classical Kimball approach is the easy proliferation of customized data marts that are never truly integrated. The Kimball approach allows for many simultaneous development efforts that can be semi-autonomous and which only require integration in areas of shared conformed dimensions. This often leads to a plethora of almost independent data marts that are hosted on the same physical hardware, but have very little else in common.So while both approaches have benefits, they also have other limitations. The classical Immon approach causes a high degree of data replications and a substantial modeling effort to mimic the transactional system(s), while the classical Kimball method does not provide an on-line storage area where data can be reconciled and minimized in volume through normalization.
After these ideas was introduced, many have opted for a hybrid method where the data warehouse employ a staging are for inbound data files that is loaded into a partially normalized ODS. From the ODS, specific applications are built that may use a dimensional data model for increased query performance. Today many vendors such as SAP and Oracle have employed this approach into their packaged data warehouse solutions. It is worth noticing that today Mr. Immon is actually a paid consultant and advisor for SAP’s Business information Warehouse that employs such a hybrid approach. In general, the best approach is to design a comprehensive data warehouse solution that covers the key architectural components including data staging, on-line data hub and ability to interactively access the data in a manner that is optimized for OLAP and/or MQEs.
1