The Very Basics of Data Warehouse Design

Data Modeling in Data Warehouses

Entity relationship modeling is used for OLTP or transactional databases, which require a large number of small operations. Data warehouses, on the other hand, require small numbers of large transactions, mostly for data loading and reporting.

i  Data Mart is a term often bandied around in data warehouse terminology. A data mart is simply a subsection of a data warehouse.

The Relational Database Model

Let’s take a quick look at a relational entity relationship diagram in Figure 1. This model was built for a customer who ships containers of goods between various countries by sea.

Shipping Tracking Entities

·  Consignee. The party receiving the container contents.

·  Sender. The party shipping the container contents.

·  Vessel. The ship on which the container was transported.

·  Voyage. Ships make voyages between multiple ports where individual containers or groups of containers are transported on all or some legs of the entire voyage.

·  Shipment. An individual container shipped from one port to another.

·  Contents. The contents of a container; in effect, the Bill of Lading.

·  Group. A group of containers is transported from one port to another as a single shipment.

·  Container. An individual container.

·  Type. A container can be refrigerated, open-topped, a flatbed, amongst numerous other types.

·  Depot. A container depot or container port.

·  Damage. Containers can sustain damage.

·  Scrap. Damaged containers can become irreparably damaged and have a scrap value.

·  Lost. Containers can be stolen or sometimes even lost at sea.

Figure 1: Tracking Containers

The Dimensional Data Model

An entity relationship model is inappropriate for the requirements of a data warehouse, even a Denormalized one. Another modeling technique used for data warehouses is called dimensional modeling. In layman’s terms a dimensional model consists of facts and dimensions. What does that mean? What is a fact and what is a dimension? A fact is a single iteration in a historical record. A dimension is something used to dig into, divide and collate those facts into something useful. That isn’t really layman’s terms now, is it? Let’s try to explain this a little more easily by example.

Figure 1 shows an entity relationship diagram, so let’s explain dimensional modeling in small steps. Figure 2 shows the same entity relationship diagram with a slight difference. Vaguely, facts are the equivalent of transactional entities and dimensions are the equivalent of static data. Therefore, in Figure 2 the dimensions are colored green and the facts are colored blue. Note how the facts represent historical or archived data and the dimensions represent smaller, static data entities. It follows that dimension tables will generally be small and fact tables can become relatively huge. What does this tell us? Fact tables will always be appended to, and dimension tables can be changed, preferably not as often as the fact tables are appended to. The result is many very small tables, each related to groups of rows in very large tables, as the SQL sketch following Figure 2 illustrates.

Figure 2: Highlighting Dimensions and Facts
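
To make this concrete, here is a minimal Oracle SQL sketch of a dimension table and a fact table. All table and column names here, and in the sketches that follow, are hypothetical, invented for illustration rather than taken from the figures.

CREATE TABLE vessel (
   vessel_id    NUMBER PRIMARY KEY,   -- surrogate key
   vessel_name  VARCHAR2(64)          -- small, static, descriptive data
);

CREATE TABLE shipment_fact (
   shipment_id  NUMBER PRIMARY KEY,
   vessel_id    NUMBER,               -- foreign key value pointing to a dimension
   container_id NUMBER,               -- foreign key value pointing to a dimension
   sail_date    DATE,
   weight_tons  NUMBER
);

-- The fact table is continually appended to; the dimension rarely changes.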

What effectively happens when using dimensions and facts is what is ultimately called a Star Schema. Figure 3 and Figure 4 show slightly modified, pseudo star schema versions of the Normalized entity relationship diagrams in Figure 1 and Figure 2. In Figure 3 we can see that all dimensions would be connected to a single fact table containing shipping history records of containers. Each row in the fact table would have foreign key values to all related dimension tables.

Figure 3: Used Containers Star Schema

Figure 4 simply contains another fact table, or another subset of the data contained in the Normalized entity relationship structure in Figure 1 and Figure 2. It is quite conceivable that the two fact table sets in Figure 3 and Figure 4 could be merged into a single table, separating used, damaged, scrapped and lost containers by an appropriate type column.

There could be a small problem with the fact table shown in Figure 4. Damaged, scrapped and lost containers could be represented either as a fact table or as part of the Container dimension. The decision would depend on exactly how often containers are damaged, scrapped or lost. It is more than likely that these events occur frequently in relation to changes in the other dimensions, but not necessarily.

Figure 4: Damaged, Scrapped and Lost Containers Star Schema

What is a Star Schema?

A star schema contains one, or at least very few, very large fact tables, plus a large number of small dimension tables. As already stated, fact tables effectively contain transactional histories and dimension tables contain static data describing the fact table archive entries. The objective for performance is to obtain joins on a single join level, where one fact table is joined to multiple small dimension tables, or perhaps even a single dimension table. Figure 5 shows a snowflake schema for the used container portion of the original Normalized structure in Figure 1 and Figure 2, assuming that the damaged, scrapped and lost containers represented in Figure 4 are rolled into the Container dimension.

Figure 5: A Snowflake Schema

What is a Snowflake Schema?

A snowflake schema is a star schema in which the dimension tables are Normalized.
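
As a hypothetical sketch, snowflaking would split the container type out of the Container dimension into its own table:

CREATE TABLE container_type (
   type_id    NUMBER PRIMARY KEY,
   type_name  VARCHAR2(32)            -- refrigerated, open-topped, flatbed, ...
);

CREATE TABLE container (
   container_id NUMBER PRIMARY KEY,
   type_id      NUMBER REFERENCES container_type  -- the extra join level making this a snowflake
);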

i  Dimension objects can be used in Oracle to represent multiple layers between dimensions; thereby assisting Optimizer efficiency and materialized view query rewrite selection.

The schema shown in Figure 5 is actually a snowflake schema because the Type dimension has been Normalized out of the Container dimension. Figure 6 shows a star schema representation of the snowflake schema in Figure 5, with the Type dimension included in the Container dimension. There is no dimensional Normalization in a star schema.

Figure 6: A Star Schema
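
A typical query against a star schema joins the single large fact table to a few small dimension tables on one join level. A sketch using the hypothetical tables above:

SELECT v.vessel_name, c.container_id, SUM(f.weight_tons)
FROM   shipment_fact f, vessel v, container c
WHERE  f.vessel_id = v.vessel_id
AND    f.container_id = c.container_id
GROUP  BY v.vessel_name, c.container_id;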

Data Warehouse Design Considerations

The Time Dimension

Quite simply put, the time dimension determines how long archived or historical data is to be retained.

Granularity Level

Probably the most important design issue with fact tables is the level of granularity. Simply put, this means deciding whether to store all the data or to summarize it. Storing all data can lead to very large fact tables and thus very large databases. However, after data has been deleted from your transactional data sources, it could prove costly to discover that all the archived dimensional-fact combinations are required for reports at a later date. From a planning perspective it might be best to begin by retaining all facts down to the smallest detail, if at all possible. Data warehouses are expected to be large and disk space is cheap.

Surrogate Keys

When designing tables for data warehouses, use surrogate keys, or unique sequential identifiers, for entity primary keys. The reason for this is that data may come from multiple data sources, from different databases, and perhaps even from different database vendors’ software. Additionally, data can originate from flat text files and perhaps from externally purchased data sets. Keys in different data sources could be named differently while containing the same values, or named the same while containing different values. The data warehouse needs its own unique key values specific to itself as a distinct data entity set.
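
In Oracle, surrogate keys are typically generated from sequences. A minimal sketch with hypothetical names, keeping the source system’s key value for traceability:

CREATE SEQUENCE sender_seq;

CREATE TABLE sender (
   sender_id   NUMBER PRIMARY KEY,    -- surrogate key, unique to the warehouse
   source_key  VARCHAR2(32),          -- the key as it appeared in the source system
   sender_name VARCHAR2(64)
);

INSERT INTO sender (sender_id, source_key, sender_name)
VALUES (sender_seq.NEXTVAL, 'CRM-00417', 'Acme Exports');  -- values invented for illustration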

Duplicating Surrogate Keys and Associated Names

When creating composite column keys for child entities, pass parent key columns into the child entities. Copy not only the key values into the child entities but also the unique naming values they represent, as shown in Figure 7 and sketched in SQL below.

Figure 7: Duplicate Details
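
A hypothetical sketch of the idea: the child entity carries the parent’s key column as part of its composite primary key, and the naming value that key represents is copied alongside it.

CREATE TABLE voyage (
   vessel_id    NUMBER,               -- parent key column passed into the child
   voyage_id    NUMBER,
   vessel_name  VARCHAR2(64),         -- parent naming value duplicated with the key
   PRIMARY KEY (vessel_id, voyage_id)
);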

Referential Integrity

Many data warehouses probably do not have Referential Integrity implemented at all. If implementing Referential Integrity, use constraints. If creating foreign key constraints, indexing those foreign keys might be sensible, especially if dimension table primary key values are to be updated or deleted. Whatever you do, never use triggers: triggers are ten times slower than constraints for implementing Referential Integrity. Most importantly, remember that a designer’s perspective of constraints as extra overhead pales into insignificance beside the cost of not having those constraints for much of the functionality required of data warehouse data. Serious problems, particularly performance problems, may result if Referential Integrity is not implemented using primary and foreign key constraints. Additionally, relying on data coming into the data warehouse always being 100% correct is foolhardy. If the data warehouse is required to produce correct answers then data integrity is a must!
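
For example, a foreign key constraint plus a supporting index on the hypothetical fact table sketched earlier:

ALTER TABLE shipment_fact ADD CONSTRAINT fk_shipment_vessel
   FOREIGN KEY (vessel_id) REFERENCES vessel;

-- Index the foreign key so that updates or deletes of dimension primary keys
-- do not force full scans (and locks) on the fact table.
CREATE INDEX shipment_vessel_fk_i ON shipment_fact (vessel_id);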

Managing the Data Warehouse

Remember that the most important consideration is servicing applications and users. If users need ad-hoc reporting then the data model must provide for that ad-hoc requirement, even though ad-hoc queries place a heavy load on a database. Additionally, consider aggregations, repetitive data loading, perhaps even complete reloads, plus maintenance activities such as re-indexing, backup and recovery. In other words, build your data warehouse such that it is not only easily maintainable but also usable with an acceptable level of performance.

Special Oracle Goodies for Data Warehouses

Materialized Views and Query Rewrite

Use materialized views for creating aggregates. Materialized views can be created and maintained in numerous different ways, so many requirements can probably be catered for. The important factor in utilizing materialized views is that queries can actually be rewritten internally by the Optimizer to read smaller physical areas of disk, covered by materialized view aggregations, rather than entire fact tables. Obviously, the faster the data content changes, the higher the overhead for refreshing those materialized views.

i  Query rewrite requires the setting of the QUERY_REWRITE_ENABLED parameter in the configuration parameter file.
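
A sketch of an aggregate materialized view enabled for query rewrite, again with hypothetical names:

ALTER SESSION SET QUERY_REWRITE_ENABLED = TRUE;

CREATE MATERIALIZED VIEW shipment_by_vessel
   BUILD IMMEDIATE
   REFRESH ON DEMAND
   ENABLE QUERY REWRITE
AS
SELECT vessel_id, COUNT(*) AS shipments, SUM(weight_tons) AS total_tons
FROM   shipment_fact
GROUP  BY vessel_id;

-- A matching aggregate query on shipment_fact can now be silently rewritten
-- by the Optimizer to read the much smaller materialized view instead.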

Parallel Processing and Partitioning

Oracle is very capable of executing parallel processing on multiple CPU platforms, especially in combination with partitioning, RAID arrays or physically separated datafiles. Partitioning is very important, not only to parallel processing but also to the ability to switch large chunks of data in and out, and perhaps even allow access to most of the data whilst new data is loaded into a single partition or subset of an entire table.
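
A hypothetical range-partitioned fact table with a parallel degree, where a freshly loaded chunk can be switched in by exchanging a partition with a staging table:

CREATE TABLE shipment_fact_hist (
   shipment_id NUMBER,
   vessel_id   NUMBER,
   sail_date   DATE,
   weight_tons NUMBER
)
PARTITION BY RANGE (sail_date) (
   PARTITION p2003 VALUES LESS THAN (TO_DATE('2004-01-01','YYYY-MM-DD')),
   PARTITION p2004 VALUES LESS THAN (TO_DATE('2005-01-01','YYYY-MM-DD'))
)
PARALLEL;

-- Swap new data in whilst the rest of the table remains available:
-- ALTER TABLE shipment_fact_hist EXCHANGE PARTITION p2004 WITH TABLE staging_2004;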

i  Transportable tablespaces are a very effective method of copying large physical chunks of data between databases.
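
A rough outline of the steps, with a hypothetical tablespace name; the export and datafile copy details vary by Oracle version:

-- On the source database, make the tablespace read only:
ALTER TABLESPACE container_hist READ ONLY;

-- Export its metadata (original export utility shown; Data Pump is similar):
--    exp ... transport_tablespace=y tablespaces=container_hist
-- Copy the datafiles to the target, import the metadata there, then:
ALTER TABLESPACE container_hist READ WRITE;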

Indexing and Clustering

Indexing and clustering are important factors for increased data warehouse performance. BTree indexes are usually best for transactional activities. However, there are a multitude of methods by which BTree indexes can be speeded up for read-only activity, such as using key compression. Additionally, there are bitmap indexes for low cardinality data, plus index-organized tables and clusters for pre-sorting and pre-joining tables respectively. Materialized views have already been briefly discussed. There is immense capacity, using specific types of Oracle indexing, to pre-join, pre-sort and pre-classify data. In a read-only data warehouse type environment these specific Oracle objects should be utilized to their fullest capabilities.
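
For instance (hypothetical names), a key-compressed BTree index for read-mostly access and a bitmap index on a low cardinality column:

-- Key compression stores repeating prefix values only once per leaf block:
CREATE INDEX shipment_vessel_date_i
   ON shipment_fact (vessel_id, sail_date) COMPRESS 1;

-- Bitmap index suited to a low cardinality column in a read-mostly warehouse:
CREATE BITMAP INDEX container_type_i ON container (type_id);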

Star Transformations

The Optimizer utilizes star transformations in order to improve the performance of joining fact and dimension tables. Star transformations use bitmap indexes on the fact table foreign key columns pointing to dimension tables, allowing various highly efficient index lookup and joining methods.

i  Star transformation requires the STAR_TRANSFORMATION_ENABLED parameter set to TRUE in the configuration parameter file, or use of the STAR_TRANSFORMATION hint.
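
For example, enabling the parameter at the session level and hinting a star query against the hypothetical tables:

ALTER SESSION SET STAR_TRANSFORMATION_ENABLED = TRUE;

SELECT /*+ STAR_TRANSFORMATION */ v.vessel_name, SUM(f.weight_tons)
FROM   shipment_fact f, vessel v
WHERE  f.vessel_id = v.vessel_id
GROUP  BY v.vessel_name;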

i  Bitmap indexing can cater not only for a few distinct values but also for thousands of distinct values efficiently. It is possible that bitmap indexes may not be as fast as some Oracle texts would lead us to believe. In fact, the biggest problem with bitmap indexing, and possibly the cause of its unfortunate reputation, is consistent use in transactional databases. If frequent DML activity occurs on a bitmap index it can deteriorate to the point of uselessness over time. Additionally, INSERT activity may not deteriorate bitmap index efficiency as fast as UPDATE and DELETE activity does.

Oracle SQL Aggregation, Analysis and OLAP

Oracle SQL contains a multitude of aggregation functions, encompassing basic arithmetic through to complex statistical analysis. For instance, the GROUP BY clause can be augmented with the ROLLUP, CUBE and GROUPING SETS clauses to create multi-dimensional reports; there is also a spreadsheet-like MODEL clause, amongst many other capabilities. In fact Oracle SQL can now function as a more or less complete OLAP analytical set of SQL access commands, even to the point of allowing for user-defined analytical functions and, as already mentioned, spreadsheet capabilities.
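
For example, a hypothetical ROLLUP producing per-vessel subtotals and a grand total in a single pass:

SELECT v.vessel_name, f.sail_date, SUM(f.weight_tons) AS tons
FROM   shipment_fact f, vessel v
WHERE  f.vessel_id = v.vessel_id
GROUP  BY ROLLUP (v.vessel_name, f.sail_date);
-- Returns tons per (vessel, date), subtotals per vessel, and a grand total row.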

Until next time!

Disclaimer Notice: This information is available “AS IS”. I am in no way responsible or liable for any mishaps as a result of using this information.
