Understanding

Data Warehouse Management

A Primer

John Deitz

For Viasoft, Inc.

Status: 02September1998


Preface

In this primer, I describe a business management practice called “data warehousing” and many of the management issues that contribute to making data warehousing successful. This is a fairly non-technical coverage of the subject, which should be useful to Sales, Services, Development and Practices personnel who work with Viasoft’s solutions.

In one respect, the advent and evolution of data warehousing marks a return to basic values for running a successful business. IT organizations have long been caught up in the technology and application systems needed to automate the business. Although these applications are called “information systems”, it is closer to the truth that they are merely “business transaction” systems -- generating little real “information” that could be useful for analyzing the business.

Savvy business managers, who rise above the whir and confusion of the IT technology engine, recognize the need to accurately define the key concepts and processes at play in their business, and to begin measuring business operations against them. It takes little effort to see that the reports produced, and raw data processed, while running the business offer virtually no insights into:

· Customer segmentation, profiling and retention,

· Market basket or cross-product linkage analysis,

· Churn and trend Analysis,

· Fraud and abuse analysis, or

· Forecasting and risk analysis.

These are not the “raw” elements of data found in highly normalized transaction system database schema; rather, they are aggregations and synthesis of data into “information” … and ideally into useful knowledge.

Data warehousing is the process of creating useful information that can be used in measuring and managing the business.


Table of Contents

1. What’s a Data Warehouse or Data Mart? 6

1.1 How are Data Warehouses Produced? 6

1.2 What is the Purpose of the Data Warehouse? 7

Measurement 7

Discovery 8

Trend Analysis 8

2. How do Data Warehouse Databases Differ from Production Ones? 9

2.1 Decision Support versus Transactional Systems 9

2.2 Data Quality and Understanding 9

2.3 General Data Usage Patterns 10

Time Series Data 11

Summarized or Aggregated Data 11

2.4 Separate DSS and OLTP Systems 11

2.5 DSS-specific Usage Patterns 12

2.6 Operational Data Stores (ODS) 13

3. Data Warehouse Architectures 14

3.1 Marts and Warehouses 14

What do you mean by “Architecture” ? 15

3.2 Physical Warehouse Architecture 15

Centralized 15

Independent Data Mart 16

Dependent Data Marts, with Distribution Data Warehouse 17

Operational Data Stores - with Marts and Data Warehouse 17

Virtual Data Warehouse 18

“Hub-and-Spoke” Architecture 19

3.3 Warehouse Framework Architecture 20

Production or External Database Layer 21

Data Access Layer 21

Transformation Layer 21

Data Staging Layer 21

Data Warehouse Layer 21

Information Access Layer 21

Application Messaging Layer 21

Meta Data Directory Layer 22

Process Management Layer 22

4. What is Data Mart/Warehouse Management? 23

4.1 Managing Data Administration: Defining “Information” 23

Terms and Semantics Directory (Glossary) 23

Value Domains and Abbreviations 24

Naming Conventions 24

Data Type and Format Standards 24

Automated Data Quality Measurement 25

Information Owners 25

Managing IT-related Meta Data 25


Table of Contents (continued)

4.2 Managing Business Segment Analysis & Mart Design 26

Exploration 26

Schema Design 26

Locating Data Sources 27

Prototyping 27

4.3 Managing Database Administration 28

Managing Production Data Store Schema 28

Managing Data Warehouse, Mart and ODS Schema 28

Applying Data Types, Naming Conventions and Format Descriptors 28

4.4 Managing Data Valid Value Domains 29

4.5 Managing Data Extraction 29

Identifying and Qualifying Candidate Stores to Draw From 29

Managing Queries and Data Extraction 29

4.6 Managing Data Transformation 30

Data Conjunction 30

Data Aggregation 30

Data Cleansing 31

Data Dimensioning 31

4.7 Managing Data Traceability 31

4.8 Managing Schedules and Dependencies 32

4.9 Managing Data Quality and Usage 33

Use-based Data Quality Audits 34

Data Quality Re-Design 34

Data Quality Training 34

Data Quality Continuous Improvement 34

Data Quality S.W.A.T. Tactics 34

Data Cleansing Management 35

4.10 Managing Data Warehouse Architectures 35

4.11 Managing Operations, Facilities and End User Tools 36

4.12 Managing Basic Database Operations 36

4.13 Managing Rules, Policies and Notifications 36

4.14 Managing Meta Data Definition, Population and Currency 37

4.15 Managing History, Accountability and Audits 38

4.16 Managing Internal Access, External Access and Security 38

4.17 Managing Systems Integration Aspects 39

4.18 Managing Usage, Growth, Costs, Charge-backs and ROI 39

Usage 39

Growth 40

Cost Management 40

Return on Investment 40

4.19 Managing Change – the Evolving Warehouse 40


Table of Contents (continued)

5. Future of Data Warehousing 42

Objectives 42

Business Factors 42

Technology Factors 43

Knowledge Engineering 43

More Effective Infrastructure & Automation 43

More Effective Communication 44

6. Summary 45

Conclusion 45


1. What’s a Data Warehouse or Data Mart?

A “data warehouse” is a database, or group of databases, where business people can access business-related data; a data warehouse may aggregate business information from several organizations, and/or may serve as a central repository for standards, definitions, value domains, business models, and so on.

Bill Inmon, recognized as the father of the data warehouse concept, defines a data warehouse as “a subject-oriented, integrated, time variant, non-volatile collection of data in support of management’s decision-making process”. Richard Hackathorn, another data warehouse pioneer, says the “goal of data warehouse is to provide a single image of business reality”. Both of these definitions have merit, as this primer will illustrate.

A “data mart” is a (usually smaller) specialized data warehouse that covers one subject area, such as finance or marketing, and may serve only one distinct group of users. Since a data mart focuses on meeting specific user business needs, it can capture the commitment and excitement of users. Also, the limited scope makes it easier for IT to grasp the business requirements and work with the users effectively. Most successful data warehousing efforts begin through the successes of one or more marts.

A data mart or warehouse is a collection of “clean data”. This data usually ultimately originates from the production transaction systems that automate the business. But this data is not the same “raw” data processed by transaction systems, and more specifically it is not in the same transaction-optimal form; rather it is very selective data, organized into specific “dimensions”, and optimized for access by decision support systems (DDS), sometimes also called executive information systems (EIS).

These are the simple definitions for data warehouse and mart. The remainder of this paper will illustrate that the process of data warehousing is far from simple. Recent history provides many examples of unsuccessful warehouse implementations costing millions of dollars.

A number of things - warehouse/mart architecture, data understanding, data access, data movement, merging of data from heterogeneous sources, data cleansing and transformation, mart database design, and business user access/query tools - all have to come together successfully for a data warehouse to succeed. In short, data warehousing cannot be done successfully without significant management measures in place.

For the remainder of this paper, I will use the term “data warehouse” to refer equally to marts or warehouses – except in discussions of architecture where I draw distinctions between the two.

1.1 How are Data Warehouses Produced?

Data warehouses are usually populated with data originating from the production business systems - in particular, from the databases underlying those systems. That means existing data must be extracted from its source, cleansed, transformed (modified, aggregated and dimensioned), staged, distributed to remote sites (perhaps) and finally loaded into the data stores comprising the data warehouse or marts. The following figure illustrates this flow:

Figure 1a: Steps in Building a Data Warehouse

Actually, this figure grossly simplifies the effort that goes onto designing, creating and managing a data warehouse. Depending on the nature and quality of existing production systems, and the needs of business users querying the data warehouse, each step may require significant research, planning and coordination.

1.2 What is the Purpose of the Data Warehouse?

Basically, the purpose of a data warehouse is to put accurate information about the business into the hands of competent analysts or decision-makers. This information must present very crisp and unambiguous facts about business subjects (such as customer, product, brand, contract, representative, region, and so on) and provide useful perspectives on the current operations of the business. Here I look at just a couple roles of the warehouse.

Measurement

The analysts use the data warehouse to measure the performance of the business; these measurements are often computed from millions of detailed records of data, or sometimes from aggregations (summaries) of the data. When business performance is weak or unexpected, the analysts “drill down” to deeper levels of detail to find out why; this can mean looking at the data (i.e. querying) in a new and different way – perhaps on a one-time basis.

Here’s an example: a well-designed mart might allow an analyst to measure the effectiveness of the company’s investment in inventory. Called GMROI, or Gross Margin Return on Inventory, this measurement is computed as follows:

(Total Quantity Shipped) * (Value at Latest Selling Price – Value at Cost)

GMROI =

(Daily Average Quantity on Hand) * (Value at Latest Selling Price)

Obviously, this measurement could not be made without the right, accurate detailed data organized in a very particular pattern, and the ready ability to query the data in a number of interesting ways. Most successful data warehouses are based on a “star schema” database design that minimizes the amount of data to be stored, while optimizing the potential of query performance. (I will discuss this further later on, because not all star schema designs have the same performance potential.)

Discovery

Analysts also use data warehouse data for a relatively new form of analysis called “data mining”. Data mining involves deep analysis of potentially huge volumes of data to discover patterns and trends that no one would think to look for. Certain business events or trends can occur due to the convergence of highly unrelated facts; data mining makes it possible to see new facts, and how they are correlated. Data mining is “new” because only recently have the database and query technologies matured enough to support it with acceptable performance.

Trend Analysis

The data in a data warehouse is usually organized into “dimensions”. For example, a data mart designed to store information about stores selling products might be organized into the following dimensions: Store, Product, and Time. Time is a very common and important dimension, because it allows the measurements taken today to be compared with the same measurements taken last week, last month or last year. This is what Bill Inmon referred to as “time variant” in the quote I used earlier to define data warehouses. When a number of time-phased measurements are considered in succession, we can do trend analysis.

One significance of trend analysis is to better understand the past. Trend analysis is very effective in highlighting a declining market or shift in sales from one brand to another. Perhaps a more powerful use of trend analysis is to project the future; this is done by extending a trend into the future through the application of various statistical models.

It should be easy to see how the notions of Measurement, Discovery and Trend Analysis contribute (in very meaningful ways) to assessing how effectively a business is operating, and where management actions should be focused.


2. How do Data Warehouse Databases Differ from Production Ones?

What’s the difference between a data warehouse database, and all the other production databases that the business has? Why not just use the production databases as the basis for data warehouse queries?

These are good questions - and important ones – as we explore the success criteria for data warehousing initiatives. In this section, I’ll discuss the differences and rationales.

2.1 Decision Support versus Transactional Systems

Decision Support systems and Executive Information systems have different usage and performance characteristics from production systems. Production systems such as order entry, general ledger or materials control generally access and update the record of a single business object or event: one order, one account, or one lot. Transactions are generally pre-defined, and require the database to provide very fast access to, and locking for, one record at a time.

In fact, the requirement that production databases be “transaction-optimal” heavily influences the schema design of these databases. These database schemas are often highly normalized to maximize the opportunity for concurrent access to database data; that is, they are designed so that the smallest amount of data will be locked (i.e. unavailable to others) at any one time. In this way, transactions can have pinpoint accuracy and be highly efficient.

In contrast, databases supporting DSS must be able to retrieve large sets of aggregate or historical data within a reasonable response time. The data must be well-defined, clean, and organized into specific dimensions that support business analysis.

2.2 Data Quality and Understanding

The data records processed by production systems are usually concatenations of the master records of the key databases with contextual information. IT systems that have evolved over the years have been tuned to cater to the data anomalies found in this data; that is, they “correct” anomalies “on the fly”. Such anomalies include missing or inaccurate codes, discrepancies between order header and detail records, and garbage found in fields due to electronic transmission errors. The programmers of these systems are usually so far removed from the data entry points to the system, that it is easier (and more convenient) to adjust values during processing than to correct the source of the data.

It is common for data in production stores to get “tainted” - becoming application specific. For example, a primary datastore of customer information may be pre-filtered to contain only “active accounts”, while to the casual observer (outside IT) it may appear to encompass all accounts. Also, programmers and early database designers have traditionally been lazy about naming standards, assuming that only a small technical audience would see the names they contrived for data elements. Hence, an “outsider” cannot trust the meaning of terms such as customer, order, product, and so on – such terms were often used loosely in the past.

These subtleties and miscellaneous filters make production stores the wrong, or at least incomplete, sources of information for most decision-making purposes.

2.3 General Data Usage Patterns

As noted above, transaction systems are tuned for item-at-a-time processing, and more importantly, update processing. These systems are called on-line transaction processing (OLTP) systems. The intent of such systems to update, and to minimize the scope of database locks, places significant constraints on how the data is laid out (schema design) and how it is accessed.

By contrast, decision support systems operate against schemas that facilitate querying [perhaps very granular] information in a myriad of ways; this is a read-only mode of access. The tools used for these systems are called on-line analysis processing (OLAP) tools. A particular pattern of schema design, called “star schemas”, has proven very powerful in decision support systems. Star schemas can be used to design dimensional data stores. The term “star” was coined because the schema configuration consists of a core fact table which has relationships (foreign keys, in relational OLAP systems) to a number of dimension tables.