April 28, 2000, Volume 3 - Number 7

Twenty Criteria for Comparing Systems

Rating Your Dimensional Data Warehouse

Ralph Kimball

Over the past two decades, data warehouses have evolved their own design techniques, distinct from transaction-processing systems. Dimensional design techniques have emerged as the dominant theme for most of our data warehouses. For some years we have had a fairly stable vocabulary that includes slowly changing dimensions, surrogate keys, aggregate navigation, and conformed dimensions and facts. Yet in spite of the growing awareness of this body of practice, we still don’t have good metrics for what makes a system more dimensional or less dimensional.

This column and the one in the next issue attempt to fill this gap. I’ll propose 20 criteria for what makes a system dimensional. Besides giving each of the 20 criteria a name, I will try very hard to define the criteria in decisive ways that let you decide whether your system complies. For any given system, I want you to assign a 0 (bad) or a 1 (good) to each criterion, and then add up the 0s and 1s. Your total system should then measure somewhere between a score of zero, representing a system completely unsupportive of a dimensional approach, and 20, representing a system as completely supportive as I can imagine.

Keep in mind that in most cases, no single vendor will attempt to meet all the criteria single-handedly. A data warehouse is always a complete system, spanning the back room, beginning with the interface to the production system of record, all the way to the front room and the end user’s keyboard and screen. Few vendors attempt to provide completely integrated solutions all by themselves. But conversely, a single vendor can ruin the system’s coherence by failing to support interface standards that make it possible to assemble a complete system. For this reason, I ask you to only score complete end-to-end systems, not a collection of disjointed packages that can never really work together.

I have divided the 20 criteria into three broad groups: architecture, administration, and expression. In most of the cases, it is fairly clear why a criterion belongs to a particular group. The architecture criteria are fundamental characteristics of the overall system that are not only “features” but are central to the whole way the system is organized. Architectural criteria usually extend from the back room, through the DBMS, all the way to the front room and the user’s desktop. Administration criteria are certainly more tactical than architectural criteria, but have been chosen to be “show stoppers” if they are missing from a dimensionally oriented data warehouse. Administration criteria generally affect IT personnel who are building and maintaining the data warehouse. Expression criteria are mostly analytic capabilities that are needed in real-life situations. The end-user community experiences all expression criteria directly.

This column describes the first 12 criteria comprising the architecture and administration groups. In the next issue of Intelligent Enterprise, I’ll discuss the last eight criteria in the expression group. But rather than keep you dangling, here are the names of all 20 criteria. (See Table 1.)

TABLE 1 The 20 criteria for comparing systems.

Architecture
Explicit declaration
Conformed dimensions and facts
Dimensional integrity
Open aggregate navigation
Dimensional symmetry
Dimensional scalability
Sparsity tolerance

Administration
Graceful modification
Dimensional replication
Changed dimension notification
Surrogate key administration
International consistency

Expression
Multiple-dimension hierarchies
Ragged-dimension hierarchies
Multiple valued dimensions
Slowly changing dimensions
Roles of a dimension
Hot-swappable dimensions
On-the-fly fact range dimensions
On-the-fly behavior dimensions

Now I’ll define the first 12 criteria as succinctly as possible.

Architecture

Explicit Declaration. The system provides explicit database declarations that distinguish a dimensional entity from a measurement (fact) entity. These declarations are stored in the system metadata. The declarations are visible to administrators and end users and affect query strategy, query performance, grouping logic, and physical storage. Facts can be declared as fully additive, semi-additive, and nonadditive. Default (automatic) aggregation techniques other than summation can be associated with facts. The default association between dimensions and facts is declared in the metadata so that the user can omit specifying the link between them. A dimension attribute included in a query is automatically the basis of a dynamic aggregation. A fact included in a query is by default summed within the context of all aggregations. Semi-additive facts and nonadditive facts are prohibited from being summed across the wrong dimensions.
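To make the additivity rules concrete, here is a minimal sketch of what such metadata declarations might look like. The fact names, the `FACT_METADATA` layout, and the `aggregation_for` helper are all hypothetical illustrations, not the interface of any real product.

```python
from enum import Enum


class Additivity(Enum):
    FULLY_ADDITIVE = "fully additive"
    SEMI_ADDITIVE = "semi-additive"
    NON_ADDITIVE = "nonadditive"


# Hypothetical metadata entries: fact name -> (additivity, default aggregation,
# dimensions across which the fact must not be summed).
FACT_METADATA = {
    "sales_dollars": (Additivity.FULLY_ADDITIVE, "sum", frozenset()),
    "account_balance": (Additivity.SEMI_ADDITIVE, "sum", frozenset({"time"})),
    "unit_price": (Additivity.NON_ADDITIVE, "average",
                   frozenset({"time", "product", "store"})),
}


def aggregation_for(fact, collapsed_dimensions):
    """Return the declared default aggregation for a fact.

    collapsed_dimensions are the dimensions the query aggregates across.
    Raises ValueError if the default would sum across a prohibited dimension.
    """
    additivity, default_agg, no_sum_dims = FACT_METADATA[fact]
    blocked = no_sum_dims & set(collapsed_dimensions)
    if default_agg == "sum" and blocked:
        raise ValueError(
            f"{fact} is {additivity.value}; it cannot be summed across "
            f"{sorted(blocked)}"
        )
    return default_agg
```

In this sketch a query that rolls `account_balance` up across stores gets the default summation, while one that tries to roll it up across time is rejected.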

Conformed Dimensions and Facts. The system uses conformed dimensions and facts to implement drill-across queries where answer sets from different databases, different locations, and possibly different technologies can be combined into a higher-level answer set by matching on the row headers supplied by the conformed dimensions. The system detects and warns against the attempted uses of unconformed facts. This is the most fundamental and profound architecture criterion. It is the basis for implementing distributed data warehouses, and especially Webhouses, consisting of far-flung organizations (with no center) sharing data over the Web.
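A drill-across query can be sketched as an outer merge of two answer sets on their shared row headers. The example below assumes both answer sets were already grouped by the same conformed attribute (a product brand); the data values and the `drill_across` function name are invented for illustration.

```python
def drill_across(answer_a, answer_b):
    """Merge two answer sets keyed by the same conformed row header.

    Each answer set maps a row header to one fact value. Headers missing
    from one side are kept and filled with None (a full outer merge).
    """
    merged = {}
    for header in sorted(set(answer_a) | set(answer_b)):
        merged[header] = (answer_a.get(header), answer_b.get(header))
    return merged


# Hypothetical answer sets from two separate data marts, both grouped by brand.
shipments_by_brand = {"Atlas": 120, "Zenith": 75}
returns_by_brand = {"Atlas": 8, "Orion": 3}

combined = drill_across(shipments_by_brand, returns_by_brand)
```

The merge is only meaningful because both sides draw the brand values from the same conformed dimension; if the brand labels differed between marts, the rows would fail to line up.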

Dimensional Integrity. The system guarantees that the dimensions and the facts maintain referential integrity. In particular, a fact may not exist unless it is in a valid framework of all its dimensions. However, a dimensional entry may exist without any corresponding facts.
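The asymmetry in this rule (facts need dimension entries, but not the reverse) can be shown with a small check. This is a sketch only: the dimension names, keys, and the `orphan_facts` helper are assumptions made for the example.

```python
def orphan_facts(fact_rows, dimensions):
    """Return the fact rows that reference a key missing from any dimension.

    dimensions maps a dimension name to the set of valid keys; each fact row
    is a dict holding one key per dimension.
    """
    orphans = []
    for row in fact_rows:
        if any(row[name] not in keys for name, keys in dimensions.items()):
            orphans.append(row)
    return orphans


# Hypothetical dimension keys. Product 3 has no facts, which is allowed.
dimensions = {"product_key": {1, 2, 3}, "store_key": {10, 11}}
fact_rows = [
    {"product_key": 1, "store_key": 10, "dollars": 5.0},
    {"product_key": 2, "store_key": 99, "dollars": 7.0},  # store 99 is undefined
]

violations = orphan_facts(fact_rows, dimensions)
```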

Open Aggregate Navigation. The system uses physically stored aggregates as a way to enhance performance of common queries. These aggregates, like indexes, are chosen silently by the database if they are physically present. End users and application developers do not need to know what aggregates are available at any point in time, and applications are not required to explicitly code the name of an aggregate. All query processes accessing the data, even those from different application vendors, realize the full benefit of aggregate navigation.
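One simple way to picture an aggregate navigator is a lookup that picks the smallest stored table able to answer the request. The sketch below assumes a hypothetical catalog listing, for each table, the attributes it can group by and its row count; real navigators are considerably more elaborate.

```python
# Hypothetical catalog: table name -> (attributes the table can group by, rows).
CATALOG = {
    "sales_base": (
        {"day", "month", "product", "category", "store", "region"},
        10_000_000,
    ),
    "sales_by_month_product": ({"month", "product", "category"}, 50_000),
    "sales_by_month_category": ({"month", "category"}, 2_000),
}


def choose_table(requested_attributes):
    """Pick the smallest table whose attributes cover the request."""
    requested = set(requested_attributes)
    candidates = [
        (rows, name)
        for name, (attributes, rows) in CATALOG.items()
        if requested <= attributes
    ]
    if not candidates:
        raise LookupError(f"no table can answer {sorted(requested)}")
    return min(candidates)[1]
```

The application only names the attributes it wants; it never mentions an aggregate table, so aggregates can be added or dropped without changing any query.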

Dimensional Symmetry. All dimensions allow comparison calculations that constrain two or more disjoint values of a single attribute from a dimension in computations such as ratios or differences. Also, the underlying database engine supports an indexing scheme that allows a single indexing strategy to efficiently support query constraints on an arbitrary and unpredictable subset of the dimensions in a highly dimensional database.
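The first half of this criterion can be illustrated with a comparison function that treats every dimension the same way. The rows, attribute names, and the `compare_attribute_values` helper below are hypothetical; the point is only that the same call works whether the compared attribute comes from the store dimension or the product dimension.

```python
def compare_attribute_values(rows, attribute, value_a, value_b, measure):
    """Compare a measure for two disjoint values of one dimension attribute.

    Returns the difference and the ratio of the two totals. The ratio
    assumes the total for value_b is nonzero.
    """
    total_a = sum(r[measure] for r in rows if r[attribute] == value_a)
    total_b = sum(r[measure] for r in rows if r[attribute] == value_b)
    return {"difference": total_a - total_b, "ratio": total_a / total_b}


# Hypothetical denormalized query result.
rows = [
    {"region": "East", "brand": "Acme", "dollars": 300},
    {"region": "West", "brand": "Acme", "dollars": 100},
    {"region": "East", "brand": "Bolt", "dollars": 200},
    {"region": "West", "brand": "Bolt", "dollars": 150},
]

by_region = compare_attribute_values(rows, "region", "East", "West", "dollars")
by_brand = compare_attribute_values(rows, "brand", "Acme", "Bolt", "dollars")
```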

Dimensional Scalability. The system places no fundamental constraints on either the number of members or the number of attributes within a single dimension. Dimensions with 100 million members or 1,000 textual attributes are practical. Dimensions with a billion members are possible.

Sparsity Tolerance. Any single measurement can exist within a space of many dimensions, which can be viewed as extraordinarily sparse. The system imposes no practical limit on the degree of sparsity. A 20-dimensional database, each of whose dimensions has a million or more members, is practical.
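A quick calculation shows why sparsity tolerance matters at this scale. The sketch below counts the possible cells in the 20-dimensional example and stores only the measurements that actually occur; the dictionary-based store is an assumption for illustration, not a claim about how any engine works.

```python
DIMENSIONS = 20
MEMBERS_PER_DIMENSION = 1_000_000

# Every combination of members is a possible cell: (10**6)**20 = 10**120.
possible_cells = MEMBERS_PER_DIMENSION ** DIMENSIONS

# Sparse storage: only measurements that exist take up space,
# each keyed by its tuple of 20 dimension keys.
facts = {}


def record_measurement(keys, value):
    """Store one measurement under its full set of dimension keys."""
    if len(keys) != DIMENSIONS:
        raise ValueError(f"expected {DIMENSIONS} dimension keys")
    facts[tuple(keys)] = value


record_measurement([17] * DIMENSIONS, 42.0)
record_measurement(list(range(DIMENSIONS)), 7.5)
```

Two stored rows out of 10**120 possible cells is an extreme case, but it shows that storage should grow with the number of measurements and not with the size of the dimensional space.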

Administration

Graceful Modification. The system must allow the following modifications to be made in place without dropping or reloading the primary database:

a) adding an attribute to a dimension;

b) adding a new kind of fact to a measurement set, possibly beginning at a specific point in time;

c) adding a whole new dimension to a set of existing measurements; and

d) splitting an existing dimension into two or more new dimensions.
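Modifications (a) through (c) can be sketched with an in-memory SQLite database, where each change is made in place with `ALTER TABLE` and the existing rows are never reloaded. The table names, column names, and the "Not applicable" promotion row are hypothetical; case (d) is left out because splitting a dimension takes more steps than a short example can show.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales_fact (product_key INTEGER, dollars REAL);
    INSERT INTO product_dim VALUES (1, 'Widget');
    INSERT INTO sales_fact VALUES (1, 19.99);
""")

# (a) Add an attribute to a dimension; existing rows read the default.
conn.execute("ALTER TABLE product_dim ADD COLUMN brand TEXT DEFAULT 'Unknown'")

# (b) Add a new kind of fact; rows loaded before this point stay NULL.
conn.execute("ALTER TABLE sales_fact ADD COLUMN units INTEGER")

# (c) Add a whole new dimension; old facts point at a 'Not applicable' row.
conn.execute(
    "CREATE TABLE promotion_dim (promotion_key INTEGER PRIMARY KEY, name TEXT)"
)
conn.execute("INSERT INTO promotion_dim VALUES (0, 'Not applicable')")
conn.execute("ALTER TABLE sales_fact ADD COLUMN promotion_key INTEGER DEFAULT 0")

row = conn.execute(
    """
    SELECT f.dollars, f.units, p.brand, pr.name
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_key = f.product_key
    JOIN promotion_dim AS pr ON pr.promotion_key = f.promotion_key
    """
).fetchone()
```

The original fact row survives all three changes and still joins cleanly to every dimension, which is the behavior this criterion asks for.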

Dimensional Replication. The system supports the explicit replication of a conformed dimension outward from a dimension authority to all the client data marts, in such a way that we can only perform drill-across queries on data marts if they have consistent versions of the dimensions. Aggregates that are affected by changes to the content of a dimension are automatically taken offline in each client data mart until we can make them consistent with the revised dimension and the base fact table.
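The consistency rule can be sketched as a version check performed before any drill-across query is allowed. The mart names, version numbers, and the `can_drill_across` helper are invented for this example; a real dimension authority would track versions in its own metadata.

```python
# Hypothetical record of which version of each conformed dimension
# every client data mart currently holds.
mart_versions = {
    "sales_mart": {"product": 7, "customer": 3},
    "returns_mart": {"product": 7, "customer": 3},
    "inventory_mart": {"product": 6},
}


def can_drill_across(mart_a, mart_b):
    """Allow drill-across only if all shared dimensions are at equal versions."""
    versions_a = mart_versions[mart_a]
    versions_b = mart_versions[mart_b]
    shared = set(versions_a) & set(versions_b)
    return bool(shared) and all(versions_a[d] == versions_b[d] for d in shared)
```

Here `inventory_mart` has not yet received version 7 of the product dimension, so it is excluded from drill-across queries with the other two marts until replication catches up.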

Changed Dimension Notification. The system delivers upon request all the records from a production source of a dimension that have changed since the last such request. In addition, a reason code is supplied with this dimension notification that allows the data warehouse to distinguish between Type 1 and Type 3 slowly changing dimensions (overwrites) and Type 2 slowly changing dimensions (true physical changes at a point in time).
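A notification feed of this kind might look like the following sketch. The `ChangeRecord` fields, the two reason codes, and the integer timestamps are assumptions chosen for the example; the column does not prescribe a specific format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRecord:
    natural_key: str
    attribute: str
    new_value: str
    changed_at: int  # hypothetical logical timestamp
    reason: str      # hypothetical codes: "correction" or "true_change"


def changes_since(log, last_request):
    """Return every change recorded after the previous request."""
    return [c for c in log if c.changed_at > last_request]


def handling_for(change):
    """Map the reason code to the slowly changing dimension treatment."""
    if change.reason == "true_change":
        return "type 2"    # a real change at a point in time: new dimension row
    return "overwrite"     # Type 1 or Type 3 handling of the existing row


log = [
    ChangeRecord("CUST-1", "name", "Jon Smith", 5, "correction"),
    ChangeRecord("CUST-2", "city", "Denver", 9, "true_change"),
    ChangeRecord("CUST-3", "city", "Boise", 12, "true_change"),
]

pending = changes_since(log, last_request=8)
```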

Surrogate Key Administration. The system implements a surrogate key pipeline process for: a) assigning new keys when the system encounters a Type 2 slowly changing dimension; and b) replacing the natural keys in a fact table record with the correct surrogate keys before loading into the fact table. In other words, the cardinality of a dimension can be made independent from the definition of the original production key. Surrogate keys, by definition, must have no semantics or ordering that makes their individual values relevant to an application. Surrogate keys must support not-applicable, nonexistent, and corrupted measurement data. A surrogate key may not be visible to an end-user application.
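The two pipeline steps can be sketched in a few lines. The class name, the `product_natural_key` field, and the convention of reserving key 0 for missing or corrupted natural keys are all assumptions for this example.

```python
import itertools

UNKNOWN_KEY = 0  # hypothetical reserved row for missing or corrupted natural keys


class SurrogateKeyPipeline:
    def __init__(self):
        self._counter = itertools.count(1)
        self.current = {}  # natural key -> current surrogate key

    def assign(self, natural_key):
        """Assign a fresh surrogate key for a new member or a Type 2 change."""
        key = next(self._counter)
        self.current[natural_key] = key
        return key

    def load_fact(self, fact_row):
        """Replace the natural key in a fact row with the current surrogate key."""
        row = dict(fact_row)
        natural = row.pop("product_natural_key", None)
        row["product_key"] = self.current.get(natural, UNKNOWN_KEY)
        return row


pipeline = SurrogateKeyPipeline()
pipeline.assign("SKU-1")                      # first version of the product
before = pipeline.load_fact({"product_natural_key": "SKU-1", "dollars": 5})
pipeline.assign("SKU-1")                      # Type 2 change: same natural key
after = pipeline.load_fact({"product_natural_key": "SKU-1", "dollars": 6})
unknown = pipeline.load_fact({"product_natural_key": "???", "dollars": 1})
```

One natural key ends up with two surrogate keys, which is how the dimension's cardinality becomes independent of the production key. Apart from the reserved row, the key values are only sequence numbers and carry no meaning for applications.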

International Consistency. The system supports the administration of international language versions of dimensions by guaranteeing that a translated dimension possesses the same grouping cardinality as the original dimension. The system supports the Unicode character set, as well as all common international numerical punctuation and formatting alternatives. Incompatible, language-specific collating sequences are allowed.
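The grouping-cardinality guarantee amounts to requiring that translation be one-to-one on an attribute's distinct values. A minimal check, using an invented helper name and sample values, might look like this:

```python
def same_grouping_cardinality(original_values, translated_values):
    """Check that a translated attribute groups rows exactly like the original.

    The two lists hold the attribute value of each dimension row, in the same
    row order. The translation passes only if distinct originals map
    one-to-one onto distinct translations.
    """
    if len(original_values) != len(translated_values):
        return False
    pairs = set(zip(original_values, translated_values))
    return len(pairs) == len(set(original_values)) == len(set(translated_values))


consistent = same_grouping_cardinality(
    ["Red", "Red", "Blue"], ["Rouge", "Rouge", "Bleu"]
)
# Two distinct originals collapse into one translated value, so a report
# grouped on the translated attribute would show fewer rows.
collapsed = same_grouping_cardinality(
    ["Pants", "Trousers", "Shirt"], ["Pantalon", "Pantalon", "Chemise"]
)
```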

You have now read the first 12 criteria and can begin judging your system. These criteria are deliberately tough. I know of no data warehouse to which I would give a 12 out of 12 rating, much less a 20 out of 20. But that’s the value of a tough rating system. We have room to improve.

In the next issue I will move into the front room and describe the eight expression criteria that complete our picture of a dimensional data warehouse.

Ralph Kimball co-invented the Star workstation at Xerox and founded Red Brick Systems. He has the three best-selling data warehousing books in print, including the newly released The Data Webhouse Toolkit (Wiley, 2000). Ralph teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach Ralph through his Web site at