Star Schema and Atomic Schema

Background of Design Approaches

Original author: Mark Beyer - September 14, 1999

Adapted by: Daphne Png - April 2001

Background and Understanding

During the course of the past eight years of data warehousing practice, two major schools of theory have developed regarding the underlying data architecture for a data warehouse—the star schema and the atomic data warehouse. The data warehouse market has not declared either approach a clear winner in application theory because each has its own set of advantages and disadvantages.

Ascential Software’s consulting approach (as specified in e-Iterations) is to develop the atomic data warehouse level first, and then to populate departmental data warehouses (or data marts) as star schemas fed from that atomic level. This approach minimizes risk for future development and takes the greatest advantage of both types of architecture.

Under a star schema, a central fact table holds data at the lowest level of granularity that users require for analysis. Granularity, in this context, is the frequency at which new records are written. The spokes or “points” of the star are referred to as dimensions, which are ways to group the facts. For example, if the fact of a star schema is credit card transactions, then one dimension might be customer: How many transactions does one customer have in a month? What is the average transaction amount for each customer? For each customer in a given month? How many retailers does each customer visit in an average month?
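As a minimal sketch of this structure (all table and column names here are hypothetical, invented for illustration), the credit card example might look like the following in Python with SQLite:

```python
import sqlite3

# Minimal star schema sketch for the credit card example above.
# All table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: ways to group the facts.
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT,
    city          TEXT
);
CREATE TABLE dim_retailer (
    retailer_key  INTEGER PRIMARY KEY,
    retailer_name TEXT
);

-- Fact table at the lowest grain users require: one row per transaction.
CREATE TABLE fact_cc_transaction (
    transaction_id   INTEGER PRIMARY KEY,
    customer_key     INTEGER REFERENCES dim_customer (customer_key),
    retailer_key     INTEGER REFERENCES dim_retailer (retailer_key),
    transaction_date TEXT,
    amount           REAL
);
""")

# The questions listed above become simple joins, for example:
# transaction count and average amount per customer per month.
monthly = conn.execute("""
    SELECT c.customer_name,
           strftime('%Y-%m', f.transaction_date) AS month,
           COUNT(*)      AS txn_count,
           AVG(f.amount) AS avg_amount
    FROM fact_cc_transaction f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    GROUP BY c.customer_name, month
""").fetchall()
```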

In creating a star schema, it is generally very easy to identify the fact table at a “lowest level of grain” for current needs. Further, fact tables at the lowest level of grain are usually close to the structure that might be used in an atomic data warehouse design. Compatibility issues with star schemas usually arise in the dimensional nature of the star.

The fact table is designed to hold most of the change, while dimensions are expected to be more constant over time. In a typical star schema, therefore, fact tables are very large and dimension tables are very small by comparison.

One of the primary purposes of creating a star is to make it easy for users to develop their queries. Dimension tables need to remain as small as possible in a star schema; otherwise they begin to fail in this objective. Unfortunately, providing analysts with data in a compact dimensional model either forces dimensions to grow larger and more complex, or causes an “explosion” of fact records.

If an analyst requires enough detail to see each change to a customer’s age by month (for example, because auto insurance policies are sold in six-month terms), then a new record would be written each month. If marital status were also important, a record would still be written each month, and another record would be written if the customer married or divorced mid-month, producing three records for that month: one at month beginning, one at month end, and one at the marital status change.
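A back-of-the-envelope sketch of the resulting growth (the customer count and change rates below are invented purely for illustration):

```python
# Hypothetical illustration of the record growth described above:
# one dimension record per customer per month, plus one record per
# mid-month change (marriage, divorce, name change, and so on).
customers = 1_000_000
months_per_year = 12
mid_month_changes_per_customer_per_year = 2  # assumed average

monthly_records = customers * months_per_year          # 12,000,000
change_records = customers * mid_month_changes_per_customer_per_year
print(monthly_records + change_records)                # 14,000,000 rows/year
```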

Further, if users also need to know when a customer changes their name, this too produces a record. If marketing uses the customer dimension, it is possible that no further record types will be necessary to meet marketing’s granularity needs for customer data. Multiple star schema theory holds that when extending the data model of one star for use by another, we can identify the primary dimensions in the data warehouse and connect the stars through these dimensions.

As soon as billing is introduced, new requirements emerge. Users now need customer records for changes of address, and the warehouse must write new records when the responsible billing party changes. Whether the customer data is captured in a central transaction table or in the customer dimension, the warehouse suddenly holds a huge number of long records, in either the fact table or the dimension table. There is no alternative. At this point we are forced to recognize that a dimension table is also a fact table for some other view of the data.

If the size of the data tables were the only consideration, there would be very little issue. Data warehouse fact tables are large, and significant DASD is required as history accumulates; but in the grand scheme of things, DASD is far less expensive than in the previous computing era. However, another far more consequential issue arises: only part of the customer dimension table applies to each fact table. The customer dimension now needs special keys, codes, or other data columns to indicate the level of granularity so it can be used across different sets of facts.

Given that the needs of end-user groups are not identical, the dimension tables are better understood as partial data sets. One design is established for the first set of users, then additional data elements (attributes) are added as new user communities enter the arena. However, some user groups require a different grain of data.

Eventually, instead of one table, the dimension is effectively keyed to different end users by sub-table or domain keys. A more accurate representation of a mature dimension table shows that differing levels of granularity hold some data elements in common; this forces keys to be created to link together rows that actually have natural referential integrity. In other words, it should have been modeled this way in the first place.
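A sketch of what such a dimension tends to become (column names are hypothetical): a domain key marks the grain each row serves, and every query must filter on it to avoid mixing grains.

```python
import sqlite3

# Hypothetical sketch of a customer dimension that has accreted rows at
# several grains, distinguishable only by a domain (sub-table) key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key   INTEGER PRIMARY KEY,
    customer_id    INTEGER,  -- natural key shared across all grains
    domain_key     TEXT,     -- 'POLICY', 'MARKETING', or 'BILLING':
                             -- which user community / grain this row serves
    marital_status TEXT,     -- populated only at the policy grain
    mailing_name   TEXT,     -- populated only at the marketing grain
    bill_to_party  TEXT,     -- populated only at the billing grain
    effective_date TEXT
);
""")

# Every consumer of the dimension must know to filter on domain_key --
# the "special keys, codes or other data columns" described earlier.
billing_rows = conn.execute(
    "SELECT * FROM dim_customer WHERE domain_key = 'BILLING'"
).fetchall()
```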


Extending the issue to multiple star schemas that attempt to join with this type of “variegated” dimension only compounds the analytical problems. When data at differing levels of granularity overlaps in a dimension, what really exists is a related set of atomic fact tables, each with its own granularity, that are merely serving as a dimension.

This invites the argument that information systems are denormalized and that this should simply be accepted. The problem is that data should be denormalized for users’ consumption, but must remain normalized underneath to maintain one version of the truth.


Various resolutions to this issue have been attempted. In some cases, the customer table is split into two tables in separate databases, with referential integrity between them controlled by keys. In other cases, user requirements indicate that the two levels of grain are not currently required, so they are neither modeled nor populated, postponing extensibility of the database to a later date.

The disadvantages of the star schema include:

·  A geometric increase in the number of records always occurs in slowly changing dimensions used across multiple star schemas. The most common slowly changing dimensions are customer, employee, corporate organization, and product/item.

·  Different user requirements of granularity result in separate tables for what should be the same data in a relational model. This is sometimes resolved (hidden?) by placing the data in the same table, but then logically breaking it into multiple tables with domain keys.

·  When the application of a dimension changes drastically, the star schema requires the introduction of new “mini” dimensions, new dimension tables, and potentially the scrapping of the old table.

·  Multiple versions of one truth exist because it is possible to roll up dimensions differently from one star database to another to reach the user-required grain of change in the records.

·  Maintaining the star schema model as the sole ‘base’ of data focuses all pressure for change, across every data requirement, on this one model. This constantly impacts the efficiency and maintainability of source data extraction and transformation processing designs.

Unfortunately, while the star schema causes “skewing” of the data, it is the most usable data model for unsophisticated users. Therefore, a star schema is the best end-user data model.

The atomic data warehouse resolves the issues of the star schema approach but raises some of its own.

There is minimal consideration as to whether a table will be used as a fact or a dimension when the atomic model is being developed; it is not a primary concern. It is best to consider all tables at the atomic level as fact tables. The atomic data warehouse model is first and foremost a relational model of the industry in which you do business. All human resource departments have employees, benefits, administered health plans, managers, departments, and so on. The way these entities relate to each other over time is the truly mutable fact; the way the facts of your industry relate to each other is how you do business in your industry.

Various methodologies exist in the marketplace regarding the analysis of subject areas of data and business process models. Most derive from the following principles:

·  A certain set of tasks must be accomplished to deliver any business.

·  Tasks can be grouped into areas for the purpose of creating job codes and job descriptions.

·  Jobs all fall within a functional hierarchy in which similar jobs done by similarly qualified staff persons can be managed for cost and revenue realization and accounting.

·  A business practice may consist of several functional areas, but these groupings are extremely flexible and often fluid, because the grouping of jobs and tasks depends upon how an organization reports revenue and cost.

In simple terms, if the tasks required to deliver a specific business in the marketplace can be identified, it is up to the organization to determine the method of managing and delivering that business. The tasks in a business do not change based on management philosophy. A lending institution always needs to take applications, assess risk to revenue, deliver funds, collect payments, and pay for the infrastructure to do all of these things.

Information and knowledge are defined by how tasks are reviewed to gain insight into delivering a given business better. Operational data equates to “where are things now and how do we move them on?” Informational data asks, “how can I put things together differently?” It is best to isolate data relationships in an information architecture rather than embedding them into the fact tables, because forcing relationships into tables assumes permanence, which in turn forces facts into the role of dimension tables.
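One way to read that last point, sketched below with hypothetical names: the employee-to-department relationship is held in its own effective-dated table rather than embedded as a column of either entity, so the relationship can change over time without forcing either table into the role of a dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- At the atomic level, every table is treated as a fact table.
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    name        TEXT
);
CREATE TABLE department (
    department_id INTEGER PRIMARY KEY,
    name          TEXT
);

-- The relationship is isolated in its own table and carries effective
-- dates, because how entities relate over time is the mutable fact.
CREATE TABLE employee_department (
    employee_id    INTEGER REFERENCES employee (employee_id),
    department_id  INTEGER REFERENCES department (department_id),
    effective_from TEXT,
    effective_to   TEXT
);
""")
```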

In an atomic data warehouse, data is modeled to take into account both the source system and the end-user requirements. For example, the target model may require specific data from the source based on requirements. Half of the required data may be located in one source file or table, and the other half may be scattered across five source files or tables. On inspection, the five tables are truly operational in nature (five screens, input at different intervals by data entry clerks, feed the tables), but they all describe the same referential level. In reality, this data could be combined into one table, because it is incremental data entered by different staff persons that forms part of the same record. Our users indicate that they want just one record in their model, because only the complete record is useful and partial data is of no use.

So instead of one dimension table in a star schema, with records written under both change circumstances, the atomic data warehouse writes a record to only one of the two targets, and only when the content data experiences a complete change. Fewer records are written.
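A sketch of the load logic implied here (the function, table, and column names are hypothetical): incoming source rows are consolidated into one candidate record, compared against the current target row, and written only when the complete record has actually changed.

```python
def load_if_changed(conn, table, key_cols, data_cols, incoming):
    """Insert a row into `table` only when the consolidated incoming
    record differs from the current row for the same natural key.
    Hypothetical sketch; assumes the target table has a load_ts column."""
    key = tuple(incoming[c] for c in key_cols)
    where = " AND ".join(f"{c} = ?" for c in key_cols)
    current = conn.execute(
        f"SELECT {', '.join(data_cols)} FROM {table} "
        f"WHERE {where} ORDER BY load_ts DESC LIMIT 1",
        key,
    ).fetchone()

    new_values = tuple(incoming[c] for c in data_cols)
    if current is not None and tuple(current) == new_values:
        return False  # complete record unchanged: write nothing

    cols = list(key_cols) + list(data_cols)
    conn.execute(
        f"INSERT INTO {table} ({', '.join(cols)}, load_ts) "
        f"VALUES ({', '.join('?' for _ in cols)}, datetime('now'))",
        key + new_values,
    )
    return True  # a complete change occurred, so one record was written
```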

Also, in the atomic data warehouse model, no special keys will ever emerge to reconcile two “versions of the truth”. There is only one version, and users determine how much of it they want to know.

The issue here is that users generally cannot navigate the atomic data warehouse model. Power users can access it, but intermittent and casual users will move increasingly away from any use of this data model.

The disadvantages of the atomic data warehouse model are:

·  Time to market can be impacted if a project attempts to model the entire enterprise, and especially if the project attempts to populate the entirety of such a model.

·  An atomic data warehouse model is not user-friendly. In fact, an atomic model forces an assumption that all users are power users.

The advantages of the atomic data warehouse model are that dimensions can be created from existing tables as needs arise, and that users need not specify every application of a dimension at the outset, before the database is designed and populated. More importantly, users do not need complete knowledge of their requirements before they even review the data.
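For instance (a hypothetical sketch), a dimension for a new user community can be derived as a view over existing atomic tables, with no redesign and no repopulation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Existing atomic tables (hypothetical names).
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer_address (customer_id INTEGER, city TEXT);

-- A dimension created only when the need arises: a projection over
-- atomic tables, requiring no new keys and no repopulation.
CREATE VIEW dim_customer_marketing AS
SELECT c.customer_id, c.name, a.city
FROM customer c
JOIN customer_address a ON a.customer_id = c.customer_id;
""")
```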

While simplification is key to success over the long term, data warehouse designers must set the proper infrastructure from the beginning to ensure a warehouse remains simple to maintain, iterate, and use, despite increasing complexity over time. A stable data architecture, which may build on pre-existing industry models, is one important ingredient in this solution, and atomic-level models provide a definition of this architecture.

The Data Warehouse Data Model is physically modeled to ensure that the tables necessary for the atomic level (the lowest level of data required to satisfy the business information requirements for the current implementation) are fully attributed and properly designed. Often, end users do not directly access the atomic-level data, but rather access data that has been summarized, combined, and derived from this atomic level.
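As an illustration (hypothetical names, a simplified variant of the credit card fact from the earlier sketch), end users would typically query a summarized structure derived from the atomic level rather than the atomic tables themselves:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fact_cc_transaction (
    customer_id INTEGER, transaction_date TEXT, amount REAL)""")

# Summarized, derived structure for end-user access; the atomic table
# behind it is optimized for loading and history, not for queries.
conn.execute("""
CREATE TABLE monthly_customer_summary AS
SELECT customer_id,
       strftime('%Y-%m', transaction_date) AS month,
       COUNT(*)    AS txn_count,
       SUM(amount) AS total_amount
FROM fact_cc_transaction
GROUP BY customer_id, month
""")
```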

The atomic level data model defines the tables, columns, and access paths required to physically create the data warehouse database. It is designed to facilitate optimal control, storage and historical integrity of the data. As a result, the design of the atomic level may be less focused on providing a high degree of data access performance, and more concerned with ensuring a high degree of efficiency in loading, storage, and historical integrity.

The atomic level of the data warehouse should be designed to meet the informational requirements of the entire organization. However, the scope of the current implementation is the focal point for current design and development activities and decisions. Designing the physical data model, therefore, requires a balance between a focus on the stated goals and objectives of the current implementation, and a focus on the overall informational requirements of the organization. An excellent guideline is to design for the vision and build for the agreed requirements.