04-2-Dimension-Sizing-Other

04-3-Reading-DIMENSION-SIZING-OTHER

AGENDA

04-3.1A closer look at Dimensions

– Hierarchies- General

- Multiple hierarchies- levels

- Members

- Types of Hierarchies

-- Drill Down

-- Snowflake

04-3.2Sizing the data warehouse

04-3.3Redundancy and Aggregates

04-3.4Multiple fact tables

04-3.5Data Marts, Conformed Dimensions

04-3.6Meta Data, Security

04-3.7Maintaining the data warehouse

04-3.8Surrogate Keys

General

As people we can comprehend of about 7 items or 7 groups of items.

Grouping is a way humans can cope with large amounts of data

If the company has very few products it is easy to analyze detailed data. As the number of products increase companies tend to group products.

In the tire business there might be groups such as

Passenger Car tires

Light Truck tires

Heavy Truck Tires

Off-the-Road tires

Specialty tires

Etc …

PRODUCTPRODUCT LINE

10015R TIREPASSENGER

 Product Line is an Aggregation OR Grouping

 Groupings or Aggregations are dimensions that “belong together”

 The term for how they relate is called a HIERARCHY

Hierarchies of attributes often occur in business

04-3.1 Dimensions - Hierarchies

Hierarchies areStructures

Of two or more levels

They are useful ways of organizing data

Another exampleof several levels:

SKU - Product - Brand - Category

6 ozpkgJellO Choc Pudding JellO PuddingJellODesserts

Hierarchies belong in the Dimension table.

Aside:

You can aggregate $ sales for these groupings

04-3.1 Dimensions - Hierarchies

Different Hierarchy Views

Different users might see different hierarchies in the same dimension data

This means that your dimension in the cube needs to handle the different hierarchies. The software is capable of doing this. It is not done in any of the labs.

Example of 2 different views:

Marketing sees the groupingas

SKU - Product - Brand - Category

6 ozpkgJellO Choc Pudding JellO PuddingJellODesserts

Finance cares more about manufacturing costs

Finance might see the relationship as

SKU-Flavour-Package Type

6 oz cup JellO Choc Pudding, Chocolateplastic

Finance is interested in sales and costing information. Different flavours have different costs and so do different packaging. The fact a product belongs to JellO Brand or Sheriff Brand does not make a cost difference but is important to the Marketing department. This means Marketing and Finance look at the same dimension but through different hierarchies. There is a way to do this in Analysis Services.

Both hierarchies are valid, and should be captured in the dimension

 Think of attributes as column headings or an independent list of labels used for reports

 Hierarchies are also known as LEVELS

04-3.1 Dimensions - Hierarchies

Not all attributes are hierarchical

Example:

Package colour

Package size

Note: The data warehouse should allow users to combine various attributes in any way that makes sense to their business view

Product Dimension – Example

Product_key
SKU_description
SKU_number
Product_description
Brand
Category
Flavour
Package_size
Package_colour
Weight
Package_Type
…etc

04-3.1 Dimensions – Hierarchies – Members

VIEWING DATA FROM HIERARCHIES

 Sometimes you want to view data

For the entire hierarchyThe top 1 or 2 levels

 At the lowest level

NOTE: If you did the labs you would have ended up with cubes to try this out

Members – asked on test– Are all the members in a hierarchy

–Only those for a specific level in a dimension

EXAMPLE:

If looking at the TIME dimension all the entries are membersand include members of YEARS, QUARTERS, MONTHS

If looking at the Month

Month is a level

Members of Month are JAN, FEB … DEC

- Lowest level members are called theleaf member
04-3.1 Dimensions – Hierarchies – Types

TYPES OF HIERARCHIES

Time is a balanced hierarchy

By balanced we mean if there are months under Q1then there are also months under Q2, Q3 and Q4

 There are members at ALL levels

Since the hierarchy is balanced it usually means the levels are easy to name

Year -- Quarter – Month

NOTE: Balanced does NOT mean that all levels have the same number of members. It is the case in the TIME dimension, but if we went to the level of daily, then each month has 28, 29, 30 or 31 days in it.

04-3.1 Dimensions – Hierarchies – Types

TYPES OF HIERARCHIES

Unbalanced hierarchy

Example:Organization Chart

Some organizations have many levels of management

Manufacturing would be different than Human Resources. Because they are unbalanced, giving names to the levels is more difficult. One structure may have additional levels.

04-3.1 Dimensions – Hierarchies – Types

TYPES OF HIERARCHIES

Some hierarchies are between balanced and unbalanced.

Example:STORE dimension is based on geography

PQ ONAB BC NY VT OR WA CADISTRICT ZACATECAS FEDERAL DISTRICT

The hierarchy might be

COUNTRY REGION PROVINCE/STATE

It is balanced because it is easy to name but the parents of some members are missing or invisible.

A hierarchy that hides some of the parent members is called RAGGED

Other examples of ragged hierarchies are product dimensions for a store that has diverse product lines. For example, a large home store sells both groceries and hardware

It looks balanced but what do you name the levels?

04-3.1 Dimensions – Hierarchies

The hierarchy just determines how or if the values for the leaf members are summarized into groups. Individual products are at the leaf level or lowest level. Hierarchies summarize them into categories, brand groupings etc.
04-3.1 Dimensions – Drill Down

ONE Definition: (Kimball)Drill Down- As adding more detail to a report by adding more row headers, from any dimension table, regardless of hierarchy

Example:

Brand Package Size Sales

Brawny 2-Pack $50

Brawny 3-Pack $110

Brawny 6-Pack $75

DRILL DOWNBECOMES 

Brand Package Size Color Sales

Brawny 2-Pack White $8

Brawny 2-Pack Brown $5

Brawny 2-PackGreen$37

Brawny 3-Pack White $22

Brawny 3-Pack Green $88
04-3.1 Dimensions – Snowflake

Important to know

Designers may be tempted to break out hierarchies into a series of parent-child table relationships, known as “snowflake” design

EXAMPLE:

Why would they use a Snowflake design?

 Usually done in the belief that it will save space,

 May be easier to move data from OLTP to OLAP since follows a similar table structure in both

 That it will better organize the data.

 Snowflakes do not save significant amounts of space

- The dimension tables don’t take up a lot of space

The design makes navigation more difficult and slower

- Joins require performance overhead

- Less joins faster query

Aside: In an Analysis Service Cube it might not cause slow speed as aggregation may already have been done when the cube was built and when added to each time period.
04-3.2Estimating the Database Size

NOTE: On exam or test

Before embarking on creating a data warehouse, you need to know roughly how big the data warehouse will be. A DW set up and maintenance can be very costly in people and resources. The organization needs to know what those costs will be.

Pricing this out is not the scope of this subject, but calculating the size of the data warehouse is. The size determines the amount of hardware resources needed to run the DW to get the benefit that analysis would provide.

Here is a simple method of estimating the size of the database based on the size of the fact table and is the method used for tests and exams (By Kimball)

1 Estimate the number of distinct rows in each dimension table at the lowest grain

- Usually known or can be obtained

- Do a count in SQL

EXAMPLE of a grocery business

Assume that a new medium size grocery chain wants to set up a data warehouse. It has been decided that the type of analysis they need can be expressed in the following statement.

The organization wants analyze product dollar sales on a daily basis and also to break it down within store. Later they want to add in sales trends by product group and by store region. Also in the future they want to do analysis by cost of goods sold.

For now the grain is Daily sales by product by store.

The dimension tables will have the following number of rows

- Time Dimension:

Lowest grain is daily2 years x 365 days = 730 days

This assumes that they are open every day.

- Store Dimension:

Known that there are 300 stores reporting sales each day

- Product Dimension:

It is known that there are approximately 30,000 products in each store and from management they estimate about 3,000of the products sell each day in a given store

We now have the number of rows in each dimension table, time 730, product 30,000 and stores 300. We also have an important factor that has bearing on the size and that is the fact that only 3,000 of the 30,000 products are sold each day.

Calculating the size of the DW (continued)

Step 1 is done

1 Estimate the number of distinct rows in each dimension table at the lowest grain

Step 2 is

2 Calculate the number of facts based on the cross product of all the dimensions

TIME / STORE / PRODUCT / DOLLAR SALES
17 / 26 / 10394 / 1,234.97
17 / 26 / 10416 / 23,900.23

THE FACT TABLE LOOK LIKE:

HOW MANY ROWS IN THE FACT TABLE?

How calculated.

For 1 day there are 300 rows for the stores

For 1 store there will be 3000 rows for the products sold

Now remember there are 730 days

Calculation is

730 Day entries300 Stores

3000 Products effected each day730 x 300 x 3000 =

Number of rows

657Million
04-3.2 Sizing the DW (continued)

Steps done

1 Estimate the number of distinct rows in each dimension table at the lowest grain

2 Calculate the number of facts based on the cross product of all the dimensionsNow we need to know the size of each row.

STEP 3 is

3 Count the number of fields in the fact table

There are 3 key fields (TIME,STORE, and PRODUCT) and 1 measure meaning 4 fields in total for each row.

Assume each field is on average 4 bytes long then the row is 4 * 4 bytes long.

The estimated size is 657 million * 16 bytes.

10,512,000,000 bytes

Later when the grain is

Daily total sales and total cost by product, by product line by store by region the size estimate will change.

Assume we have a fact table with 657 million rows and there were 4 keys and 4 measures. The size would be 9.8 GB or simply 10

There is a lot of DW’s that are terabyte size

But wait a minute …
04-3.2 Sizing the DW

There is still more …

4 The dimension table sizes are insignificant compared to the fact table

Example:30,000 product records even if it had 50 fields would only be

30,000 x 50 fields x 4 bytes = 6MB = 0.006GB

 This is insignificant when compared to the size of the DW

ASIDE: Some of the fields are quite descriptive such as name. It would still be insignificant. It is also unlikely that you want analysis on 50 different fields so the number of fields would likely be 5 to 10.

 This means that in estimating the size, the dimension tables are usually ignored as insignificant.

There is still more …. (This is beginning to sound like a commercial trying to sell you a new product to vacuum your house)

04-3.2 Sizing the DW

EXERCISE 1:

Estimate the size of a data warehouse which models inventory in a food company’s finished goods warehouses

Given

- Grain: Total product inventory count (QOH), by day, by warehouse-

Dimensions: Time, Product, Warehouse

- Assume 2 years of daily information from 8 warehouses carrying 60,000 products in each warehouse.

- Assume fact table has 3 key fields and 1 fact field

Here are the steps involved

1 Estimate the number of distinct rows in each dimension table at the lowest grain

2 Calculate the number of facts based on the cross product of all the dimensions

3 Count the number of fields in the fact table

4 Assume a field size of 4 bytes

5 The dimension table sizes are insignificant compared to the fact table

6 Indexes and aggregate tables can double or triple the size of the DW

Partial answer:

1 Estimate the number of distinct rows in each dimension table at the lowest grain

Time Dimension:2 years of days = 730

Product Dimension:60,000

Warehouse Dimension:8

2Calculate the number of facts based on the cross product of all the dimensions

Fact table = 730 x 60000 x 8 =

3 Count the number of fields in the fact table

3 key fields + 1 measure field = 4 fields

4 Assume a field size of 4 bytes4 fields x 4 bytes = 16 bytes per row

How many gigabytes is that so far????

5 The dimension table sizes are insignificant compared to the fact tableNothing is added for these tables as they are insignificant in an estimate

6 Indexes and aggregate tables can double or triple the size of the DW

NOTES on Assumptions:

An assumption was made that it was every day

-What if didn’t count or track activity on weekends

- What if don’t count on holidays?

- What if sell 60,000 products, but on average only 30,000 products are in the warehouse at any one time.

- Do we use 30,000 or 60,000?

- If we keep 30,000 in the warehouse the others are in the store, therefore use 30,000 in the calculation.

ANSWER IS IN THE ASSUMPTIONS

REALITY IS  ASK THE BUSINESS AND DO NOT MAKE ASSUMPTIONS

04-3.2 Sizing the DWEXERCISE 2:

A large retail operation uses a grain of

Daily revenue and cost, by store, by product

There are 25,000 different products carried and sold by the company, with each of the 150 stores carrying on average 20,000 different products. Over a month most of the 20,000 products (about 18,000) should have some sales but the company knows that about 3,000 products sell each day in a store.

GUIDELINES:

1 Estimate the number of distinct rows in each dimension table at the lowest grain

2 Calculate the number of facts based on the cross product of all the dimensions

3 Count the number of fields in the fact table

4 Assume a field size of 4 bytes

5 The dimension table sizes are insignificant compared to the fact table

6 Indexes and aggregate tables can double or triple the size of the DW
04-3.3 Redundancy

DW Contains Redundant Data

EXAMPLE:

Using the Product Dimension table

Brand Name will appear in many records

Product=JellO Pudding

Brand=JellOProduct=JellO Regular

Brand=JellOProduct=JellO Light

Brand=JellO

SKU Product Brand Category

6 ozpkgJellO Choc Pudding JellOPuddingDesserts

6 ozpkgJellO Choc Pudding LightJellOPuddingDesserts

NOTE: If the brand name value has to be changed (as in a Type 1 slowly changing dimension) the value must be updated for every occurrence of that value in the table

The reason for using redundancy is to speed up retrieval

Other stuff

The risk of having wrong data after an update is real,

- Must be addressed carefully

- In the data loading and

- In the data publishing processes

Often this will lead to the need for “staging tables” in an area of the database used for controlling the incoming data in 3NF before it gets stored in the Dimension tables

Other terms you might see

Kimball calls such tables the “Back Room”

Bill Inmon calls it the “Data Store”

Others call them the “staging tables”, or sometimes an ODS (Operational Data Store)
04-3.3 Aggregates

Speed Up Retrieval – Another Way

Another way to speed up retrieval is to pre-calculate and store aggregates (summaries) for commonly requested queries.

Example:

If users always run queries to get Total Sales by Day

- A standard query would have to add up all the atomic fact rows for each day

- Summarizing all products

- All stores

Result  Can take a lot of time

Solution:

Having a pre-calculated total

- Already defined and sitting in a table  can speed up retrieval significantly

These pre totaled values take up a lot of space

04-3.3 Aggregates •

In a RDBMS

-only implementation, the aggregates would be stored in their own fact tables in the database

• Recent versions of leading RDBMS are “aggregate aware”

- Meaning if you run a query to summarize data,

- The query processor will check whether there is already an appropriate aggregate table stored in the database, and will use it instead of summing the details

• In a PowerPlay (by COGNOS) implementation, these aggregates are created by PowerPlay and stored in the cube, thus serving much the same purpose as “aggregate aware”

Space Used

Aggregate tables can take up a lot of space in the data warehouse, especially if they only “lightly” summarize the atomic data

- data summarized monthly and yearly takes up space but summarized weekly, monthly, quarterly and yearly takes up a lot more space.

If the atomic data changes, the aggregates may have to be recalculated and replaced - a risk to data accuracy, and a risk of lengthy update runsThe data warehouse designer must trade off between space, speed and data accuracy
04-3.4 Multiple Fact Tables

One Fact TableMultiple Fact Tables

We’ve described simple data warehouses containing one fact table. BUT One of the values of data warehousing is to bring together data from different “subject areas” to allow for richer analysis

Example:

Sales data combined with inventory data and financial data can provide much more insight than sales data alone

How do we model multiple subject areas? Model each as if it were its own data warehouse, designing one fact table for each fact type, then put the tables into a shared schema

The fact tables will often share SOME of the same dimensions

Example:

Sales facts are joined to  Product, Time and Store dimensions

Inventory facts are joined to  Product, Time and Warehouse dimensions

It is essential that the identical dimensions be shared between fact tables, to ensure consistent information is given to users
04-3.4 Multiple Fact Tables Star Schema with 2 Fact Tables

Example: