04-3-Reading-DIMENSION-SIZING-OTHER
AGENDA
04-3.1A closer look at Dimensions
– Hierarchies- General
- Multiple hierarchies- levels
- Members
- Types of Hierarchies
-- Drill Down
-- Snowflake
04-3.2Sizing the data warehouse
04-3.3Redundancy and Aggregates
04-3.4Multiple fact tables
04-3.5Data Marts, Conformed Dimensions
04-3.6Meta Data, Security
04-3.7Maintaining the data warehouse
04-3.8Surrogate Keys
General
As people we can comprehend of about 7 items or 7 groups of items.
Grouping is a way humans can cope with large amounts of data
If the company has very few products it is easy to analyze detailed data. As the number of products increase companies tend to group products.
In the tire business there might be groups such as
Passenger Car tires
Light Truck tires
Heavy Truck Tires
Off-the-Road tires
Specialty tires
Etc …
PRODUCTPRODUCT LINE
10015R TIREPASSENGER
Product Line is an Aggregation OR Grouping
Groupings or Aggregations are dimensions that “belong together”
The term for how they relate is called a HIERARCHY
Hierarchies of attributes often occur in business
04-3.1 Dimensions - Hierarchies
Hierarchies areStructures
Of two or more levels
They are useful ways of organizing data
Another exampleof several levels:
SKU - Product - Brand - Category
6 ozpkgJellO Choc Pudding JellO PuddingJellODesserts
Hierarchies belong in the Dimension table.
Aside:
You can aggregate $ sales for these groupings
04-3.1 Dimensions - Hierarchies
Different Hierarchy Views
Different users might see different hierarchies in the same dimension data
This means that your dimension in the cube needs to handle the different hierarchies. The software is capable of doing this. It is not done in any of the labs.
Example of 2 different views:
Marketing sees the groupingas
SKU - Product - Brand - Category
6 ozpkgJellO Choc Pudding JellO PuddingJellODesserts
Finance cares more about manufacturing costs
Finance might see the relationship as
SKU-Flavour-Package Type
6 oz cup JellO Choc Pudding, Chocolateplastic
Finance is interested in sales and costing information. Different flavours have different costs and so do different packaging. The fact a product belongs to JellO Brand or Sheriff Brand does not make a cost difference but is important to the Marketing department. This means Marketing and Finance look at the same dimension but through different hierarchies. There is a way to do this in Analysis Services.
Both hierarchies are valid, and should be captured in the dimension
Think of attributes as column headings or an independent list of labels used for reports
Hierarchies are also known as LEVELS
04-3.1 Dimensions - Hierarchies
Not all attributes are hierarchical
Example:
Package colour
Package size
Note: The data warehouse should allow users to combine various attributes in any way that makes sense to their business view
Product Dimension – Example
Product_keySKU_description
SKU_number
Product_description
Brand
Category
Flavour
Package_size
Package_colour
Weight
Package_Type
…etc
04-3.1 Dimensions – Hierarchies – Members
VIEWING DATA FROM HIERARCHIES
Sometimes you want to view data
For the entire hierarchyThe top 1 or 2 levels
At the lowest level
NOTE: If you did the labs you would have ended up with cubes to try this out
Members – asked on test– Are all the members in a hierarchy
OR
–Only those for a specific level in a dimension
EXAMPLE:
If looking at the TIME dimension all the entries are membersand include members of YEARS, QUARTERS, MONTHS
If looking at the Month
Month is a level
Members of Month are JAN, FEB … DEC
- Lowest level members are called theleaf member
04-3.1 Dimensions – Hierarchies – Types
TYPES OF HIERARCHIES
Time is a balanced hierarchy
By balanced we mean if there are months under Q1then there are also months under Q2, Q3 and Q4
There are members at ALL levels
Since the hierarchy is balanced it usually means the levels are easy to name
Year -- Quarter – Month
NOTE: Balanced does NOT mean that all levels have the same number of members. It is the case in the TIME dimension, but if we went to the level of daily, then each month has 28, 29, 30 or 31 days in it.
04-3.1 Dimensions – Hierarchies – Types
TYPES OF HIERARCHIES
Unbalanced hierarchy
Example:Organization Chart
Some organizations have many levels of management
Manufacturing would be different than Human Resources. Because they are unbalanced, giving names to the levels is more difficult. One structure may have additional levels.
04-3.1 Dimensions – Hierarchies – Types
TYPES OF HIERARCHIES
Some hierarchies are between balanced and unbalanced.
Example:STORE dimension is based on geography
PQ ONAB BC NY VT OR WA CADISTRICT ZACATECAS FEDERAL DISTRICT
The hierarchy might be
COUNTRY REGION PROVINCE/STATE
It is balanced because it is easy to name but the parents of some members are missing or invisible.
A hierarchy that hides some of the parent members is called RAGGED
Other examples of ragged hierarchies are product dimensions for a store that has diverse product lines. For example, a large home store sells both groceries and hardware
It looks balanced but what do you name the levels?
04-3.1 Dimensions – Hierarchies
The hierarchy just determines how or if the values for the leaf members are summarized into groups. Individual products are at the leaf level or lowest level. Hierarchies summarize them into categories, brand groupings etc.
04-3.1 Dimensions – Drill Down
ONE Definition: (Kimball)Drill Down- As adding more detail to a report by adding more row headers, from any dimension table, regardless of hierarchy
Example:
Brand Package Size Sales
Brawny 2-Pack $50
Brawny 3-Pack $110
Brawny 6-Pack $75
DRILL DOWNBECOMES
Brand Package Size Color Sales
Brawny 2-Pack White $8
Brawny 2-Pack Brown $5
Brawny 2-PackGreen$37
Brawny 3-Pack White $22
Brawny 3-Pack Green $88
04-3.1 Dimensions – Snowflake
Important to know
Designers may be tempted to break out hierarchies into a series of parent-child table relationships, known as “snowflake” design
EXAMPLE:
Why would they use a Snowflake design?
Usually done in the belief that it will save space,
May be easier to move data from OLTP to OLAP since follows a similar table structure in both
That it will better organize the data.
Snowflakes do not save significant amounts of space
- The dimension tables don’t take up a lot of space
The design makes navigation more difficult and slower
- Joins require performance overhead
- Less joins faster query
Aside: In an Analysis Service Cube it might not cause slow speed as aggregation may already have been done when the cube was built and when added to each time period.
04-3.2Estimating the Database Size
NOTE: On exam or test
Before embarking on creating a data warehouse, you need to know roughly how big the data warehouse will be. A DW set up and maintenance can be very costly in people and resources. The organization needs to know what those costs will be.
Pricing this out is not the scope of this subject, but calculating the size of the data warehouse is. The size determines the amount of hardware resources needed to run the DW to get the benefit that analysis would provide.
Here is a simple method of estimating the size of the database based on the size of the fact table and is the method used for tests and exams (By Kimball)
1 Estimate the number of distinct rows in each dimension table at the lowest grain
- Usually known or can be obtained
- Do a count in SQL
EXAMPLE of a grocery business
Assume that a new medium size grocery chain wants to set up a data warehouse. It has been decided that the type of analysis they need can be expressed in the following statement.
The organization wants analyze product dollar sales on a daily basis and also to break it down within store. Later they want to add in sales trends by product group and by store region. Also in the future they want to do analysis by cost of goods sold.
For now the grain is Daily sales by product by store.
The dimension tables will have the following number of rows
- Time Dimension:
Lowest grain is daily2 years x 365 days = 730 days
This assumes that they are open every day.
- Store Dimension:
Known that there are 300 stores reporting sales each day
- Product Dimension:
It is known that there are approximately 30,000 products in each store and from management they estimate about 3,000of the products sell each day in a given store
We now have the number of rows in each dimension table, time 730, product 30,000 and stores 300. We also have an important factor that has bearing on the size and that is the fact that only 3,000 of the 30,000 products are sold each day.
Calculating the size of the DW (continued)
Step 1 is done
1 Estimate the number of distinct rows in each dimension table at the lowest grain
Step 2 is
2 Calculate the number of facts based on the cross product of all the dimensions
TIME / STORE / PRODUCT / DOLLAR SALES17 / 26 / 10394 / 1,234.97
17 / 26 / 10416 / 23,900.23
THE FACT TABLE LOOK LIKE:
HOW MANY ROWS IN THE FACT TABLE?
How calculated.
For 1 day there are 300 rows for the stores
For 1 store there will be 3000 rows for the products sold
Now remember there are 730 days
Calculation is
730 Day entries300 Stores
3000 Products effected each day730 x 300 x 3000 =
Number of rows
657Million
04-3.2 Sizing the DW (continued)
Steps done
1 Estimate the number of distinct rows in each dimension table at the lowest grain
2 Calculate the number of facts based on the cross product of all the dimensionsNow we need to know the size of each row.
STEP 3 is
3 Count the number of fields in the fact table
There are 3 key fields (TIME,STORE, and PRODUCT) and 1 measure meaning 4 fields in total for each row.
Assume each field is on average 4 bytes long then the row is 4 * 4 bytes long.
The estimated size is 657 million * 16 bytes.
10,512,000,000 bytes
Later when the grain is
Daily total sales and total cost by product, by product line by store by region the size estimate will change.
Assume we have a fact table with 657 million rows and there were 4 keys and 4 measures. The size would be 9.8 GB or simply 10
There is a lot of DW’s that are terabyte size
But wait a minute …
04-3.2 Sizing the DW
There is still more …
4 The dimension table sizes are insignificant compared to the fact table
Example:30,000 product records even if it had 50 fields would only be
30,000 x 50 fields x 4 bytes = 6MB = 0.006GB
This is insignificant when compared to the size of the DW
ASIDE: Some of the fields are quite descriptive such as name. It would still be insignificant. It is also unlikely that you want analysis on 50 different fields so the number of fields would likely be 5 to 10.
This means that in estimating the size, the dimension tables are usually ignored as insignificant.
There is still more …. (This is beginning to sound like a commercial trying to sell you a new product to vacuum your house)
04-3.2 Sizing the DW
EXERCISE 1:
Estimate the size of a data warehouse which models inventory in a food company’s finished goods warehouses
Given
- Grain: Total product inventory count (QOH), by day, by warehouse-
Dimensions: Time, Product, Warehouse
- Assume 2 years of daily information from 8 warehouses carrying 60,000 products in each warehouse.
- Assume fact table has 3 key fields and 1 fact field
Here are the steps involved
1 Estimate the number of distinct rows in each dimension table at the lowest grain
2 Calculate the number of facts based on the cross product of all the dimensions
3 Count the number of fields in the fact table
4 Assume a field size of 4 bytes
5 The dimension table sizes are insignificant compared to the fact table
6 Indexes and aggregate tables can double or triple the size of the DW
Partial answer:
1 Estimate the number of distinct rows in each dimension table at the lowest grain
Time Dimension:2 years of days = 730
Product Dimension:60,000
Warehouse Dimension:8
2Calculate the number of facts based on the cross product of all the dimensions
Fact table = 730 x 60000 x 8 =
3 Count the number of fields in the fact table
3 key fields + 1 measure field = 4 fields
4 Assume a field size of 4 bytes4 fields x 4 bytes = 16 bytes per row
How many gigabytes is that so far????
5 The dimension table sizes are insignificant compared to the fact tableNothing is added for these tables as they are insignificant in an estimate
6 Indexes and aggregate tables can double or triple the size of the DW
NOTES on Assumptions:
An assumption was made that it was every day
-What if didn’t count or track activity on weekends
- What if don’t count on holidays?
- What if sell 60,000 products, but on average only 30,000 products are in the warehouse at any one time.
- Do we use 30,000 or 60,000?
- If we keep 30,000 in the warehouse the others are in the store, therefore use 30,000 in the calculation.
ANSWER IS IN THE ASSUMPTIONS
REALITY IS ASK THE BUSINESS AND DO NOT MAKE ASSUMPTIONS
04-3.2 Sizing the DWEXERCISE 2:
A large retail operation uses a grain of
Daily revenue and cost, by store, by product
There are 25,000 different products carried and sold by the company, with each of the 150 stores carrying on average 20,000 different products. Over a month most of the 20,000 products (about 18,000) should have some sales but the company knows that about 3,000 products sell each day in a store.
GUIDELINES:
1 Estimate the number of distinct rows in each dimension table at the lowest grain
2 Calculate the number of facts based on the cross product of all the dimensions
3 Count the number of fields in the fact table
4 Assume a field size of 4 bytes
5 The dimension table sizes are insignificant compared to the fact table
6 Indexes and aggregate tables can double or triple the size of the DW
04-3.3 Redundancy
DW Contains Redundant Data
EXAMPLE:
Using the Product Dimension table
Brand Name will appear in many records
Product=JellO Pudding
Brand=JellOProduct=JellO Regular
Brand=JellOProduct=JellO Light
Brand=JellO
SKU Product Brand Category
6 ozpkgJellO Choc Pudding JellOPuddingDesserts
6 ozpkgJellO Choc Pudding LightJellOPuddingDesserts
NOTE: If the brand name value has to be changed (as in a Type 1 slowly changing dimension) the value must be updated for every occurrence of that value in the table
The reason for using redundancy is to speed up retrieval
Other stuff
The risk of having wrong data after an update is real,
- Must be addressed carefully
- In the data loading and
- In the data publishing processes
Often this will lead to the need for “staging tables” in an area of the database used for controlling the incoming data in 3NF before it gets stored in the Dimension tables
Other terms you might see
Kimball calls such tables the “Back Room”
Bill Inmon calls it the “Data Store”
Others call them the “staging tables”, or sometimes an ODS (Operational Data Store)
04-3.3 Aggregates
Speed Up Retrieval – Another Way
Another way to speed up retrieval is to pre-calculate and store aggregates (summaries) for commonly requested queries.
Example:
If users always run queries to get Total Sales by Day
- A standard query would have to add up all the atomic fact rows for each day
- Summarizing all products
- All stores
Result Can take a lot of time
Solution:
Having a pre-calculated total
- Already defined and sitting in a table can speed up retrieval significantly
These pre totaled values take up a lot of space
04-3.3 Aggregates •
In a RDBMS
-only implementation, the aggregates would be stored in their own fact tables in the database
• Recent versions of leading RDBMS are “aggregate aware”
- Meaning if you run a query to summarize data,
- The query processor will check whether there is already an appropriate aggregate table stored in the database, and will use it instead of summing the details
• In a PowerPlay (by COGNOS) implementation, these aggregates are created by PowerPlay and stored in the cube, thus serving much the same purpose as “aggregate aware”
Space Used
Aggregate tables can take up a lot of space in the data warehouse, especially if they only “lightly” summarize the atomic data
- data summarized monthly and yearly takes up space but summarized weekly, monthly, quarterly and yearly takes up a lot more space.
If the atomic data changes, the aggregates may have to be recalculated and replaced - a risk to data accuracy, and a risk of lengthy update runsThe data warehouse designer must trade off between space, speed and data accuracy
04-3.4 Multiple Fact Tables
One Fact TableMultiple Fact Tables
We’ve described simple data warehouses containing one fact table. BUT One of the values of data warehousing is to bring together data from different “subject areas” to allow for richer analysis
Example:
Sales data combined with inventory data and financial data can provide much more insight than sales data alone
How do we model multiple subject areas? Model each as if it were its own data warehouse, designing one fact table for each fact type, then put the tables into a shared schema
The fact tables will often share SOME of the same dimensions
Example:
Sales facts are joined to Product, Time and Store dimensions
Inventory facts are joined to Product, Time and Warehouse dimensions
It is essential that the identical dimensions be shared between fact tables, to ensure consistent information is given to users
04-3.4 Multiple Fact Tables Star Schema with 2 Fact Tables
Example: