Module 9 – Data Warehousing, Data Marts & Bridging the Information Gap

Content:

In this chapter we will look at the concepts of data warehousing, data marts and data mining.

As companies have created both flat files and databases over the years, there have been many instances where data hasn’t been analyzed, created and stored in a cohesive manner. Many companies have created databases without using the E/R and normalization process. Data from flat files have been simply moved into a database format with no redesign. Thus, much of the data stored in current databases does not give us the results we expect. One way to deal with this data problem is to create data warehouses.

DATA WAREHOUSE – a data warehouse is a subject-oreinted, integrated, time-variant, nonupdatable collection of data used in support of management decision-making.

  1. Subject-oreinted organized around key subjects of the enterprise (customers, patients, students, products, etc.)
  2. Integrated – data are defined using consistent naming conventions, formats, encoding structures and related characteristics gathered from several systems
  3. Time-variant where data in the warehouse contain a time dimension so that they may also be used to study trends and changes
  4. Nonupdateable means that data loaded into the warehouse and refreshed from operational sysems ,but it cannot be updaed by end users

The need for data warehouses include: 1) a company requires an integrated, company-wide view of high-quality information, and 2) information systems must separate informational from operational systems to improve performance in managing company data

PROCESS

  1. Data are extracted from the various internal and external source system files and databases. In a large organization, there ay be dozens or even hundreds of such files and databases
  2. The data from the various source systems are transformed and integrated before being loaded into the data marts. The datawarehouse is a set of data marts
  3. The data warehouse is a set of physically distinct databases organized for decision support – it contains detailed and summary data
  4. Users access the data warehouse by means of a variety of query languagaes and analytical tools

DATA MARTS

These may be either logical of physical. Data marts may be certain data that pertain to a particular area (customers, patients, etc.) pulled from a data warehouse. Logical data marts are views of the main data waqrehouse.

CHARACTERISTICS OF DATA

  1. Status versus Event data – status data is a before and after image; transactions are business activities that cause one or more business events to occur at a database level – an event results in one or more database actions (create, update or delete)
  2. Transient versus Periodic data – transient data are data in which changes to existing records are written over previous recors, thus destroying the previous data conent; periodic data are data that are never physically altered or deleted once they have been added to the store

SCHEMAS

Star schema is a simple database design in which dimensional data are separated from fact or event data – also called a dimensional model.

Two types of tables are in a star schema – Fact tables contain factual or quantitative data that are numerical, continuously valued, and addivite; dimensional tables hold descriptive data

Essential Rules of Dimensional Modeling:

  1. Use atomic facts
  2. Create single-process fact tables
  3. Include a dae dimension for every fact table
  4. Enforce consistent grain
  5. Disallow null keys in fact tables
  6. Honor hierarchies
  7. Decode dimension tables
  8. Use surrogate keys
  9. Conform dimensions
  10. Balance requirements with actual data

OLAP Tools (Online analytical processing)is the use of a set of query and reporting tools that provides users with multimensional views of their data. Types of activities include Slicing a Cube, Drilling Down, and Summarizing more than 3 dimensions.

Data Mining allows users to look for patterns and trends in data. It is referred to as knowledge discovery. The goals of data mining are:

  1. Explanatory – to explain some observed event or condition
  2. Confirmatory – to confirm a hypothesis
  3. Exploratory – to analyze data for new or unexpected relationships

Data mining techniques include: regression, decision tree induction, clustering and signal processing, affinity, sequence association, case-based reasoning, rule discovery, fractals and neural nets.

Links: