Data Warehousing - An Overview
Information Technology (IT) has historically influenced organizational performance and competitive standing. The increasing processing power and sophistication of analytical tools and techniques have put the strong foundation for the product called data warehouse. There are a number of reasons that any organization should consider a data warehouse, which can be the critical tool for maximizing the organization’s investment in the information it has collected and stored throughout the enterprise. IT managers need to understand the rationale and benefits of data warehouses because they may need to design and implement, or procure this kingpin of business intelligence.
The data warehouses are supposed to provide storage, functionality and responsiveness to queries beyond the capabilities of today's transaction-oriented databases. Also data warehouses are set to improve the data access performance of databases. Traditional databases balance the requirement of data access with the need to ensure integrity of data. In present day organizations, users of data are often completely removed from the data sources. Many people only need read-access to data, but still need a very rapid access to a larger volume of data than can conveniently by downloaded to the desktop. Often such data comes from multiple databases. Because many of the analyses performed are recurrent and predictable, software vendors and systems support staff have begun to design systems to support these functions. Currently there comes a necessity for providing decision makers from middle management upward with information at the correct level of detail to support decision-making. Data warehousing, online analytical processing (OLAP) and data mining provide this functionality.
Here, we are to discuss what is all about data warehouse, how it helps to gain a competitive edge for an organization.
Data Warehouse - An Introduction
A data warehouse is defined as a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions. More generally, data warehousing is a collection of decision support technologies, aimed at enabling the knowledge worker, such as executive, manager, and analyst, to arrive at better and faster decisions. Data warehouses provide access to data for complex analysis, knowledge discovery, and decision-making. style="mso-spacerun: They support high performance demands on an organization's data and information. It provides an enormous amount of historical and static data from three tiers:
- Relational databases
- Multidimensional OLAP applications
- Client analysis tools
Several types of applications such as online analytical processing (OLAP), decision-support systems (DSS) and data mining are being supported. OLAP is a term used to describe the analysis of complex data from the data warehouse.
OLAP is a software technology that allows users to easily and quickly analyze and view data from multiple points-of-view. OLAP provides dynamic and multi-dimensional support to executives and managers who need to understand different aspects of the data. Activities that are supported include:
§ Analyzing financial trends
§ Creating slices of data
§ Finding new relationships among the data
§ Drilling down into sales statistics
§ Doing calculations through different dimensions where each category of data (that is, product, location, sales numbers, time period, etc.) is considered a dimension.
There are OLAP tools that use distributed computing capabilities for analyses that require more storage and processing power than can be economically and efficiently located on an individual desktop.
DSS support an organization's leading decision makers with higher-level data for complex and critical decisions. A DSS queries a data warehouse or an OLAP database for relevant information that can be compared in order to make a business decision and predict the impact of that decision.
Finally, data mining is being used for knowledge discovery, the process of searching data for unanticipated new knowledge.
Knowledge workers and decision makers use tools ranging from parametric queries to ad hoc queries to data mining. Thus, the access component of the data warehouse must provide support of structured queries (both parametric and ad hoc). These together make up a managed query environment.
Databases Vs Data Warehouses
A database is a collection of related data and a database system is a database and database software together.
A data warehouse is also a collection of information as well as a supporting system.
Databases are transactional such as relational, object-oriented, network or hierarchical. Traditional databases support on-line transaction processing (OLTP), which includes insertions, updates, and deletions, while also supporting information query requirements. Traditional databases are optimized to process queries that may touch a small part of the database and transactions that deal with insertions or updates of a few tuples per relation to process.
Thus databases must strike a balance between efficiency in transaction processing and supporting query requirements (ad hoc user requests), That is, they can't further optimized for the applications such as OLAP, DSS and data mining.
But a data warehouse is typically optimized for access from a decision maker's needs. Data warehouses are designed specifically to support efficient extraction, processing and presentation for analytic and decision-making purposes.
In contrast to databases, data warehouses generally contain very large amounts of data from multiple sources that may include databases from different data models and sometimes files acquired from independent systems and platforms.
Multidatabases provide access to disjoint and usually heterogeneous databases and are volatile. Whereas a data warehouse is frequently a store of integrated data from multiple sources, processed for storage in a multidimensional model and nonvolatile. Data warehouses also support time-series and trend analysis, both of which require more historical data.
In transactional systems, transactions are the unit and are the agent of change to the database, but data warehouse information is much more coarse-grained and is refreshed according to a careful choice of incremental refresh policy. Warehouse updates are handled by the warehouse's acquisition component that provides all required processing. As data warehouses encompass large volumes of data, they are more or less double the size of source databases.
The sheer volume of data likely to be in terabytes is an issue that has been dealt with through enterprise-wide data warehouses, virtual data warehouses and data marts. Enterprise-wide data warehouses are huge projects in need of massive investment of time and resources. Virtual data warehouses are bound to provide views of operational databases that are materialized for efficient access. A data mart is an easy-to-access repository of a subset of highly focused data for a single function or department (i.e., finance, sales, marketing) and is considerably smaller than a data warehouse. The data comes form operational information that is needed by a particular group of employees for analysis, content, presentations all in terms that are familiar to them. Data for a data mart is derived from a data warehouse or from more specialized access.
Distinctive Characteristics of Data Warehouses
- Data warehouses are supposed to be blessed with the following unique features.
- Multidimensional conceptual view and generic dimensionality,
- Unlimited dimensions and aggregation levels and unrestricted cross-dimensional operations,
- Dynamic sparse matrix handling,
- Client/server architecture and multi-user support,
- Accessibility and transparency, intuitive data manipulation and consistent reporting performance
As data warehouses are not much particular about transaction processing, there is an increased efficiency in query processing. There are some specialized tools and techniques. They are query transformation, index intersection and union, special ROLAP (Relational OLAP), MOLAP (multidimensional OLAP), DOLAP (Database OLAP) and WOLAP (Web OLAP) functions, SQL extensions, advanced join methods, and intelligent scanning.
Traditional OLAP products are also known as multidimensional OLAP. Relational OLAP tools take data from traditional two-dimensional or relational databases and create multidimensional views upon request rather than being prepared in advance as in OLAP. ROLAP is often used on complex data with a wide number of fields, such as customer data. DOLAP is a relational database management system designed to perform OLAP calculations. WOLAP refers to OLAP data that can be reached from a Web server.
Parallel processing can enhance the performance of data warehouse. Parallel server architectures include symmetric
Data Modeling for Data Warehouses
Multidimensional models highly take advantage of inherent relationships existing in data to populate data in multidimensional matrices referred to as data cubes. If the dimensional of the matrix is greater than three, then it is called hypercubes. Query performance in multidimensional matrices for data that lend themselves to dimensional formatting can be much better than in the relational data model. For a corporate data warehouse, three examples of dimensions would be the corporation's fiscal periods, products and regions.
Multidimensional models lend themselves readily to hierarchical views such as roll-up display and drill-down display. Roll-up display moves up the hierarchy, grouping into larger units along a dimension. A drill-down display provides the opposite capability, furnishing a finer-grained view through disaggregating process.
The Multidimensional storage model involves two types of tables: dimension tables and fact tables. A dimension table consists of tuples of attributes of the dimension. A fact table can be thought of as having tuples, one per a recorded fact. This fact contains some measured or observed variables and identifies them with pointers to dimension tables.
Two common multidimensional schemas are the star schema and the snowflake schema. The star schema consists of a fact table with a single table for each dimension. The snowflake schema is a variation on the star schema in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them. A fact constellation is a set of fact tables that share some dimension tables.
Data warehouse storage also utilizes indexing techniques to support high performance access. A technique called bitmap indexing constructs a bit vector for each value in a domain (column) being indexed. It does well for domains of low-cardinality. Bitmap indexing can provide considerable input/output and storage space advantages in low-cardinality domains. With bit vectors a bitmap index can provide dramatic improvements in comparison, aggregation, and join performance.
In a star schema, dimensional data can be indexed to tuples in the fact table by join indexing. Join indexes are traditional indexes to maintain relationships between primary key and foreign key values. They relate the values of a dimension of a star schema to rows in the fact table. Data warehouse storage can facilitate access to summary data by taking further advantage of the nonvolatility of data warehouses and a degree of predictability of the analyses that will be performed using them.
Building a Data Warehouse
Warehouse builders should take a broad view of the anticipated use of the warehouse. The design should specifically support adhoc querying that is, accessing data with any meaningful combination of values for the attributes in the dimension or fact tables. The following steps are being involved during the data acquisition phase.
1. The data must be extracted from multiple, heterogeneous sources such as databases or other data feeds.
2. Data must be formatted for consistency within the data warehouse. Names, meanings and domains of data from unrelated sources must be reconciled.
3. The data must be cleaned to ensure validity. Data cleaning is an important part in building a data warehouse and it is being touted as the largest labor-demanding one. For input data, cleaning process has to be performed before the data are loaded in the warehouse. Data warehouse builders have to check for validity and quality when the input data must be examined and formatted consistently.
4. The data must be fitted into the data model of the warehouse. Data from the various sources must be installed in the data model of the warehouse. Data may have to be converted from relational, object-oriented, or legacy databases.
5. The data must be loaded into the warehouse. The sheer volume of data in the warehouse makes loading the data a significant task. Monitoring tools for loads as well as methods to recover from incomplete or incorrect loads are required. With the huge volume of data in the warehouse, incremental updating is usually the only feasible approach. The refresh policy will probably emerge as a compromise that takes into account the answers to the following questions.
How up-to-date the data must be?
Can the warehouse go off-line, and for how long?
What are the data interdependencies?
What is the storage availability?
What are the distribution requirements such as for replication and partitioning?
What is the loading time including cleaning, formatting, copying, transmitting and overhead such as index rebuilding?
Data storage in a data warehouse involves the following processes:
Storing the data according to the data model of the warehouse
Creating and maintaining required data structures
Creating and maintaining appropriate access paths
Providing for time-variant data as new data are added
Supporting the updating of warehouse data
Refreshing and purging the data
The sheer volume of data in the warehouse generally makes it impossible to simply reload the warehouse in its entirety later on. There are a couple of alternatives for this problem such as selective refreshing of data and separate warehouse versions. When the warehouse uses an incremental data refreshing mechanism, data may need to be periodically purged.
Data warehouses should also be designed with by taking the environment in which they reside into account. The important points behind the data warehouse design are
. Usage projections
. The fit of the data model
. Characteristics of available sources
. Design of the metadata component
. Modular component design
. Design for manageability and change
There are mainly two types of data warehouses. They are distributed warehouse and federated warehouse. For a distributed data warehouse, all the issues such as replication, partitioning, communication and consistency, of distributed databases are taken into account. As usual, the benefits of distribution, such as load balancing, scalability of performance and higher availability, are available with distributed data warehouse. A single replicated metadata repository would reside at each distribution site.
The idea of the federated warehouse is like that of the federated database: a decentralized confederation of autonomous data warehouses, each with its own metadata repository. Given the magnitude of the challenge inherent to data warehouses, it is likely that such federations will consist of smaller-scale components, such as data marts. Large organizations may choose to federate data marts rather than build huge data warehouses.