A Data Warehouse Architecture in Layers for Science and Technology

A Data Warehouse Architecture in Layers for Science and Technology

André Luís Andrade Menolli1, and Maria Madalena Dias2

1 Departamento de Ciência da Computação, Faculdades Luiz Meneghel,

Rod. BR 369, Km 54, Caixa Postal 261

86360-000 Bandeirantes, Paraná, Brasil

2 Departamento de Informática, Universidade Estadual de Maringá,

Avenida Colombo 5790

870200-900 Maringá, Paraná, Brasil

Abstract

The data warehousing technology has been widely used in companies with the aim of offering organization, management and data integration. In the knowledge discoveryprocess in database, the first step is the data preparation, where the data must be organized and stored in the data warehouse (DW). The construction of a DW is a complex and laborious task because it demands a high knowledge of the involved company business, and also of the contents of data sources and of the technologies, such as database, data integration and data warehousing. The choice of the architecture is an important decision to be making to build a DW. This paper presents a DW layered architecture of integrated and increased data marts for management in Science and Technology (S&T). This architecture is applied to build a DW that integrates the institutional bases of CNPq and CAPES.

Keywords. Data Warehouse, Science and Technology, Data Integration.

1. Introduction

Companies of different activities and government organizations have data in several data sources that contain important information for the support to decisions making. Thus, to make these decisions are necessary the understanding and preparation of these data.

Data Warehousing provides tools to support decisions making through the integration of all company data in a repository. The data can be accessed easily and analyzed without the lost time involved in the manipulation and processing. “DW is an integrated, non-volatile, time-variant set of data based on subject matter, supporting management decisions” [19].

The organization can to build a DW or integrated Data Marts (DM) that facilitates the process and does not require great initial investment. DM is a departmental DW, it´s built for an organization’s particular area [19].

Not only companies, but any market segment which want to do the data integration of several databases, a DW can be built, making possible the use of an integrated database in the decisions making.

The outcome of the use of a DW for the information processing is a direct profit of productivity and time. The data are easily accessed and analyzed without expending time in the manipulation and in the processing. The decisions are taken quickly and safely, because the data are available at the exact moment they are necessary and with the necessary consistency.

To have a good planning in S&T is necessary that the manager has safe information for the decision making. For this, a DW is built, and DW architecture would have to be chosen or defined.

The Brazilian Science & Technology (S&T) environment have important databases that are consolidated and trustworthy. The main bases are: Brazilian Research Groups [5], Brazilian Researchers Curriculum (Lattes Curriculum) [5], and Brazilian Post-graduation [4].

Despite exist in Brazil important databases, those databases are independent and do not give a global vision of research in Brazil. It is necessary integrate theses databases, because the CAPES database there are just data about Brazilian Postgraduation while the CNPq database there are data only about Brazilian researcher. Hence is very important the integration of these databases to understand the behavior of research and researchers in Brazil and a DW is a way to obtain it.

The remainder of this paper is organized as it follows: Section 2, architectures proposed by other authors are related; Section 3, the proposed architecture is described; Section 4, the steps to build a DW for management in S&T are described and its data model is presented;Section 5, some results obtained through queries to the DW are showed. Finally, in the Section 6 the conclusions of this paper are presented, and in the Section 7 some future works are suggested.

2. Related Works

In the project of a DW, the choice of the architecture and development methodology should be the first decisions to be making, because the organization of the components depends of these choices.

Inmon [19] presents a DW architecture in that a global DW is built to the necessity of whole company. The data storage is composed by a decision support for common data repository, available for the all company. In this architecture it is possible to build a centralized DW or a distributed DW. According to [19], this architecture has as advantage the access to the data corporate vision, however, this architecture has a high implementation cost and a long time of development.

Kimball et al. [17] proposed the Data Warehouse Bus Architecture in that the storage area is composed by several data marts projected to maintain the data integration. This Architecture is divided in two great components, the Back Room and the Front Room. The Back Room represents the tools and the processes for the acquisition of the data sources (Extraction, Transformation and Load - ETL). The Front Room represents all tools and processes necessary for interaction with the final user. According to [17], in spite of the Bus Architecture make possible to search the first results faster and with less cost, it has complex metadata to manage data distribution and integration.

Singh [9] presents an architecture model in multiple layers for a DW, containing the following layers: Message of the Application Layer, Metadata Layer, Information Access Layer, Data Access Layer, DW Layer, Staging Area Layer, Operational Data Layer, and the ProcessManagement Layer. This architecture proposes a model in layers but it does not follow the style of architecture in layers presented for [14].

3. Proposed Data Warehouse Architecture

In this paper is proposed a layered architecture of integrated and incremental data marts. In this architecture, new data marts can be implemented or data marts can be modified allowing inclusion of new requirements. The main characteristic of this architecture is the replace of data sources of several formats for a standardized database, facilitating the data integration.

This architecture is constituted of five layers, represented in Figure 1 and described in the following sections. The thick arrows represent the load of data, and the fine arrows represent the access to the data.

In the architecture in layers, the main advantage is the independence of the layers, which facilitates the maintenance of the layers. A layer hides details of implementation of the others layers. Interfaces between the layers are defined to allow access of services provided for the layers.

3.1. Data Source Layer

In any DW environment the extraction of data from some data source is necessary, because DW is not a conventional database. The Data Source layer represents all data sources that are used as information suppliers to DW. These data sources can be in the reference format or in others formats. The reference format is chosen by the DW planner according with the format in that the data will be stored in DW.

3.2. ETL Layer

In this layer the data are extracted of the data sources and they are stored in the staging area. In this layer there are the largest problems in the development of a DW. According to [3], estimate that more than 1/3 of the cost and time are spent to ETL and problems like extraction, transformation and cleaning, which can take up 80% of the time spent in the development of a DW [11].

The ETL layer is divided in two modules, a migration module and another of extraction, transformation and load. The migration module transfers data in several formats for a reference format chosen by DW planner. The module of extraction, transformation and load collects relevant data, cleans, integrates, transforms and stores these data in the staging area.

When a data source does not be in the reference format, a data model must be defined and a data structure must be created in the reference format for these data.

The ETL layer makes possible the integration of data of several sources, facilitating the maintenance and the insert of new data sources

3.3. Staging Area Layer

The staging area is an intermediate step between the extraction of the data source and the load in DW. In this area the data receive the first cleanings, transformations and integrations.

According to [16], a fundamental concept that greatly simplifies DW projects and facilitates maintenance is the use of data staging areas. Starting from the logical project of the staging area, it is possible to have a good idea of the attributes and sources tables necessary to populate the DW. The data staging should just contain necessary information to populate the warehouse.

This layer and the ETL layer include all databases and tools that are necessary to transform, integrate, load, control the quality and clean the data [2].

The staging area, defined in this work, uses a normalized data model to allow a best data consistence and to facilitate the integration process among different databases.

3.4. Data Warehouse Layer

The DW layer represents integrated data marts. Dimensional model was used in the data model of the data marts and a bus matrix was elaborated to identify the intersection between the data marts.

The dimensional model was used because, according to [21], there are two main advantages of using a dimensional model in data warehouse environments. At first, a dimensional model provides a multidimensional analysis space in relational database environments. At second, a typical and not normalized dimensional model has a simple schema structure, which it simplifies the processing of query and improves performance.

In the definition of the DW data model, the first step is the definition of the data marts that are in a single data source initially. In the second step, the dimensions are identified for data marts defined in the first step and the intersection between the data marts and their dimensions are marked. In the third step, is elaborated the bus matrix. In the sequence, the granularity of the fact tables of the data marts would have to be declared. Also is verified which dimensions are related to these tables. Finally, are defined the facts that compose the fact tables.

Following this implementation model, there is the possibility of superposition of facts between the fact tables of the data marts, and replication of dimensions. To solve this problem is used a common data area, which it consists of dimensions and facts used by several data marts.

3.5. Analysis Area Layer

In the analysis area layer are defined the responsible components for the search of information and interesting knowledge in DW. These components can be classified as: OLAP tools, data mining tools and materialized views. For any type of component that is used, a special data preparation is necessary.

The OLAP tools allow analysis and management, providing a great profit of performance, as fast access to a great variety of data views organized through the multidimensional database.

The data mining tools in general are more complex than the OLAP tools. The complexity of these tools is related with the complexity of the algorithms that implement the data mining techniques. The main objective of the data mining tool is to discover occult and relevant knowledge in the database.

A simpler solution than data mining techniques is the use of materialized views. In this area there are many studies developing methods to facilitate queries and maintenance of data cubes. In [18] was developed several algorithms to compute set of aggregations related in data cube.

An advantage of the use of materialized views is that the time of queries processing is lesser than other methods. A disadvantage is the great space in disk that is necessary to store the queries. A solution can be the use of dedicated disk storage for managing materialized views, separate from DW, and not decreasing the DW performance [20].

3.6. Metadata

The metadata contains information about data used by several components of the architecture. The information contained in this repository is essential for the complete operation and maintenance of the components.

It is necessary to know the information on the data that compose the DW, as the datawere integrated and transformed, and as information and knowledge can be extracted from DW.

The structuring of the data stored in the metadata repository is done through rules defined in the ETL layer and by data models and descriptive tables of the DW layer.

Besides the data of DW, the metadata should store data relative to the logical project, data model of the data marts, of the control area and of the aggregate; data relative to the physical project; data relative to the maintenance; and data relative to the queries of the users

4. DW for Management in S&T

A DW for management in S&T was built following the defined steps for [17].

Initially a planning on the activities was made and had been identified the necessary resources.With the definition of the business requirements it was possible lists the data that need to be stored in the DW to give supported to the management in S&T.

The DW was projected according to the star schema. This modeling has been chosen for the already presented advantages in this research work and because dimensional modeling is widely accepted as the viable technique for delivering data to final users in a DW [1], [7], [13], [17].

A Bus Matrix was used to define data integration rules. This matrix represents the fact and dimensions tables previously structured from way to maintain integration between the data marts, as shown in Figure 2.

If occur superposition of dimensions in the data marts, is created a common area, as shown in DW layer of Figure 1, reducing the data replication between the data marts.

The ETL, Staging Area and DW layers were projected and implemented using resources as triggers and stored procedures of SGBD Oracle [15], besides tools as ERWin [6] for data modeling, Rational Rose [10] for the project of the layers and Delphi [8] for creation of the module of migration of the ETL layer.

The implementation of the migration module have been developed in Delphi 7.0 to make possible migration and integration of the data, because for each institution there was a different database of CAPES.

The extraction, transformation and load module of the ETL layer was implemented through stored-procedures built with the PL-SQL language of Oracle, solution that was quite efficient in performance and flexibility.

4.1. Data Model of Staging Area

The staging area modeling was made using the relational data model normalized together with techniques

of the dimensional modeling, thus avoiding redundancy of the data and facilitating the conversion of the data for DW. The tables were modeled according to the BUS matrix,and the attributes of the tables are based in the defined dimensions.

In the Figure 3 is shown a partial data model of staging area. The model is normalized, but it also possesses a modeling used in dimensional models in some relationships. This modeling is bases on the approach proposed by Kimball et al [17] in which for tables that possess fields withmore of one value it is created a mini-dimension to support the several values of the field. To relate this mini-dimension with the derived table, it is created a bridge table, which creates a group of values for each individual that possesses some fields in mini-dimension.

In the data model presents in the Figure 3 the tables in light gray indicate mini-dimensions, while the dark gray indicates bridge tables.

Using this modeling in staging area the data integration from stage area to DW is easier and fast.

This same approach is used in the DW for dimensions that possess fields with more of one value.

From the data sources until the data in the formats that appear in the model, shown in the Figure3, the following steps are executed:

Separating the data by granularity.
Cleaning values of attributes.
Transforming the data.
Changing the operational keys by substitute keys.
Guaranteeing the quality of the data.

The uses of these steps give the follows benefits to the data: quality, standardization and consistency, facilitating the data integration and the data load in the DW.

4.2. Data Model of DW

The DW was projected according to the star schema, however the star schema does not accept many-to-many relationships, and this type of relationship presented in the staging area of our project, was transformed into many-to-many relationships between facts and dimensions in the DW. According to Song et al [21], those relationships in a dimensional model causes several difficult issues, such as losing the star schema structure, increasing complexity in forming queries, and degrading query performance by adding more joins. Therefore, it is desirable that we handle the many-to-many relationships while still keeping the structure of the star schema. There are some works to solve this problem, as [22], [23], [24], [25], [26], [17]. Among these solutions, the chosen one was presented in [17].

The many-to-many relationships between facts and dimensions and the relationships between dimensions and mini-dimensions presented previously are shown in Figure 4. In Figure 4 it is also possible to verify that more than one fact table uses the same dimension, according to what was proposed with the use of the DW Bus architecture matrix.

4.3. Analysis Area

In this layer were made some studies using different types of tools.

Some case studies were accomplished with Oracle9i OLAP'S tool, but the need to follow very rigid steps through a pre-established model interfere a little his use. In the next section are presented some results of those case studies.

5 Case Studies

A DW is an inquiry database that can aid in the process of decision making. Thus, in this section the results of case studies are presented accomplished with DW built, that integrate data about Brazilian research institution and researchers.