
Title: Functional Architecture of the S-DWH
WP: 3    Deliverable: 3.3

Version     Author                  Date           NSI
2.5         Björn Berglund          April 2013     Statistics Sweden
2.7         Antonio Laureti Palma   June 2013      Istat
3.0 Final   Antonio Laureti Palma   October 2013   Istat

ESSnet
on micro data linking and data warehousing
in production of business statistics

Summary


1. Introduction
2. S-DWH Functional Architecture
2.1 S-DWH management functionalities
3. Statistical Business Process
3.1 Source layer
3.2 Integration layer
3.3 Interpretation layer
3.4 Access layer
4. S-DWH Functional diagram
4.1 About GSIM
4.2 FD Strategic functionalities
4.3 FD Strategic functionalities
4.4 S-DWH is a metadata-driven system
5. Examples of functional flow through the warehouse
5.1 General variable flow
5.2 Data editing
5.3 Derivation of value


1. Introduction

The main purpose of a data warehouse is to integrate and store the data generated as a result of an organization's activities. In a commercial enterprise, the DW system is one of several components of the production infrastructure and, using data coming from the different production departments, is generally used to optimize the supply chain or to support marketing.

In NSIs, where statistics on different topics are produced following stove-pipe-like production lines, i.e. independent statistical production processes, the DW is generally used to collect final aggregate data; in this configuration the DW is essentially an output system.

When the statistical production system is based on a data warehouse approach, different aggregate data on different topics should not be produced independently from each other but as integrated parts of a comprehensive information system. In this case statistical concepts and infrastructures are shared, and the data in a common statistical domain are stored once for multiple purposes. In all these cases, we define the statistical production approach as Statistical Data Warehousing (S-DWH) which is the central part of the whole IT infrastructure for supporting statistical production.

This corresponds to a system able to manage several kinds of data (micro, macro and meta) from different phases of a statistical production process. We must then identify a single entry point for managing different phases of a production process.

A few examples of the benefits that can be expected from a well-structured data warehouse are:

- Good data quality is achieved by consistent codes and descriptions

- Existing data can easily be combined and reused to create new outputs

- Historic data is maintained and accessible

- The data is structured to support analytic queries and to present the organization's information in a consistent way

- Different users can work in a single common environment

To describe a S-DWH, as a whole IT infrastructure, we will use the methodological approach used for Enterprise Architectures, i.e. the process of translating business visions and strategies into a structured documentation schema. From an IT point of view, this model is generally articulated in three main architecture domains: Business, Information Systems and Technology (ISO/IEC/IEEE 42010 [1]). The Business Architecture describes business processes, i.e. collections of structured activities or tasks that produce specific functionalities to deliver services or products for a particular customer or customers.

To describe the main high-level functionalities of the S-DWH from the users' viewpoint, we will introduce a Functional Architecture diagram; this will be described using the Generic Statistical Information Model (GSIM), following the Generic Statistical Business Process Model (GSBPM) convention when needed. The GSIM is a reference framework of internationally agreed definitions, attributes and relationships that describe the pieces of information used in the production of official statistics (information objects).

2. S-DWH Functional Architecture


In the following we define a generic S-DWH as a central statistical data store, regardless of the data's source, for managing all available data of interest, enabling the NSI to:

• (re)use data to create new data/new outputs;

• perform reporting;

• execute analysis;

• produce the necessary information.

This corresponds to a system able to manage several kinds of data (micro, macro and meta) across different phases of a statistical production process and different statistical domains.

Substituting a classic stove-pipe-like production process with a platform where different statistical domains are managed in a common environment means moving towards a new business architecture in official statistics production. In this way the whole target population is available, making it possible to integrate data and processes such as capturing, checking and then estimating data.

2.1 S-DWH management functionalities

The GSBPM recognizes nine over-arching statistical processes needed to support the nine phases of the generic statistical business process; they are:

I. Statistical program management
This includes systematic monitoring and reviewing of emerging information requirements and emerging and changing data sources across all statistical domains. It may result in the definition of new statistical business processes or the redesign of existing ones;

II. Quality management
This process includes quality assessment and control mechanisms. It recognizes the importance of evaluation and feedback throughout the statistical business process;

III. Metadata management
Metadata are generated and processed within each phase; there is, therefore, a strong requirement for a metadata management system to ensure that the appropriate metadata retain their links with data throughout the different phases;

IV. Statistical framework management
This includes developing standards, for example methodologies, concepts and classifications, that apply across multiple processes;

V. Knowledge management
This ensures that statistical business processes are repeatable, mainly through the maintenance of process documentation;

VI. Data management
This includes process-independent considerations such as general data security, custodianship and ownership;

VII. Process data management
This includes the management of data and metadata generated by, and providing information on, all parts of the statistical business process (process management being the ensemble of activities of planning and monitoring the performance of a process, and operations management the area of management concerned with overseeing, designing and controlling the production process and with redesigning business operations in the production of goods or services);

VIII. Provider management
This includes cross-process burden management, as well as topics such as profiling and management of contact information (and thus has particularly close links with statistical business processes that maintain registers);

IX. Customer management
This includes general marketing activities, promoting statistical literacy, and dealing with non-specific customer feedback.

In addition to these GSBPM over-arching processes, for an integrated system infrastructure for business statistics we should include five more over-arching processes:

X. S-DWH Management;

XI. Data Capturing Management;

XII. Output Management;

XIII. Web-Portal Management – this is a thematic web portal able to support specialized customer care for web-questionnaire compilation as well as general marketing activities, promoting statistical literacy, and dealing with non-specific customer feedback;

XIV. Business Register Management (or registers of institutions or civil registers) – this is a trade register kept by the registration authorities and is related to provider management and operational activities.

By definition, a S-DWH system includes all the processes needed to carry out any production process. Web-portal management is the contact point between respondents and the NSI: it supports several phases of the statistical business process, from collection to dissemination, and at the same time provides the necessary support for respondents.

Business Register management is an over-arching process, since the statistical, or legal, state of any enterprise is archived and updated at the beginning and at the end of any production process.

3. Statistical Business Process

The challenge in data warehouse environments is to integrate, rearrange and consolidate large volumes of data from different sources to provide a new, unified information base for business intelligence. To this aim, we propose a generic business process for the S-DWH divided into four groups of functionalities, each specialized in a data layer. The metadata used and produced in the different layers of the warehouse are defined in the Metadata framework [2] and the Micro data linking [3] documents.

3.1 Source layer

The Source layer is the gathering point for all data that is going to be stored in the Data warehouse. Input to the Source layer is data from both internal and external sources. Internal data is mainly data from surveys carried out by the NSI, but it can also be data from maintenance programs used for manipulating data in the Data warehouse. External data means administrative data, i.e. data collected by someone else, originally for some other purpose.

The structure of data in the Source layer depends on how the data is collected and on the designs of the various data collection processes. The specifications of the collection processes and their output, the data stored in the Source layer, have to be thoroughly described. Examples of vital information are the name, description and definition of a collected variable. The collection process itself must also be described, for example the source of a collected item, when it was collected and how.
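As an illustration only, the descriptive information mentioned above could be captured in a small structure such as the following Python sketch; the class and field names (CollectedVariableDescription, collection_mode, etc.) are hypothetical and simply mirror the examples of vital information just listed.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CollectedVariableDescription:
    """Hypothetical descriptor for a variable stored in the Source layer."""
    name: str             # e.g. "turnover"
    description: str      # human-readable explanation of the variable
    definition: str       # formal definition used at collection time
    source: str           # survey or administrative source that delivered the item
    collected_on: date    # when the item was collected
    collection_mode: str  # how it was collected, e.g. "web questionnaire"

# Example record accompanying a raw data delivery (values are illustrative)
turnover_meta = CollectedVariableDescription(
    name="turnover",
    description="Total turnover of the enterprise in the reference year",
    definition="Sum of invoiced sales excluding VAT",
    source="Structural Business Survey",
    collected_on=date(2013, 4, 15),
    collection_mode="web questionnaire",
)
```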

Activities according to the GSBPM model that the Source layer primarily supports are:

- 4.2 Set up collection. This ensures that the processes and technology are ready to collect data by:

• preparing web collection instruments

• training collection staff

• ensuring collection resources are available, e.g. laptops

• configuring collection systems to request and receive the data

• ensuring the security of data to be collected

- 4.3 Run collection. The collection is implemented, with different collection instruments being used to collect the data;

- 4.4 Finalize collection. This includes loading the collected data into a suitable electronic environment for further processing in the next layers, if necessary, as sketched below.
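As a minimal sketch of sub-process 4.4, and assuming a CSV delivery and an SQLite database standing in for the Source layer store (both purely illustrative), loading collected data without altering it might look like this:

```python
import csv
import io
import sqlite3

# Hypothetical raw delivery received from a web collection instrument
delivery = io.StringIO(
    "unit_id,variable,value,collected_on\n"
    "A1,turnover,120,2013-04-15\n"
    "A2,employees,4,2013-04-16\n"
)

con = sqlite3.connect(":memory:")  # stands in for the Source layer data store
con.execute(
    """CREATE TABLE IF NOT EXISTS raw_collection (
           unit_id TEXT, variable TEXT, value TEXT,
           source TEXT, collected_on TEXT)"""
)

# Load the delivery exactly as received, without any correction or coding
rows = [
    (r["unit_id"], r["variable"], r["value"], "SBS web survey", r["collected_on"])
    for r in csv.DictReader(delivery)
]
con.executemany("INSERT INTO raw_collection VALUES (?, ?, ?, ?, ?)", rows)
con.commit()
print(con.execute("SELECT COUNT(*) FROM raw_collection").fetchone()[0], "raw values stored")
```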


3.2 Integration layer

From the Source layer, data is loaded into the Integration layer. The process of extracting data from source systems and transforming it into useful content in the data warehouse is commonly called ETL, which stands for Extract, Transform, Load.

In the Extract step data is moved from the Source layer and made accessible in the Integration layer for further processing.

The Transformation step involves all the operational activities usually associated with the typical statistical production process; examples of activities carried out during the transformation are:

• Find and, if possible, correct incorrect data

• Transform data to formats matching standard formats in the data warehouse

• Classify and code

• Derive new values

• Combine data from multiple sources

• Clean data, that is, for example, correct misspellings, remove duplicates and handle missing values

To accomplish the different tasks in the transformation of new data into useful output, data already in the data warehouse is used to support the work. Examples of such usage are using existing data together with new data to derive a new value, or using old data as a basis for imputation.
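To make the Transformation step concrete, the following sketch, with hypothetical variables and a pandas-based implementation, illustrates three of the activities listed above: combining data from two sources, correcting an obviously wrong value with the help of data already in the warehouse, and deriving a new value.

```python
import pandas as pd

# New micro-data arriving from the Source layer (hypothetical variables)
new = pd.DataFrame({
    "unit_id": ["A1", "A2", "A3"],
    "turnover": [120.0, -5.0, 300.0],   # -5.0 is an obviously incorrect value
    "employees": [4, 2, 10],
})

# Data already in the warehouse, used to support the transformation
previous = pd.DataFrame({
    "unit_id": ["A1", "A2", "A3"],
    "turnover_last_year": [115.0, 40.0, 290.0],
})

# Combine data from multiple sources
df = new.merge(previous, on="unit_id")

# Find and, if possible, correct incorrect data: here a negative turnover is
# replaced by last year's value as a simple illustrative correction rule
bad = df["turnover"] < 0
df.loc[bad, "turnover"] = df.loc[bad, "turnover_last_year"]

# Derive a new value from existing variables
df["turnover_per_employee"] = df["turnover"] / df["employees"]

print(df[["unit_id", "turnover", "turnover_per_employee"]])
```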

Each variable in the data warehouse may be used for several different purposes in any number of specified outputs. That means that each variable has to be treated as an individual value rather than as a small part of a specific survey. As soon as a variable has been processed in the Integration layer in a way that makes it useful in the context of data warehouse output, it has to be loaded into the Interpretation layer and the Access layer.

The Integration layer is an area for processing; it is not accessible for end users and has no query or presentation interface besides the tools for the ETL process. Also, the data is only changeable through the ETL tools. When the transformation process is finished, the data is loaded into the Interpretation layer and the Access layer.

Since the focus of the Integration layer is on processing rather than on search and analysis, data in the Integration layer is stored in a strictly normalized structure. Normalized data, where each value is stored at exactly one point, makes it easier to maintain consistent data in an environment with many changes.

During the ETL process a variable will likely appear in several versions. Every time a value is corrected or changed for some other reason, the old value should not be erased; instead, a new version of that variable should be stored. This mechanism ensures that all items in the database can be followed over time.
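A minimal sketch of this versioning mechanism, assuming a normalized store in which each value of each variable is a single row: corrections never overwrite earlier values but append a new version, so every item can be followed over time (the table layout and column names are illustrative assumptions).

```python
import pandas as pd

# Normalized, append-only store: one value of one variable per row
versions = pd.DataFrame(columns=["unit_id", "variable", "value", "version", "valid_from"])

def store_value(store, unit_id, variable, value, timestamp):
    """Append a new version of a variable instead of overwriting the old one."""
    previous = store[(store["unit_id"] == unit_id) & (store["variable"] == variable)]
    new_version = previous["version"].max() + 1 if len(previous) else 1
    row = {"unit_id": unit_id, "variable": variable,
           "value": value, "version": new_version, "valid_from": timestamp}
    return pd.concat([store, pd.DataFrame([row])], ignore_index=True)

versions = store_value(versions, "A1", "turnover", 120.0, "2013-04-15")  # original value
versions = store_value(versions, "A1", "turnover", 125.0, "2013-05-02")  # later correction
print(versions)  # both versions remain available
```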

The Integration layer primarily supports GSBPM activities in phase 5 (Process) and phase 6 (Analyse), but also activities in phase 4 (Collect) and phase 7 (Disseminate).

Activities according to the GSBPM model that the Integration layer primarily supports are:

- 4.2 Set up collection;

- 5.1 Integrate data. This sub-process integrates data from one or more sources. Data integration typically includes record linkage routines and prioritizing, when two or more sources contain data for the same variable with potentially different values;

- 5.2 Classify & code. For example, automatic coding routines may assign numeric codes to text responses according to a pre-determined classification scheme;

- 5.3 Review, validate & edit. Inspection of each record to identify and, where necessary, correct potential problems, errors and discrepancies such as outliers, item non-response and miscoding. It may be run iteratively, validating data against predefined edit rules, usually in a set order;

- 5.4 Impute. Carried out when data are missing or unreliable. Estimates may be imputed, often using a rule-based approach (a sketch covering 5.3 and 5.4 follows after this list);

- 5.5 Derive new variables and statistical units. Derivation of new variables and statistical units from existing data using logical rules defined by statistical methodologists;

- 5.6 Calculate weights. Creation of weights for unit data records according to the defined methodology, automatically applied for each iteration;

- 5.7 Calculate aggregates. Creation of already defined aggregate data from micro-data;

- 5.8 Finalize data files. Compilation of the results of the production process, which will be used as input for dissemination;

- 6.1 Prepare draft output. The information produced is transformed into statistical outputs. Generally, it includes the production of additional measurements such as indices, trends or seasonally adjusted series, as well as the recording of quality characteristics;

- 7.1 Update output systems. Update of content in the Interpretation layer;

- 7.2 Produce dissemination products. Production of previously designed statistical products. Typical steps include:

• preparing the product components (explanatory text, tables, charts etc.)

• assembling the components into products

• editing the products and checking that they meet publication standards
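For sub-processes 5.3 and 5.4 above, a minimal sketch of rule-based validation and imputation could look as follows; the edit rules and the median imputation are illustrative assumptions, not prescribed methodology.

```python
import pandas as pd

# Hypothetical micro-data arriving in the Integration layer
data = pd.DataFrame({
    "unit_id": ["A1", "A2", "A3"],
    "turnover": [120.0, -50.0, 300.0],   # negative turnover will fail an edit rule
    "employees": [4.0, 2.0, None],       # missing value to be imputed
})

# 5.3 review, validate & edit: predefined edit rules applied in a set order
edit_rules = [
    ("turnover_positive", lambda d: d["turnover"] > 0),
    ("employees_non_negative", lambda d: d["employees"].fillna(0) >= 0),
]
for name, rule in edit_rules:
    failed = data.loc[~rule(data), "unit_id"].tolist()
    if failed:
        print(f"edit rule '{name}' flagged units: {failed}")

# 5.4 impute: a simple rule-based imputation, here the median of responding units
data["employees"] = data["employees"].fillna(data["employees"].median())
print(data)
```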

3.3 Interpretation layer

The Interpretation layer is the part of the data warehouse that people probably think of when talking about data warehouses. This layer contains all collected data, processed and structured to be optimized for analysis and as a base for the output planned by the NSI.

The Interpretation layer is specifically designed for statistical experts and is built to support data manipulation and large, complex search operations. Typical activities in the Interpretation layer are hypothesis testing, data mining and the design of new statistical strategies. The Interpretation layer is also where the data cubes for use in the Access layer are designed.

The Interpretation layer will contain sums, aggregations and calculated values, but it will also contain all data at the finest granular level in order to cover all possible queries. A fine granularity is also a condition for managing changes in the required output over time.

Versions of variables should also be stored to enable the data warehouse to support historic data. When a variable is loaded from the Integration layer into the Interpretation layer, the existing value of the variable should not be replaced; instead, a new version of the variable should be created.

Besides the actual data warehouse content, the Interpretation layer may contain temporary data structures and databases created and used by the different ongoing analysis projects carried out by statistics specialists.

The ETL process continuously creates metadata, regarding both the variables and the process itself, which is stored as part of the data warehouse.

Data in the Interpretation layer is stored in a dimensional structure to support data analysis in an intuitive and efficient way. A dimensional model is easy for the user to understand and gives good performance for search operations.

Dimensional models are de-normalized structures with fact tables and their belonging dimensions. Facts are generally numeric, and dimensions are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts, such as the number of products ordered and the price paid for the products, and into dimensions, such as order date, customer name and product number.
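The sales example above can be written down as a small star schema. The following sketch, with hypothetical tables and columns, shows a fact table joined to its dimensions to answer a typical analytic query (total quantity ordered per customer).

```python
import pandas as pd

# Dimension tables: reference information giving context to the facts
customers = pd.DataFrame({"customer_id": [1, 2], "customer_name": ["Acme", "Beta"]})
products = pd.DataFrame({"product_id": [10, 20], "product_name": ["Widget", "Gadget"]})

# Fact table: numeric measures keyed by the dimensions
sales = pd.DataFrame({
    "order_date": ["2013-04-01", "2013-04-02", "2013-04-02"],
    "customer_id": [1, 1, 2],
    "product_id": [10, 20, 10],
    "quantity": [5, 2, 7],
    "price_paid": [50.0, 40.0, 70.0],
})

# A typical analytic query: total quantity ordered per customer
result = (sales.merge(customers, on="customer_id")
               .groupby("customer_name")["quantity"].sum())
print(result)
```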

Activities according to the GSBPM model that the Interpretation layer primarily supports are:

- 1.5 Check data availability.

Assessment of whether current data sources can meet the requirements, and the conditions under which they would be available, including any restrictions on their use. The investigation of possible alternatives would normally include research of potential administrative data sources, to determine whether they would be suitable for statistical purposes. When existing sources have been assessed, a strategy for filling any remaining gaps in the data requirement is prepared. This sub-process also includes a more general check of the legal framework in which data would be collected and used, and may therefore identify proposals for changes to existing legislation or the introduction of a new legal framework.