ESSnet on Data Warehousing
Harry Goossens, Jos Dressen, Michel Lindelauf (CBS)

Title: Overview metadata models
WP: 1 - Metadata
Deliverable: 1.3
Version: 0.4 (draft)
Date: 9 May 2012
Authors: Jos Dressen, Michel Lindelauf, Harry Goossens
NSI: Statistics Netherlands (CBS)

ESSnet on micro data linking and data warehousing in production of business statistics

INDEX

1. Introduction

2. The use of metadata models and standards

2.1 International models and standards

2.2 Relevance

2.3 Subset mapping

3. Best practice cases

Appendix 1

1. Introduction

In the statistical data warehouse (S-DWH) the metadata satisfies 2 essential needs:

a.  to guide statisticians in processing and controlling the statistical production

b.  to inform end users by giving them insight into the exact meaning of statistical data

To fulfil these 2 essential functions, the statistical metadata must be:

▪  correct and reliable (the metadata must give a correct picture of the statistical data)

▪  consistent and coherent (the metadata driving the statistical processes and the reporting metadata presented to the end users must be compatible with each other)

▪  standardised and coordinated (the data of different statistics are described and documented in the same standardised way)

Since the different users of the (meta)data have diverse needs, it is essential to ensure effective management of the statistical metadata in the S-DWH. To realise this, the use of a metadata model is a key element in structuring and standardising the statistical metadata within an NSI in a generic way.

In the Metadata framework[1] (deliverable 1.1) the roles and purposes, definitions etc. of metadata in the statistical data warehouse are defined in generic terms. The framework defines a metadata model as follows:

[Def 3.6.1] A metadata model is a special case of a data model:
an abstract documentation of the structure of metadata
used by business processes.

In the context of the S-DWH at least 2 types of metadata models can be distinguished:

a.  a conceptual model, which usually gives a high-level overview of how the metadata is organised, managed, maintained etc.

b.  a physical model, which describes the details of the metadata objects and attributes, including the relations between the metadata objects.

Put more simply, a conceptual metadata model is a description of the overall metadata process(es), whereas a physical model is a structured description of the metadata elements.
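To make the distinction concrete, the following is a minimal sketch, in Python, of what a physical metadata model pins down: metadata objects, their attributes and the relations between the objects. All class names, object names and attributes used here (MetadataObject, Variable, ValueDomain and so on) are hypothetical illustrations; they are not taken from the framework or from any of the standards discussed in this document.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataAttribute:
    """One attribute of a metadata object (hypothetical structure)."""
    name: str
    datatype: str
    mandatory: bool = True


@dataclass
class MetadataObject:
    """A metadata object together with its attributes."""
    name: str
    attributes: list = field(default_factory=list)


@dataclass
class Relation:
    """A directed relation between two metadata objects."""
    source: str
    target: str
    cardinality: str


# Illustrative content only: two objects and one relation between them.
variable = MetadataObject("Variable", [
    MetadataAttribute("name", "string"),
    MetadataAttribute("definition", "text"),
    MetadataAttribute("unit_of_measure", "string", mandatory=False),
])
value_domain = MetadataObject("ValueDomain", [
    MetadataAttribute("name", "string"),
    MetadataAttribute("datatype", "string"),
])
relations = [Relation("Variable", "ValueDomain", "n:1")]


def mandatory_attributes(obj):
    """Return the names of the attributes that must always be filled in."""
    return [a.name for a in obj.attributes if a.mandatory]


def related_objects(name, relations):
    """Return the targets of all relations that start at the named object."""
    return [r.target for r in relations if r.source == name]


print(mandatory_attributes(variable))
print(related_objects("Variable", relations))
```

A conceptual model would not list fields like these; it would instead describe who creates, maintains and uses such objects, and at which stage of the statistical production.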

Alongside the term (metadata) model, the term standard also needs to be considered, as the two are often used together or even interchangeably. The following general definition of a model is commonly accepted:

‘A model is a simplified description of an analogue part of the reality.’

The term norm is often used as a synonym for standard. The following general definition of standard/norm is commonly accepted:

‘A standard or norm is a document with recognized agreements, specifications or criteria about a product, service or method.’

Comparing the two terms, a standard/norm generally defines WHAT is to be done, whereas a model describes HOW to do it.


For example:

the standard/norm ISO/IEC 11179 is an international standard that defines how metadata are represented in a metadata registry, without prescribing a physical representation,

whereas

the Nordic Metamodel provides a basis for organizing and managing metadata, as it describes the metadata systems that are being used in NSIs.

In the context of the S-DWH, a metadata model is a standardized representation used to define all necessary metadata elements of statistical information systems, based upon and using one or more standards/norms. In these implementations, standards act as checklists for verifying the completeness and correctness of all metadata elements described by the model.

In this document we focus on the use of metadata models and standards, providing a framework for capturing, maintaining and understanding the metadata when describing statistical data.

2. The use of metadata models and standards

In 2011 this ESSnet sent out a questionnaire to take stock of best practices in the ESS member states, which also included some questions about metadata. One of the issues mentioned as most important to focus on when developing a DWH was the use of metadata to drive the process.

Remarkably, almost all NSIs state that metadata are “important” or “extremely important” for DWH systems, yet 19 of the 24 NSIs admit that metadata are currently implemented in only a few systems. Further inquiries revealed that one reason for this apparent contradiction is that current metadata models are considered complex and cumbersome to deal with. Hence, one challenge for the ESSnet might be to provide recommendations on metadata models that are relatively easy to manage and that allow us to drive DWH systems.

From the metadata perspective, the ultimate goal is to use one single model for statistical metadata, covering the total life cycle of statistical production. But considering the great variety in statistical production processes (e.g. surveys, micro data analysis or aggregated output), each with its own requirements for handling metadata, it is very difficult, and not very likely, that one single model can be agreed upon. The biggest risk of working with several models is duplication of metadata, which should of course be avoided. This can best be achieved by using standards for describing and handling statistical metadata.

2.1 International models and standards

Chapter 1.3 of the Metadata Framework briefly describes the models and standards considered most relevant for an S-DWH. Part B of the Common Metadata Framework of the METIS group gives a more complete overview of concepts, standards, and models:

http://www1.unece.org/stat/platform/display/metis/The+Common+Metadata+Framework

The most important standards in relationship to the use of metadata models are:

▪  ISO/IEC 11179-3 [2]
ISO/IEC 11179 is a well-established international standard for representing metadata in a metadata registry. It has two main purposes: definition and exchange of concepts. Thus it describes the semantics and concepts, but does not handle the physical representation of the data. It aims to be a standard for metadata-driven exchange of data in heterogeneous environments, based on exact definitions of data.
Of particular interest is Part 3: Registry metamodel and basic attributes.
The primary purpose of Part 3 is to specify the structure of a metadata registry and the basic attributes required to describe metadata items; these attributes may also be used in situations where a complete metadata registry is not appropriate.

▪  Neuchâtel Model - Classifications and Variables
The main purpose of this model is to provide a common language and a common perception of the structure of classifications and the links between them. The original model was extended with variables and related concepts. The discussion includes concepts like object types, statistical unit types, statistical characteristics, value domains, populations etc.
The two models together claim to provide a more comprehensive description of the structure of statistical information embodied in data items.
Intended use: For setting up metadata models and frameworks inside statistical offices several models are used as a source or starting point. The Neuchâtel model is one of those models.
References - Classifications: http://www1.unece.org/stat/platform/download/attachments/14319930/Part+I+Neuchatel_version+2_1.pdf?version=1

References - Variables: http://www1.unece.org/stat/platform/download/attachments/14319930/Neuchatel+Model+V1.pdf?version=1

▪  Corporate Metadata Repository Model (CMR)

This statistical metadata model integrates a developmental version of edition 2 of ISO/IEC 11179 and a business data model derivable from the Generic Statistical Business Process Model.
It includes the constructs necessary for a registry. Forms of this model are in use at the US Census Bureau and at Statistics Canada.

Intended use: The model is a framework for managing all the statistical metadata of a statistical office. It accounts for survey, census, administrative, and derived data; and it accounts for the entire survey life-cycle.
References:
http://www.unece.org/stats/documents/1998/02/metis/11.e.pdf for an overview paper on the subject. See also Gillman, D. W. "Corporate Metadata Repository (CMR) Model", Invited Paper, University of Edinburgh - Proceedings of First MetaNet Conference, Voorburg, Netherlands, 2001.

Relationships to other standards:

ISO/IEC 11179 and Generic Statistical Business Process Model

▪  Nordic Metamodel, version 2.2

The Nordic Metamodel was developed by Statistics Sweden, and has become increasingly linked with their popular "PC-Axis" suite of dissemination software. It provides a basis for organizing and managing metadata for data cubes in a relational database environment.

Intended Use: The Nordic Metamodel is used to describe the metadata system behind several implementations of PC-Axis in national and international statistical organizations, particularly those using MS SQL Server as a platform.

Maintenance organization: Statistics Sweden (with input from the PC-Axis Reference Group)

References: PC AXIS SQL metadata base

▪  Common Warehouse Metamodel (CWM)

Specification for the metadata in support of exchange of data between tools.

Intended use: As a means for recording the metadata to achieve data exchange between tools.

Maintenance organization: OMG - Object Management Group

ISO Standard Number: ISO/IEC 19504

References: See OMG web site (http://www.omg.org), and specifically http://www.omg.org/technology/documents/formal/cwm_mip.htm

▪  SDMX
Statistical Data and Metadata eXchange, SDMX, was initiated by seven international organisations to foster standards for the exchange of statistical information. SDMX has its focus on macro data, even though the model also supports micro data. It is an adopted standard for delivering and sharing data between NSIs and Eurostat. Sharing the results from the latest Population Census is perhaps the most advanced example, so far.


Recently, SDMX has increasingly evolved into a framework with several sub-frameworks for specific uses:

-  ESMS

-  SDMX-IM

-  ESQRS

-  MCV

-  MSD

References: See SDMX web site (http://sdmx.org), and specifically http://sdmx.org/?page_id=10 for standards

▪  DDI
The Data Documentation Initiative (DDI) has its roots in the data archive environment, but with its latest development, DDI 3 or DDI Lifecycle, it has become an increasingly interesting option for NSIs. DDI is an effort to create an international standard for describing data from the social, behavioural, and economic sciences. It is based on XML. DDI is supported by a non-profit international organisation, the DDI Alliance.
References: http://www.ddialliance.org

▪  GSIM
The Generic Statistical Information Model (GSIM) is a reference framework of information objects, which enables generic descriptions of data and metadata definition, management, and use throughout the statistical production process. As a common reference framework for information objects, the GSIM will facilitate the modernisation of statistical production by improving communication at different levels:

-  Between the different roles in statistical production
(statisticians, methodologists and information technology experts);

-  Between the statistical subject matter domains;

-  Between statistical organisations at the national and international levels.

The GSIM is designed to be complementary to other international standards, particularly the Generic Statistical Business Process Model (GSBPM). It should not be seen in isolation, and should be used in combination with other standards.
References:
Website
http://www1.unece.org/stat/platform/display/metis/Generic+Statistical+Information+Model+(GSIM)

GSIM Version 0.3

http://www1.unece.org/stat/platform/download/attachments/65373325/GSIM+v0_3.doc?version=1

▪  MMX metadata framework
The MMX metadata framework is not an international standard; it is a specific adaptation of several standards by a commercial company. The MMX Metamodel provides a storage mechanism for various knowledge models. The data model underlying the metadata framework is more abstract in nature than metadata models in general. The MMX framework is used by Statistics Estonia, so it needs to be considered from the point of view of practical experience.

Appendix 1 gives a more comprehensive and thorough overview of models and standards.

2.2 Relevance

As not all models/standards listed above are relevant in the context of the S-DWH, we selected those that are relevant and need to be studied in more depth. For this we used the following four selection criteria:

| Criterion | Questions |
|---|---|
| Topicality | Date of last change or last reference on the internet? Are there (still) new developments of the model/standard? |
| Support | Is there an organisation that is in charge of the maintenance of the standard/model? |
| Usage | How extensive is the usage of the model? Are there many or few users? |
| Usability | Is the model/standard difficult or easy to use? Do we think it is usable in the context of the S-DWH? |

We made a first selection of relevance by scoring each model/standard on the four criteria:

Scores per criterion and the resulting advice:

| Model / Standard | Topicality | Support | Usage | Usability | Advice |
|---|---|---|---|---|---|
| ISO/IEC 11179-3 | + | +/- | +/- | +/- | relevant |
| Neuchâtel Model | +/- | +/- | + | +/- | relevant |
| CMR | - | - | +/- | - | not relevant |
| Nordic Metamodel | + | + | + | + | relevant |
| CWM | - | - | +/- | - | not relevant |
| SDMX: | + | + | + | ? | relevant |
| * SDMX-IM | +/- | +/- | +/- | +/- | relevant |
| * EPMS | ? | ? | ? | ? | unclear |
| * ESMS | + | ? | ? | ? | relevant |
| * ESQRS | + | ? | ? | ? | relevant |
| * MCV | ? | ? | - | ? | not relevant |
| * MSD | ? | ? | ? | ? | unclear |
| DDI | + | +/- | +/- | ? | relevant |
| GSBPM | + | +/- | +/- | ? | relevant |
| GSIM | + | +/- | +/- | ? | relevant |
| MMX | +/- | +/- | +/- | +/- | relevant |

Legend:
+ = good
+/- = moderate
- = not good
? = more research needed


In this first selection we applied the following rules:

▪  A model/standard is rated relevant if it has at least 1 ‘+’

▪  A model/standard is rated not relevant if it has at least 1 ‘-’

▪  A model/standard is rated unclear if it has mainly ‘?’

▪  If a model/standard has mainly ‘+/-’, we considered the overall context:

-  SDMX-IM is rated relevant as it is a subset of SDMX

-  MMX is rated relevant as it is a key element in the new S-DWH of Statistics Estonia, and we want to consider it in the general discussion.
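The selection rules can be written down as a small decision procedure. The Python sketch below is one possible reading of them and makes two assumptions that the text does not state explicitly: the '+' rule is checked before the '-' rule (no row in the table contains both, so the order does not affect the outcome here), and 'mainly' is taken to mean more than half of the four scores. The names SCORES, ADVICE, CONTEXT_OVERRIDES and rate are hypothetical. The scores and advice values are copied from the table in this section.

```python
# Scores per model/standard, in the order:
# topicality, support, usage, usability (copied from the table).
SCORES = {
    "ISO/IEC 11179-3": ["+", "+/-", "+/-", "+/-"],
    "Neuchâtel Model": ["+/-", "+/-", "+", "+/-"],
    "CMR": ["-", "-", "+/-", "-"],
    "Nordic Metamodel": ["+", "+", "+", "+"],
    "CWM": ["-", "-", "+/-", "-"],
    "SDMX": ["+", "+", "+", "?"],
    "SDMX-IM": ["+/-", "+/-", "+/-", "+/-"],
    "EPMS": ["?", "?", "?", "?"],
    "ESMS": ["+", "?", "?", "?"],
    "ESQRS": ["+", "?", "?", "?"],
    "MCV": ["?", "?", "-", "?"],
    "MSD": ["?", "?", "?", "?"],
    "DDI": ["+", "+/-", "+/-", "?"],
    "GSBPM": ["+", "+/-", "+/-", "?"],
    "GSIM": ["+", "+/-", "+/-", "?"],
    "MMX": ["+/-", "+/-", "+/-", "+/-"],
}

# Advice column of the table, used only to check the rules against it.
ADVICE = {
    "ISO/IEC 11179-3": "relevant", "Neuchâtel Model": "relevant",
    "CMR": "not relevant", "Nordic Metamodel": "relevant",
    "CWM": "not relevant", "SDMX": "relevant", "SDMX-IM": "relevant",
    "EPMS": "unclear", "ESMS": "relevant", "ESQRS": "relevant",
    "MCV": "not relevant", "MSD": "unclear", "DDI": "relevant",
    "GSBPM": "relevant", "GSIM": "relevant", "MMX": "relevant",
}

# Cases with mainly '+/-' that were decided on the overall context.
CONTEXT_OVERRIDES = {"SDMX-IM": "relevant", "MMX": "relevant"}


def rate(name, scores):
    """Apply the selection rules to one row of scores."""
    if "+" in scores:            # at least one '+' (assumed to be checked first)
        return "relevant"
    if "-" in scores:            # at least one '-'
        return "not relevant"
    if scores.count("?") > len(scores) / 2:   # assumption: 'mainly' = more than half
        return "unclear"
    # Mainly '+/-': decided case by case on the overall context.
    return CONTEXT_OVERRIDES.get(name, "unclear")


# List every model whose computed rating differs from the table.
mismatches = [n for n, s in SCORES.items() if rate(n, s) != ADVICE[n]]
print(mismatches)
```

Under these assumptions the rules reproduce every entry in the advice column, so the list of mismatches is empty.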

2.3 Subset mapping

Before the relevant models/standards can (possibly) be used in the S-DWH, they first need to be mapped onto the metadata subsets from the framework. The goal is to indicate, for each subset, which models/standards are useful and should be considered.

The GSBPM is not included in this mapping, as WP 3 has already mapped the GSBPM onto the S-DWH (deliverable 3.1).

Mapping of the models/standards (rows) onto the metadata subsets (columns):

| Model / Standard | Statistical | Process | Quality | Technical | Authorisation |
|---|---|---|---|---|---|
| ISO/IEC 11179-3 | no | yes | no | no | ? |
| Neuchâtel Model | yes | no | no | no | ? |
| Nordic Metamodel | yes | no | no | no | ? |
| SDMX: | yes | ? | no | yes | ? |
| * SDMX-IM | no | no | no | yes | ? |
| * ESMS | yes | yes | ? | no | ? |
| * ESQRS | no | no | yes | no | ? |
| DDI | yes | no | no | yes | ? |
| GSIM | yes | yes | no | ? | ? |
| MMX | yes | yes | no | yes | yes |
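Since the goal of the mapping is to find, for each subset, the models/standards worth considering, the table can also be read column by column. The Python sketch below shows one way to do this lookup; the names SUBSETS, MAPPING and candidates are hypothetical, and the values are copied from the mapping table in this section. Treating '?' as an optional "still open" candidate is an assumption made for this illustration.

```python
# Column order of the mapping table.
SUBSETS = ["statistical", "process", "quality", "technical", "authorisation"]

# One row per model/standard, values copied from the mapping table.
MAPPING = {
    "ISO/IEC 11179-3": ["no", "yes", "no", "no", "?"],
    "Neuchâtel Model": ["yes", "no", "no", "no", "?"],
    "Nordic Metamodel": ["yes", "no", "no", "no", "?"],
    "SDMX": ["yes", "?", "no", "yes", "?"],
    "SDMX-IM": ["no", "no", "no", "yes", "?"],
    "ESMS": ["yes", "yes", "?", "no", "?"],
    "ESQRS": ["no", "no", "yes", "no", "?"],
    "DDI": ["yes", "no", "no", "yes", "?"],
    "GSIM": ["yes", "yes", "no", "?", "?"],
    "MMX": ["yes", "yes", "no", "yes", "yes"],
}


def candidates(subset, include_open=False):
    """List the models/standards marked 'yes' for a metadata subset.

    With include_open=True, entries marked '?' are listed as well
    (assumption: '?' means the mapping has not been decided yet).
    """
    column = SUBSETS.index(subset)
    wanted = {"yes", "?"} if include_open else {"yes"}
    return [name for name, row in MAPPING.items() if row[column] in wanted]


for subset in SUBSETS:
    print(subset, "->", candidates(subset))
```

Read this way, the table shows for example that only one entry is marked 'yes' for the quality subset and only one for the authorisation subset, while the remaining authorisation entries are still open.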

3. Best practice cases

Based on the information from the stocktaking, uniform best practice descriptions are made for the specific NSIs, with special focus on the use, role and function of metadata in the S-DWH. These case studies provide good insight into the NSIs with the most developed metadata systems. For more specific information on the (possible) use of a metadata model, further research will be performed.