SDMX Data Modeling Tutorial

Handbook for statisticians on how to use SDMX

Contents

1. Introduction

2. Microdata and macrodata, the statistical characteristic

2.1 Microdata and Macrodata

2.2 Statistical characteristic and object characteristic

3. The SDMX Information Model – principles

3.1 Reference metadata and structural metadata in SDMX

4. Comparison between SDMX and Statistical concepts

4.1 Data Structure Definition and related artefacts

4.2 Statistical concepts

4.3 Statistics and SDMX

5. Modelling data

5.1 From a statistical multi-dimensional table to a DSD

5.2 From a statistical survey to a Data Structure Definition

5.2.1 Definition of concepts

5.2.2 Measure

5.2.3 Dimensions

5.2.4 Attributes

5.2.5 Code lists

5.2.5.6 Definition of DSD and mapping of data

6. SDMX in practice

7. Conclusion

8. Appendix

5 Appendix A – Bibliography

1. Introduction

SDMX supports the statistical process not only in the dissemination phase, through the definition of standard data and metadata messages, but also in the data modelling phase, through guidelines for defining the metadata that describe the data (structural metadata)[1]. The main aim of this handbook is to show statisticians how to model aggregated statistical data using SDMX.

This handbook is aimed at people working in production services, but it is also intended for metadata experts who are responsible for the variable and classification definitions used in data dissemination. It is also addressed to readers who already have some knowledge of the SDMX standard and want a hands-on approach to data modelling. Any in-depth treatment of SDMX is left to the official documentation.

The reader will learn how to model data through a practical survey example: the Italian Labour Force Survey.

Chapter 2 contains two extracts about “The statistical business process”.

Chapter 3 describes some artefacts from the SDMX Information Model.

Chapter 4 represents the “bridge” between a statistical cube and SDMX.

Chapter 5 shows the modelling of data using the Istat Labour Force Survey.

Chapter 6 shows some tools that can be useful in this context.

2. Microdata and macrodata, the statistical characteristic

This chapter quotes extracts from two documents about data modelling and statistical business architecture. They serve as an introduction to the statistical characteristics of aggregated data.

2.1 Microdata and Macrodata

“[…] Microdata are data about individual objects (persons, companies, events, transactions, etc). Objects have properties which are often expressed as values of variables of the objects. For example, a ”person” object may have values of variables such as ”name”, ”address”, ”age”, ”income”. Microdata represent observed or derived values of certain variables for certain objects.

Macrodata, "statistics", are estimated values of statistical characteristics concerning sets of objects, "populations". A statistical characteristic is a measure that summarizes the values of a certain variable of the objects in a population. ”The average age of persons living in OECD countries” is an example of a statistical characteristic. Some statistical characteristics, e.g. correlations, summarise the values of more than one variable. Macrodata represent estimated values of statistical characteristics. Estimated values deviate from true values because of different imperfections (errors and uncertainties) in the underlying observation (measurement) and derivation processes. The difference between ”estimated” and ”true” values is an issue not only on the macro level, but also on the micro level, since the observed (measured) values deviate from the true values because of measurement errors.

Statistical metadata are data describing different quality aspects of statistical data, e.g.

· contents aspects, describing definitions of objects, populations, variables, etc;

· accuracy aspects, describing different kinds of deviations between observed/estimated and true values of variables and statistical characteristics;

· availability aspects, describing which statistical data are available, where they are located, and how they can be accessed.” (United Nations – 1999)
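The distinction between microdata and macrodata can be illustrated with a minimal Python sketch. The records, field names and values below are invented for illustration only; they do not come from any real survey.

```python
from statistics import mean

# Hypothetical microdata: one record per object (here, a person).
microdata = [
    {"name": "A", "country": "IT", "age": 34},
    {"name": "B", "country": "IT", "age": 52},
    {"name": "C", "country": "FR", "age": 41},
    {"name": "D", "country": "FR", "age": 29},
]


def average_age(records):
    """Macrodata: a single value summarizing one variable over a population."""
    return mean(r["age"] for r in records)


def average_age_by_country(records):
    """The same statistical characteristic, computed for each subpopulation."""
    groups = {}
    for r in records:
        groups.setdefault(r["country"], []).append(r)
    return {country: average_age(rs) for country, rs in groups.items()}


print(average_age(microdata))
print(average_age_by_country(microdata))
```

Each dictionary in `microdata` describes one object, while the values returned by the two functions describe a whole population or subpopulation. That is the step from the micro level to the macro level.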

2.2 Statistical characteristic and object characteristic

“[…]A statistical characteristic is defined by a triple <O, V, f>

where

· O is a set of objects (or object vectors), called a population;

· V is a variable (or a vector of variables) having values for the objects in the population;

· f is an operator, called a statistical measure, producing a value f(O, V) for the population from the values of the variables for the objects in the population.

Typical examples of statistical measures are frequency count, sum, average, and variance.

The population is often structured into subpopulations, for which estimates are produced as well.

Time usually plays an important role in the definition of a statistical characteristic. The population is often defined as the set of objects of a certain type, having a certain property (or combination of properties) in common at a certain point of time. Alternatively, the population can be defined as the set of objects of a certain type that have been born, lived, or died during a certain time period, e.g. the events of a certain type that have occurred during a certain year, or the processes of a certain kind that have started, been on-going, or stopped during a certain month.

The variable V must usually be qualified by a time parameter, too, in order to ensure that every object in the population is associated with a unique value (or set of values, in the case of multivalued variables). If V is a set of variables, all the variables may be separately qualified by (possibly different) time parameters.

Some examples of statistical characteristics:

· the number of people living in Canada at the end of 1996;

· the average income of people living in France at the end of 1996;

· the total value in current US dollars of the production of commodities in the United States during the first quarter of 1996;

· the number of road accidents that have occurred in Germany during 1996;

· the average length of hospital treatments that was on-going in Holland during (at least) some part of 1992;

· the average percentage increase/decrease between 1995 and 1996 in the annual income of people living in Sweden during the whole of the two-year period 1995-1996.

The (true) value of a statistical characteristic is derived (by means of an aggregation process) from the (true) values of one or more sets of object characteristics. The estimated value of a statistical characteristic is derived (by means of another aggregation process, called the estimation procedure) from the observed values of (possibly the same) sets of object characteristics, the so-called observation characteristics.

An object characteristic is defined by an ordered pair <O, V>

where

· O is a set of objects (or object vectors), called a population;

· V is a variable (or an object relation), having values for the objects in the population.

Time plays a similar role in the definition of an object characteristic as it does for the definition of a statistical characteristic.

Each object (or object vector) in the population is associated with one instance of the object characteristic. At any particular time t, each object (or object vector) in the population is associated with a unique value of V (or with a unique set of values of V in the case of multivalued variables).” (United Nations – 1995)
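The two definitions quoted above can be sketched in Python as follows. This is only one possible reading, assuming that the population O is expressed as a membership test and the variable V as a function of a single object. The class names, field names and sample records are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class ObjectCharacteristic:
    """The ordered pair <O, V>."""
    population: Callable[[dict], bool]   # O: does the object belong to the population?
    variable: Callable[[dict], float]    # V: value of the variable for one object

    def values(self, objects):
        return [self.variable(o) for o in objects if self.population(o)]


@dataclass(frozen=True)
class StatisticalCharacteristic:
    """The triple <O, V, f>: an object characteristic plus a statistical measure."""
    characteristic: ObjectCharacteristic
    measure: Callable[[list], float]     # f: e.g. len (frequency count), sum, mean

    def value(self, objects):
        return self.measure(self.characteristic.values(objects))


# Invented records; both O and V are qualified by a time parameter (1996).
people = [
    {"country_end_1996": "FR", "income_1996": 20000.0},
    {"country_end_1996": "FR", "income_1996": 30000.0},
    {"country_end_1996": "CA", "income_1996": 40000.0},
]

in_france = ObjectCharacteristic(
    population=lambda p: p["country_end_1996"] == "FR",
    variable=lambda p: p["income_1996"],
)
average_income_fr = StatisticalCharacteristic(in_france, mean)
number_of_people_fr = StatisticalCharacteristic(in_france, len)

print(average_income_fr.value(people))
print(number_of_people_fr.value(people))
```

Swapping only the `measure` turns "the average income of people living in France at the end of 1996" into a frequency count over the same population, which shows that O, V and f vary independently.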

3. The SDMX Information Model – principles

The SDMX Information Model (SDMX-IM) provides a broad set of formal objects to model statistical data and metadata. Figure 1 shows the high-level SDMX artefacts detailed in the SDMX-IM.

Figure 1 - SDMX-IM

· A Data (metadata) Provider is a statistical organization that provides data and metadata to other organizations that act as Data (metadata) Collectors;

· The exchange is often based on an agreement (Provision Agreement) between the Provider and the Collector. A Provision Agreement specifies which data (metadata) sets have to be exchanged between the Data Provider and the Data Collector, and when;

· The Data (Metadata) Structure Definition (DSD/MSD) specifies a set of concepts which describe and identify a set of data (metadata);

· The Data and Metadata flows represent the containers defined by the DSD or MSD for the sets of data and metadata (Data Set and Metadata Set);

· The Category and Category Scheme represent a way of grouping Data Sets under a common subject theme.
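As a rough aid to reading the list above, the relationships between these artefacts can be sketched with a few Python data classes. This is a simplified illustration, not the SDMX-IM itself: the class layout, the identifiers, the organization names used as provider and collector, and the `schedule` field are all assumptions made for the example. Data sets are represented here by the data flow that contains them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataStructureDefinition:
    """Set of concepts that describe and identify a set of data."""
    id: str
    concepts: List[str]


@dataclass
class Dataflow:
    """Container, defined by a DSD, for the data sets being exchanged."""
    id: str
    structure: DataStructureDefinition


@dataclass
class ProvisionAgreement:
    """Which data are exchanged between provider and collector, and when."""
    provider: str
    collector: str
    dataflow: Dataflow
    schedule: str


@dataclass
class Category:
    """Subject theme used to group related data."""
    name: str
    dataflows: List[Dataflow] = field(default_factory=list)


# Hypothetical identifiers and organizations, for illustration only.
dsd = DataStructureDefinition("DSD_LFS", ["REF_AREA", "SEX", "TIME_PERIOD", "OBS_VALUE"])
flow = Dataflow("DF_LFS_EMPLOYMENT", dsd)
agreement = ProvisionAgreement("Istat", "Eurostat", flow, "quarterly")
labour = Category("Labour market", [flow])

print(agreement.dataflow.structure.concepts)
```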

This tutorial focuses on how to build a Data Structure Definition according to the SDMX description of aggregated statistical data. The Metadata Structure Definition is therefore left to other manuals and tutorials, which can be found on the SDMX.org website.

3.1 Reference metadata and structural metadata in SDMX

In SDMX, we can find Structural metadata and Reference metadata. Figure 2 shows the relationship between them and Data or Metadata Set.

Figure 2 - Relationship between DSD & MSD

· A DSD describes the information structure within a specific statistical domain, thus allowing a complete description of a set of data if all values are given. A limited number of specific concepts are needed for DSDs to function properly;

· An MSD describes how a metadata set is organized. In particular, it defines which reference metadata are being compiled, how the concepts are related to each other, how they are represented (either as free text or coded values) and with which object types (agencies, data flows, data providers, etc.) they are associated.

Structural metadata act as identifiers and descriptors of data sets and reference metadata sets. Therefore, DSD and MSD consist of structural metadata.

Reference (or explanatory) metadata describe the contents and the quality of the statistical data. They include conceptual metadata, describing the concepts used and their practical implementation; methodological metadata, describing the methods used to generate the data (e.g. sampling, collection methods, editing processes); and quality metadata, describing the different quality dimensions of the resulting statistics (e.g. timeliness, accuracy).

4. Comparison between SDMX and Statistical concepts

4.1 Data Structure Definition and related artefacts

The Data Structure Definition is “a set of concepts which describe and identify a set of data. It tells which concepts are dimensions and which are attributes, and it gives the attachment level for each of these concepts, based on the packaging structure (Data Set, Group, Series, Observation) as well as their status (mandatory versus conditional). It also specifies which code lists provide possible values for the dimensions, as well as the possible values for the attributes, either as code lists or free text fields [….]” (SDMX Initiative – 2005).

Figure 3 shows the relationship between DSD, Concepts, Code lists, Dimensions, Attributes and Measures.

Figure 3 - DSD artefacts

Concepts are the SDMX artefacts needed to interpret the data. They can be divided into:

· Dimensions represent those concepts that identify and describe the data;

· Attributes represent statistical concepts providing qualitative information about a specific statistical object such as a data set, observation, data provider, or dataflow. Concepts such as units, currency of denomination, observation status, titles and methodological comments can be used as attributes in the context of an agreed data exchange (from MCV[2]);

· Measures represent the measure of the phenomenon or phenomena.

Code lists are collections of items that provide the possible values for concepts. A code list consists of a set of codes, each with a description. In a hierarchical code list, a code can also be associated with another code of the same code list (its parent code) to establish a simple hierarchy.
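The artefacts described in this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not an SDMX implementation: the class names, the `role` and `attachment` fields, the concept identifiers and the sample area codes are chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Code:
    id: str
    description: str
    parent: Optional[str] = None   # id of the parent code, if the list is hierarchical


@dataclass
class CodeList:
    id: str
    codes: Dict[str, Code] = field(default_factory=dict)

    def add(self, id, description, parent=None):
        if parent is not None and parent not in self.codes:
            raise ValueError(f"unknown parent code: {parent}")
        self.codes[id] = Code(id, description, parent)

    def ancestors(self, id):
        """Walk up the simple hierarchy from a code to the top-level code."""
        chain = []
        parent = self.codes[id].parent
        while parent is not None:
            chain.append(parent)
            parent = self.codes[parent].parent
        return chain


@dataclass
class Component:
    concept: str
    role: str                              # "dimension", "attribute" or "measure"
    code_list: Optional[CodeList] = None   # None means an uncoded (free text or numeric) value
    attachment: str = "Observation"        # Data Set, Group, Series or Observation
    mandatory: bool = True


# Illustrative hierarchical code list for a geographical area concept.
area = CodeList("CL_AREA")
area.add("IT", "Italy")
area.add("ITC", "North-West", parent="IT")
area.add("ITC1", "Piemonte", parent="ITC")

# A DSD reduced to its list of components (identifiers are illustrative).
dsd = [
    Component("REF_AREA", "dimension", code_list=area),
    Component("OBS_VALUE", "measure"),
    Component("OBS_STATUS", "attribute", mandatory=False),
]

print(area.ancestors("ITC1"))
print([c.concept for c in dsd if c.role == "dimension"])
```

Keeping the parent as a code identifier inside the same list mirrors the "simple hierarchy" described above: every code has at most one parent, and the parent must already belong to the code list.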

4.2 Statistical concepts

How does SDMX fit with statistics? The following list is just a rule of thumb, but it can help statisticians organize all the elements that define a table.

The concepts must include the following sets of elements, which were introduced in section 2.2:

· statistical variables: the variables to which the operator is applied (named V in section 2.2);

· statistical measures/operators (named f in section 2.2): this set of concepts should include the statistical operators (average, total, count, index number, …) and their possible characteristics (base year for an index number, adjustment for data that can be seasonally adjusted, …). This set could also include the concepts used to represent data (number of decimals, unit multiplier, …), to measure data (unit of measure, …) and to disseminate data (whether data are definitive, estimated or forecasted, …);

· statistical population (named O in section 2.2): this set describes the reference group of elements over which the statistical variables are observed and for which the statistical operator computes the corresponding data.

4.3 Statistics and SDMX

How can SDMX be interpreted in a statistical sense for aggregated data? Where can we position the notions of “population”, “statistical variable”, “statistical measure/operator” in a DSD?

· the variables (V) should always be used as dimensions, or as elements of a measure dimension when an operator is applied to them. In statistics, they are the dimensions of a multivariate distribution;

· the statistical operator (f) always describes data measures. It is therefore useful to describe it, together with the variable to which it is applied, as a measure dimension (if not as a primary measure); in some cases the statistical operator can instead be defined as a dimension or an attribute;

· the concepts that are used to better define the statistical operator can be either dimensions or attributes, depending on their role in the definition of data;

· the population of interest (O) is usually neglected in a DSD. It can nevertheless be represented as an attribute.

The comparison between statistical characteristics and SDMX artefacts is summarized in the table below:

Statistical concepts / id / description / SDMX concepts
variables / V / phenomena investigated on the population of interest / Dimensions
statistical operator / f / statistical operators such as average, total, index number, etc. / Dimensions or Attributes
statistical operator applied to the variable and to the population of interest / f(O, V) / average of hours worked, total number of employees, index number of industrial production, etc. / Measure Dimension
characteristics of statistical operator / - / characteristics of the statistical operator such as base year, adjustment, etc. / Dimensions or Attributes
population of interest / O / reference group of elements over which the statistical variables are observed and the statistical operator is computed / Attribute (not always defined explicitly in a DSD)

Table 1 – Comparison between statistical characteristics and SDMX artefacts
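The mapping in Table 1 can be made concrete with a small Python sketch that assigns a role to each concept and uses the roles to separate the identifying part of an observation from its descriptive part. The concept names, codes and the observation value are invented for illustration. The choice of treating the adjustment as an attribute is only one of the two options that Table 1 allows.

```python
# Role assigned to each concept, following Table 1 (names are illustrative).
ROLES = {
    "SEX": "dimension",                # variable V
    "AGE_CLASS": "dimension",          # variable V
    "TIME_PERIOD": "dimension",        # time reference of the observation
    "INDICATOR": "measure_dimension",  # operator f applied to V and O
    "ADJUSTMENT": "attribute",         # characteristic of the operator
    "POPULATION": "attribute",         # population of interest O
}


def observation_key(obs):
    """Tuple of dimension values that identifies the observation."""
    identifying = ("dimension", "measure_dimension")
    return tuple(obs[c] for c, role in ROLES.items() if role in identifying)


def observation_attributes(obs):
    """Attribute values that describe, but do not identify, the observation."""
    return {c: obs[c] for c, role in ROLES.items() if role == "attribute" and c in obs}


# A made-up observation: average hours worked by women aged 25-34 in 2010.
obs = {
    "SEX": "F",
    "AGE_CLASS": "Y25-34",
    "TIME_PERIOD": "2010",
    "INDICATOR": "AVG_HOURS_WORKED",
    "ADJUSTMENT": "N",
    "POPULATION": "Employed persons",
    "value": 36.5,
}

print(observation_key(obs))
print(observation_attributes(obs))
```

Moving a concept from "attribute" to "dimension" in `ROLES` changes the key of every observation, which is why the role of each concept has to be decided before data are mapped to the DSD.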