On Micro Data Linking and Data Warehousing in Statistical Production

ESSnet

On micro data linking and data warehousing
in statistical production

Coordinator: Netherlands

Partners: Estonia, Italy, Lithuania, Portugal, Sweden, UK

MINUTES
Subject / 4th Workshop of the ESSnet on Data Warehousing
Place / Date / Tallinn, March 20 – 21, 2013
Status / Final

Agenda day 1 / Wednesday 20 – 3 – 2013

1. / Opening / Harry Goossens (CBS)
2. / Introduction Quiz / Pete Brodie (ONS)
3. / Presentations WP3:
▪Recap of the S-DWH Architecture / Antonio Laureti Palma (ISTAT)
▪Modular Workflow / Allan Randlepp (Estonia)
4. / Presentation Metadata System / Maia Ennok (Estonia)
5. / Presentation Metadata Quality / Colin Bowler (ONS)
6. / Interactive session Metadata Dimensions / All, subgroups
7. / Presentation Role/Position Business Register / Pieter Vlag (CBS)
8. / Presentation Data Linking / Jurga Ruksenaite (SL)
9. / Discussion & Feedback Session / Harry / Pete
10. / Quiz nr. 2

1. Welcome / Opening / Harry Goossens

Maia Ennok opens the Workshop on behalf of Statistics Estonia. She welcomes everyone to Tallinn and hopes that the next 2 days will be productive and informative for everyone.

Harry also welcomes everyone and first wants to thank Statistics Estonia very much for organising and hosting the workshop.

A special welcome also goes to Mrs. Martina Hahn from Eurostat, Head of Unit G1 – Business Statistics.

Again it is good to see a lot of participants who were also present at the previous workshops and to welcome a substantial number of new participants. It confirms to us that the interest in Data Warehousing is still high.

After evaluating the last Workshop in Cardiff we concluded that we probably were too ambitious regarding the interactive sessions. But still we believe that it is of great importance to

Therefore we looked for a new way of working to get input from MS by using voting pads.

All presentations will be uploaded to the CROS-portal:

2. Introduction Quiz / Pete Brodie

A small introductory quiz is prepared by Pete Brodie (ONS) to explain how the voting pads work.
The main goal of using this equipment is to get quick feedback during the presentations of the various topics and to analyse it almost right away. To make answering even more attractive, a small prize is at stake for the participant who collects the most points.

Also some opinion-questions will be asked to gather input in order to prepare some other sessions.

3. Presentations WP3:

▪Recap S-DWH Architecture / Antonio Laureti Palma

▪Modular Workflow / Allan Randlepp

Antonio gives a recap of the background of the S-DWH business architecture as developed by the ESSnet.

Defining and enabling the development of a S-DWH requires a framework in which key principles and models can be created, communicated and improved.

The affected domains are explained; a S-DWH framework comprises:

▪The business domain to align strategic objectives and tactical demands through a common understanding of the organization;

▪The information architecture domain, to describe the database organization and the management of data and metadata;

▪The technology domain, i.e. the combined set of software, hardware and networks to develop and support IT services.

The layered architecture is explained using the GSBPM sub-phases covered in each layer.

Next, Allan presents the first version of the modular workflow of the S-DWH, with a specific focus on the reuse of data. Comparing the S-DWH with a traditional siloed model (‘stovepipe’), it is clear that one of the major disadvantages of the stovepipe model is that the reuse of data is very difficult.

A twofold integrated model for producing statistics is introduced.

▪Horizontal integration across statistical domains at the level of National Statistical Institutes and Eurostat. This means that European statistics are no longer produced domain by domain and source by source but in an integrated fashion, combining the individual characteristics of different domains/sources in the process of compiling statistics at an early stage, for example households or business surveys.

▪Vertical integration covering both the national and EU levels. This should be understood as the smooth and synchronized operation of information flows at national and ESS levels, free of obstacles from the sources (respondents or administration) to the final product (data or metadata). Vertical integration consists of two elements: joint structures, tools and processes, and the so-called European approach to statistics.

A question is raised on how the various versions of deliverable 3.1 relate to each other, and on what further steps are intended towards the end of the project to elaborate the documents.

Harry answers that the intention is to integrate the deliverables into a handbook and that there is a session planned for day 2. And yes, we need to be clearer on the different versions.

A second question is on the mapping of the GSBPM on the S-DWH, more specifically whether the ESSnet has feedback on the GSBPM that could require some adjustment of the GSBPM.

Antonio answers that we used the GSBPM as basis/common language and that we are now also looking at how to integrate the GSIM model.

Another question is about the possibility of exchanging data between NSIs.

In the modular workflow the vertical integration between NSIs and Eurostat was explained. Basically, this vertical integration should also cover the exchange between NSIs. But it is known that data exchange between NSIs is still a delicate topic.

There is a clear need to give more guidance in the documents regarding versions and conclusions (clear guidelines, recommendations).

4. Presentation Metadata Systems / Maia Ennok

Maia presents the background of the metadata system in the S-DWH.

A metadata layer is a conceptual term that refers to all metadata in a data warehouse, regardless of logical or physical organisation. A metadata system covers all metadata in the metadata layer, with the goal to facilitate and support the operation of the S-DWH.

A Metadata System should cover 4 functionality groups in the metadata management process:

▪Metadata creation

▪Metadata usage

▪Metadata maintenance

▪Metadata evaluation

These groups are explained using the subsets and layers.

At the end, a short practical case of iMETA from Statistics Estonia is presented.

Comments Eurostat:

So far the presentation covers the role of a metadata system, not yet how to set one up.

The project welcomes the feedback from Eurostat and will consider this need in the Handbook (deliverable 4.5).
Overall it is difficult to give one straightforward development guideline; the focus of the project is to provide a more high-level process design with specific recommendations.

5. Presentation Metadata Quality / Colin Bowler

Colin Bowler presents the work on metadata quality that has been done so far (deliverable 1.5).

First it is important to make clear that Quality Metadata (QM) is something completely different from Metadata Quality (MQ):

▪QM = a specific type of metadata describing the quality of the data

▪MQ = the quality of all types of metadata in the S-DWH

As metadata is the driver of the S-DWH, it is essential to have metadata of good quality, because only then the metadata is really usable. To ensure good metadata you need to measure the quality. Therefore the project adopted the idea of using dimensions to measure the quality of metadata (parallel to the data quality assurance).

Main recommendation of the ESSnet is to set up a quality assurance system based upon and using a set of measurable indicators, for which deliverable 1.5 gives a good basis.
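A minimal sketch of what one measurable indicator could look like, assuming a hypothetical "completeness" dimension and a hypothetical list of required metadata attributes; neither the field names nor the scoring rule are taken from deliverable 1.5.

```python
# Hypothetical required attributes of a variable's metadata record.
REQUIRED_FIELDS = ("name", "definition", "unit", "source")


def completeness(record: dict) -> float:
    """Share of required metadata fields that are filled in (0.0 to 1.0)."""
    filled = sum(1 for f in REQUIRED_FIELDS if str(record.get(f) or "").strip())
    return filled / len(REQUIRED_FIELDS)


def layer_indicator(records: list) -> float:
    """Average completeness over all metadata records of one S-DWH layer."""
    if not records:
        return 0.0
    return sum(completeness(r) for r in records) / len(records)


# Illustrative metadata records for the source layer (made-up values).
source_layer = [
    {"name": "turnover", "definition": "Net sales excl. VAT", "unit": "EUR", "source": "VAT"},
    {"name": "employees", "definition": "", "unit": "persons", "source": None},
]

print(layer_indicator(source_layer))
```

An indicator of this kind can be computed per layer and tracked over time, which keeps the measurement simple while still being quantifiable.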

Main feedback from the participants is to also make strong reference to the ESS Quality Assurance Framework (QAF). The ESSnet welcomes this feedback and will pick this up.

6. Interactive session Metadata Dimensions / Colin Bowler

After answering several questions on the usability of the dimensions, the participants are asked to discuss the importance of the proposed dimensions in the various layers and to try to work out a practical case/example.

Overall the feedback is that the dimensions can be used to measure the quality of your metadata but that it can be very complex, so try to keep it simple.

Further feedback is that, moving from the source layer to the access layer, the number of dimensions to use increases, and that once a dimension has been measured it should be kept in the following layers.

7. Presentation Role/Position of the BR / Pieter Vlag

Pieter presents the ideas on the role and position of the business register in the S-DWH.

To increase flexibility, the S-DWH needs an extended and integrated frame with 3 essential basic characteristics:

▪Number of enterprises

▪Turnover from VAT

▪Employment data from social security

The S-DWH Backbone

All other sources are linked to this integrated frame and made consistent to it.

Benefits of an integrated backbone are:

▪improving quality of integrated datasets from several input sources,
as 2 key variables for statistical output can precisely be estimated

▪reducing the impact of sampling errors and biases in estimates for derived variables,
as the 2 key variables can be used as auxiliary information when weighting

The Backbone = the authoritative source of the S-DWH.

In addition it is important to realise consistency between statistical input and output. But in daily practice, the ideal situation of using only 1 statistical unit throughout all (integrated) processes is not reality, as different sources use different units. The relationships between the various input and output units must be known/established beforehand. Therefore the Unit Base concept is introduced to integrate the necessary relations in the S-DWH.
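The idea of a unit base can be sketched as a simple lookup from input units to statistical units. This is only an illustration under stated assumptions: the identifiers, the many-to-one relation between VAT units and enterprises, and the function name are all hypothetical and not taken from the presentation.

```python
from collections import defaultdict

# Hypothetical unit base: maps each input unit (here a VAT unit) to the
# statistical unit (enterprise) of the backbone it belongs to.
unit_base = {"VAT-001": "ENT-A", "VAT-002": "ENT-A", "VAT-003": "ENT-B"}


def to_statistical_units(vat_records, unit_base):
    """Aggregate turnover reported per VAT unit to the enterprise level.

    Returns the turnover per enterprise and the list of VAT units for
    which no relation is known in the unit base.
    """
    turnover = defaultdict(float)
    unlinked = []
    for vat_id, amount in vat_records:
        enterprise = unit_base.get(vat_id)
        if enterprise is None:
            unlinked.append(vat_id)  # relation was not established beforehand
        else:
            turnover[enterprise] += amount
    return dict(turnover), unlinked


# Made-up input records: (VAT unit, reported turnover).
vat_records = [("VAT-001", 120.0), ("VAT-002", 80.0), ("VAT-003", 50.0), ("VAT-999", 10.0)]
totals, unlinked = to_statistical_units(vat_records, unit_base)
print(totals)
print(unlinked)
```

Units that cannot be linked are returned separately rather than dropped silently, since they indicate a relation that still has to be established in the unit base.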

Eurostat states that a new ESSnet project is being set up, the ESBRS (European Statistical BR system). The aim is to achieve interoperability and alignment of all BRs.

The ESSnet is aware of that and keeps close contact, but so far the focus has been on the S-DWH at national level.

A lively discussion follows on the best way to connect the BR to the S-DWH: fully integrated (inside) or separated (outside).

There is no clear opinion/conclusion on favouring 1 of these 2 options.

The vision of ESSnet DWH is clearly stated:

▪the maintenance of the S-BR should be a separate activity, outside the S-DWH

▪the S-BR is essential input to the S-DWH

▪integration via snapshots as backbone into the S-DWH, as lean as possible.

The big challenge is to deal with the feedback of correctional info, the alignment.

Overall it is up to the various NSIs which solution they prefer.

8. Presentation Data Linking / Nadezda Fursova

Nadezda Fursova presents the vision on the handling of data linking in the S-DWH. First a short explanation of the various types of linking is given, followed by the main conclusions:

▪Data linking in an integrated S-DWH solution does not differ from the stand-alone solutions;

▪No specific methods of data linking are necessary in an integrated S-DWH solution;

▪However it does require a very good study on how and when the data linking should be processed.

9. Discussion & Feedback Session / Plenary

10. Closing Quiz / Pete Brodie

As there had already been a lot of lively discussion on the topics themselves, we ran a bit out of time and there was not much need for more discussion. As a feedback session is planned at the end of day 2, it was decided to skip the last 2 items and close day one at 17.05.

Agenda day 2 / Thursday 21 – 3 – 2013

1. / Opening + 2nd round quiz / Harry Goossens, Pete Brodie
2. / Presentation Selective Editing / Hannah Finselbach (ONS)
3. / Presentation Outliers / Garry Brown (ONS)
4. / ISTAT case study / Mauro Maselli
5. / Interactive session:
▪Output of the Handbook / Lars Goran Lundell (Sweden)
▪Targeted Users / Harry Goossens
6. / Confidentiality / Pete Brodie
7. / Interactive session; Feedback questions / Plenary

1. Opening day + 2nd round Quiz / Harry Goossens, Pete Brodie

Harry welcomes everyone back and briefly presents the programme of day 2.

Pete then starts with the second round of the quiz using the voting pads. Special thanks to Pedro who prepared questions on Tallinn.

2. Presentation Selective Editing / Hannah Finselbach, Orietta Luzzi

Hannah presents the work on deliverable 2.6 on selective editing.

The study on the selective editing options in a S-DWH context raised several questions, for which we would like to get input from member states based on their daily practice. Main goal is to get more insight in the situation in other NSIs and the vision/ideas on specific topics, with the possibility to contact them for further elaboration.

Editing is traditionally time-consuming and expensive. Selective editing therefore prioritises records using a score function that expresses the impact of their potential errors on estimates. This score should consist of a risk (suspicion) component and an influence (potential impact) component. In a S-DWH context more efficient editing processes are desired.
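A minimal sketch of such a score function, assuming one common form in which the score is the product of a suspicion term and an influence term. The exact formulas, the anticipated values, the threshold and all names below are assumptions for illustration, not the method of deliverable 2.6.

```python
def selective_editing_score(reported, expected, weight, estimated_total):
    """Score = risk (suspicion) x influence (potential impact) for one record.

    Assumed definitions:
      risk      = relative deviation from the expected value, capped at 1
      influence = weighted absolute deviation as a share of the estimated total
    """
    deviation = abs(reported - expected)
    risk = min(1.0, deviation / max(abs(expected), 1e-9))
    influence = weight * deviation / estimated_total
    return risk * influence


def records_to_review(records, estimated_total, threshold):
    """Return (unit id, score) pairs above the threshold, highest score first."""
    scored = [
        (unit, selective_editing_score(rep, exp, w, estimated_total))
        for unit, rep, exp, w in records
    ]
    flagged = [(unit, s) for unit, s in scored if s > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Made-up records: (unit id, reported value, expected value, weight).
records = [("A", 1000.0, 100.0, 1.0), ("B", 105.0, 100.0, 10.0), ("C", 300.0, 100.0, 5.0)]
flagged = records_to_review(records, estimated_total=10000.0, threshold=0.05)
print([unit for unit, _ in flagged])
```

Note that the weight enters the influence term directly, which is exactly why the questions below on the meaning of the weight in a S-DWH matter: a different use of the same data implies a different weight and hence a different ranking.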

The main methodological issues of selective editing in a S-DWH are identified:

▪How meaningful is the weight in a S-DWH:

Several sets of weights needed, tailored for different uses?

▪Selective editing ‘without a purpose’:

Importance of the weight for all potential uses?

Alternative editing approach needed?

▪Scores to compare data sources:

Score functions, discrepancies or auto corrections?

▪Selective editing of admin data:

Manual intervention?

The tools identified in use at the NSIs of Italy, Sweden, Australia and the Netherlands are presented, and the participants' knowledge and/or use of them is inventoried.

Finally, 2 experimental studies from ISTAT and ONS are presented, together with a first idea of the metadata requirements.

3. Presentation Outliers / Gary Brown

Garry presents the deliverable 2.8 on outlier detection in the S-DWH.

First various definitions of outliers and errors are compared:

An outlier is defined in the Eurostat “Statistics Explained Glossary” and the Organisation for Economic Co-operation and Development (OECD) “Glossary of Statistical Terms” as:

“A data value that lies in the tail of the statistical distribution of a set of data values”

The UK Office for National Statistics “ONS Glossary” defines an outlier as:

“A correct response, usually an extreme value isolated from the bulk of the responses,
or has a large sample weight that would have an undue influence on the estimate”

These two definitions illustrate that whether a data value is an outlier depends on the context.

In addition it is vital to make a clear distinction between outliers and errors. Errors are erroneous values, and are dealt with via editing rules. The distinguishing criterion is the requirement ‘to correct’: errors are assumed to be incorrect, outliers are assumed to be correct.

In the context of a data warehouse three types of outliers are explained:

▪outliers in survey data

▪outliers in administrative data

▪outliers in modelling.

So far, 3 recommendations are defined:

▪Neither data units nor their entries in a data warehouse should be labelled as outliers.

▪Identification and treatment of outliers should be unique to each instance data are used.

▪Metadata on outliers should only be included in a data warehouse alongside outputs.
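The first two recommendations can be illustrated with a small sketch in which outliers are identified per use and never stored as a label on the warehouse records. The detection rule used here (values outside 1.5 times the interquartile range) is a swapped-in, generic technique chosen for brevity, and all unit identifiers and values are made up; it is not the method of deliverable 2.8.

```python
import statistics


def outliers_for_use(values, k=1.5):
    """Return the ids of units that are outliers within this particular use.

    `values` maps unit id -> value for the units included in one output.
    The warehouse data themselves are not modified or labelled.
    """
    q1, _, q3 = statistics.quantiles(values.values(), n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return {unit for unit, v in values.items() if v < low or v > high}


# Use 1: an output for a hypothetical domain of small units.
small_units = {"u1": 10, "u2": 12, "u3": 11, "u4": 13, "u5": 12, "u6": 11, "u7": 95}

# Use 2: an output covering the same units plus a group of larger ones.
all_units = dict(small_units, u8=90, u9=100, u10=85, u11=110, u12=97)

print(outliers_for_use(small_units))
print(outliers_for_use(all_units))
```

The same unit can be an outlier in one use and unremarkable in another, which is why a fixed outlier label in the warehouse would be misleading.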

The participants were asked to give their opinion on these recommendations, followed by a discussion.
The result is good input to finalise deliverable 2.8.

4. ISTAT Case study / Mauro Maselli

Mr. Mauro Masselli presents an Istat case study on considerations for developing a S-DWH for SBS estimates.

5. Interactive Session

▪Output of the Handbook / Lars Goran Lundell

▪Targeted Users / Harry Goossens

So far we have worked on the deliverables as scoped and presented. In previous workshops the idea was brought up to develop a handbook. In our perception, the handbook is a well-structured combination of the deliverables. To find out what your idea of a handbook is, we would like you to give us feedback on the following questions:

▪What questions should (roughly) be answered by the S-DWH handbook?

▪How do you want to use the handbook?

▪What are the targeted users and in which role?
By users we mean users of both the handbook and the S-DWH.

▪Do we actually need to ‘compose’ a handbook or is it sufficient to logically structure the deliverables?

Feedback Eurostat: See if it is possible to use the Workflow as guideline/roadmap for the handbook.

6. Confidentiality / Pete Brodie

Pete gives a short recap on what has been done so far and where we stand now. Main issue is to avoid doing work that has already been done in other ESS projects, as there are quite a lot of them.

Main issues indicated to cover:

▪Complex secondary suppression

▪Consistency in outputs: controlled rounding may be a problem

▪Timing aspects

▪Record swapping

▪Usability of TAU-argus

▪Access to S-DWH (external users)

The case study of Lithuania shows methods for securing several types of data (micro, macro, tabular).

7. Feedback session on the WS / Pete Brodie

At the end of the workshop Harry thanks everyone for coming and for their very active participation.

As the last activity there is a feedback session to evaluate how the participants experienced the Workshop.

Overall the workshop was found very useful. The voting pads were also considered a very good tool to gather quick and easy feedback.

The workshop was closed with the handing over of the prize to the best quiz participant.

ESSnet DWH – Minutes Workshop Tallinn, March 2013