DIME/ITDG Joint Plenary February 2015/04/EN

JOINT DIME/ITDG PLENARY

24 FEBRUARY 2015

Item 4 of the agenda

ESS.VIP Data Warehouse Business Case


ESS DWH Business Case

Eurostat

Methodology and Corporate architecture

Business Case

ESS DWH

Date: 22/01/2015

Doc. Version: 0.71

PM² Template v2.1.2 (Dec. 2013)


Document Version 1.00 dated 20/11/2007


Document Control Information

Settings / Value
Document Title: / ESS.VIP DWH Business Case
Project Title: / ESS DWH
Document Author: / Denis GROFILS
Project Owner: / Roberto BARCELLAN
Project Manager: / Denis GROFILS
Doc. Version: / 0.71
Sensitivity: / Public
Date: / 22/01/2015

Document Approver(s) and Reviewer(s):

NOTE: All Approvers are required. Records of each approver must be maintained. All Reviewers in the list are considered required unless explicitly listed as Optional.

Name / Role / Action / Date

Document history:

The Document Author is authorized to make the following types of changes to the document without requiring that the document be re-approved:

  • Editorial, formatting, and spelling
  • Clarification

To request a change to this document, contact the Document Author or Owner.

Changes to this document are summarized in the following table in reverse chronological order (latest version first).

Revision / Date / Created by / Short Description of Changes
0.5 / 08/12/2014 / Denis GROFILS / Initial draft
0.7 / 15/01/2015 / Denis GROFILS / Integration feedback Eurostat EA CoE DWH

Configuration Management: Document Location

The latest version of this controlled document is stored in <location>.

TABLE OF CONTENTS

1 Project Initiation Request Information
2 Context
2.1 Situation Description and Urgency
2.2 Situation Impact
2.2.1 Impact on Processes and the Organization
2.2.2 Impact on Stakeholders and Users
2.3 Interrelations and Interdependencies
3 Expected Outcomes
4 Possible Alternatives
4.1 Alternative A: Do nothing
4.2 Alternative B: SDMX-based federated DWH
4.3 Alternative C: Dissemination DWH re-using OECD.Stat components
4.4 Alternative D: ESS DWH with standard ETL services
5 Solution Description
5.1 Legal Basis
5.2 Benefits
5.3 Success Criteria
5.4 Scope
5.5 Solution Impact
5.6 Deliverables
5.6.1 Plan & design
5.6.2 Build
5.6.3 Implement
5.6.4 Use and maintain
5.7 Assumptions
5.8 Constraints
5.9 Risks
5.10 Costs, Effort and Funding Source
5.11 Roadmap
5.12 Synergies and Interdependencies
6 Governance
6.1 Project Owner (PO)
6.2 Solution Provider (SP)
6.3 Approving Authority
Appendix 1: Compliance with the ESS Vision 2020
Appendix 2: Impact assessment

1 Project Initiation Request Information

Project Title: / ESS DWH
Initiator: / Roberto BARCELLAN / DG / Unit: / Eurostat, Unit B1 Methodology and corporate architecture
Date of Request: / 22/01/2015 / Target Delivery Date: / 31/12/2019
Type of Delivery: / ☐ In-house ☐ Outsourced ☒ Mix ☐ Not-known

2 Context

2.1 Situation Description and Urgency

In order to satisfy divergent and ever-changing user needs at both national and European level, the ESS Vision 2020[1] foresees the implementation of an EU Data Pool based on a solid data warehouse approach. The Data Pool supports the ESS’ move towards an open data philosophy and the promotion of re-use of ESS data. In this pool, data is already processed and mutually linked, can be supplemented with additional data, and may serve as a basis to quickly assemble new outputs compiled by statisticians or by external users. The idea is to have an environment where ESS members (NSIs and Eurostat) can publish and consume data and from where exploitation of statistical datasets can start (e.g. presentation as a particular product or re-use in the production of a new dataset). In this view an ESS Data Warehouse (DWH) acquires and integrates output data from ESS providers and other sources and makes them available as a Data Pool which is exposed in all relevant ways: in particular to flexible dissemination services used for presenting the data to end-users (e.g. tables and visualisation components), to Open Data portals (EU and national) and in machine-readable formats (through public APIs) offered to power users. ESS Vision 2020 projects supporting dissemination and communication of European statistics and new data sources will rely on the ESS.VIP DWH cross-cutting project as an important infrastructural counterpart.

Currently, ESS-level infrastructure supporting information resources management capabilities is focused on Eurostat production, i.e. it concerns data and services needed by Eurostat to produce its output but is not really intended to serve other ESS members outside the EU statistics context. In the future, information is expected to be managed more as a shared organizational resource at the level of the ESS community. A move in this direction will be supported by creating common platforms for data storage relying on a data warehouse approach to data management, providing reusable components of a platform for output data pooling and access. In this view an ESS DWH consists of storages and a common access layer based on a shared information model, providing greater data and metadata visibility and usability. A shared/replicated/interoperable platform will allow sharing the investment in the data pooling platform while still allowing flexibility and performance. Some members could share aspects of the platform or the whole platform, whereas others will implement it as a replicated solution.

In recent years, intense methodological work in the field of statistical data warehousing has been carried out by 2 ESSnets, which handed over their results to an ESS Centre of Excellence on Data Warehousing[2] (CoE DWH). Some of the most important results of this work include recommendations on a generic DWH architecture for official statistics and an overall handbook for the setup of a statistical DWH in NSIs. According to the latest reports of the CoE DWH, the level of implementation of these recommendations remains quite low among ESS partners. A reason for this is that the developments produced, although acknowledged as extremely valuable, remain at a relatively theoretical level and leave a lot of space for concrete developments supporting DWH implementations in statistical organisations. In this respect, huge opportunities are offered by on-going evolutions of statistical information systems towards Service-Oriented Architectures (SOA) and in particular by the ambition of adopting a Common Statistical Production Architecture[3] (CSPA) at the level of the global statistical community. In this context, implementation of the ESSnet DWH results could essentially mean contributing to the definition, specification and deployment of interoperable, replicable or shareable data warehousing services (e.g. standardized data persistence services, data integration services, temporal data management services, etc.).

The ESS.VIP DWH project will de facto set foundations on which a broader concept of ESS DWH could be implemented in the future. In this broader sense the DWH is seen as more comprehensively supporting ESS statistical production as an alternative to the traditional stovepipe model (aka "data warehouse approach to statistical production"[4]). Nevertheless, support for the integration of end-to-end production processes is not in the scope of this project.

This project supports the ESS Vision 2020 priority areas of "dissemination", "new data sources" and "efficient and robust statistical processes". It implements recommendations of the CoE DWH for the establishment of a DWH data repository and services as an infrastructural building block supporting the EU Data Pool. The project also provides a basis on which an extended concept of ESS DWH could be developed in the future.

2.2 Situation Impact

2.2.1 Impact on Processes and the Organization

According to the European Commission process cartography, this project affects process category "Statistics Management (Analyses, Databases, Statistics)".

The reference process model for official statistics is the Generic Statistical Business Process Model[5] (GSBPM). Existing processes will not be impacted by the project: the ESS DWH will constitute a possible addition to current dissemination sub-processes, in which disseminated data of NSIs and Eurostat will be further integrated in an ESS DWH that also facilitates the link with other public data sources. An ESS DWH will touch the "Disseminate" phase of the GSBPM and provide new possibilities in the "Collect" and "Process" phases through the sharing of Extract-Transform-Load (ETL) service components and the provision of additional data sources to ESS partners.

Overview of GSBPM phases

Overview of GSBPM sub-processes

The following GSBPM sub-processes are in scope of this project:

  • Update output systems: Write data/metadata to the ESS DWH so that outputs are made available through data access services based on standardized APIs. In the ESS DWH this means making data and metadata available to an access layer constituted by the EU Data Pool.
  • Integrate data: Combine data from various sources, including transition to common data and metadata models and semantic linking of integrated sources based on standardized interfaces.
  • Set up, Run & Finalize collection: Support design and execution of sub-processes necessary to extract information from heterogeneous sources. This would cover the acquisition of data and metadata in a wide range of formats (e.g. CSV, XML) using a wide range of process patterns (e.g. data virtualization techniques, schedule-driven or event-driven processes, push or pull modes, …).
  • Quality & Metadata management: Includes for example intermediary validations and production of process metrics, or support for traceability and reproducibility of information inside the ESS DWH.

In terms of services, the coverage of the GSBPM described here can be mapped to the breakdown of ETL activities in typical DWH practice. Data made accessible by ESS users will provide new sources usable in the dissemination and collection sub-processes of ESS members. It should be noted that the ESS DWH concerns the level of data exchange between NSIs, Eurostat and possibly other data providers and follows regular production processes covered by the GSBPM.
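As an illustration of the collection and integration sub-processes listed above, the following minimal Python sketch extracts observations from a CSV source and an XML source and loads them into one keyed store. It is only a sketch under stated assumptions: the `Observation` record, the CSV header, the `<obs>` element layout and all sample values are invented placeholders, not the ESS information model or any actual ESS DWH interface.

```python
import csv
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Hypothetical common record shape (an assumption, not an ESS model)."""
    indicator: str
    geo: str
    period: str
    value: float


def extract_csv(text: str) -> list[Observation]:
    """Extract observations from CSV text with header indicator,geo,period,value."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        Observation(row["indicator"], row["geo"], row["period"], float(row["value"]))
        for row in reader
    ]


def extract_xml(text: str) -> list[Observation]:
    """Extract observations from <obs .../> elements (an invented layout)."""
    root = ET.fromstring(text)
    return [
        Observation(o.get("indicator"), o.get("geo"), o.get("period"), float(o.get("value")))
        for o in root.iter("obs")
    ]


def integrate(*batches: list[Observation]) -> dict[tuple[str, str, str], float]:
    """Load all batches into one keyed store; later batches overwrite earlier ones."""
    store: dict[tuple[str, str, str], float] = {}
    for batch in batches:
        for obs in batch:
            store[(obs.indicator, obs.geo, obs.period)] = obs.value
    return store


# Placeholder inputs: indicator code, area codes and values are made up.
CSV_SAMPLE = "indicator,geo,period,value\nX1,AA,2014,1.0\nX1,BB,2014,2.0\n"
XML_SAMPLE = '<data><obs indicator="X1" geo="CC" period="2014" value="3.0"/></data>'

pool = integrate(extract_csv(CSV_SAMPLE), extract_xml(XML_SAMPLE))
```

Each source format gets its own extractor, while integration only sees the common record shape; this mirrors the idea of moving heterogeneous inputs to a common data model before loading.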

2.2.2 Impact on Stakeholders and Users

The ESS DWH is expected to have the following impact:

  • For ESS users:
      • The ESS DWH is an infrastructural and functional building block supporting the EU Data Pool that should provide a single point of access to its content in open machine-readable formats;
      • Greater ability to access and combine available datasets with each other and with other data sources integrated in the DWH by ESS members.
  • For ESS producers:
      • ESS DWH as a new dissemination and reporting channel for EU and other data, through which connection to the EU Data Pool, new dissemination services and open data portals will be streamlined;
      • Access to new data sources in a machine-readable way that can be injected in their production cycles, allowing new types of analysis;
      • Access to implementation guidelines based on and extending recommendations of the CoE DWH supporting local DWH implementations;
      • Access by individual ESS members to shared data access and integration service interfaces and instances developed for the operation of the ESS DWH.

In the end, the ESS DWH approach should help statistical information systems to achieve the overall goals of reusability, flexibility and efficiency, and to follow the software design and software architecture design patterns of a Service-Oriented Architecture (SOA). In line with the ESS Enterprise Architecture framework, the ESS DWH design will respect NSI legacy systems, investment programmes and the principle of subsidiarity, in particular by:

  • Enabling a gradual or selective (“opt-in, opt-out”) adoption of (selected components of) the to-be state architecture by the ESS members;
  • Enabling gradual investments in the ESS architecture, broken down into value-creating elements;
  • Enabling ESS members to utilize investments by providing different scenarios for how individual NSIs will implement the Vision;
  • Maximizing the sharing of investments in IT solutions while keeping them sufficiently flexible to match the different investment cycles of ESS members.

2.3 Interrelations and Interdependencies

  • The ESS Enterprise Architecture Reference Framework (ESS EARF) defines an Enterprise Architecture (EA) in the context of the ESS and provides principles and guidelines, and indicates good practices, for creating and using the architecture description in the context of ESS Vision implementation projects at both European and national levels. The ESS DWH project builds on the ESS EARF, which provides a list of key standards ensuring smooth integration of the future architecture building blocks. It will contribute to the development of key capabilities in the domain of data and metadata management, to the development of ESS metadata management components and to the implementation of a service-based configurable production platform.
  • The CoE DWH started in October 2013 as a virtual body continuing the work of the ESSnet project on Data Warehousing to ensure sustainability of the results and the acquired expertise. One of the tasks of the CoE DWH is to provide ad hoc support, consultancy and/or expert reports on request of ESS members. The CoE DWH will support this project in several ways:
      • Provision of expert opinion and support to the development of an architecture for an ESS DWH;
      • Being a channel for involvement of NSIs in the project;
      • Promoting the project at ESS level.

This project could lead to an extension of the role of the CoE DWH, for example to include a governance board for the EU Data Pool and the underlying DWH that clears the data to go into the DWH containers. A new role could also cover direct involvement in the operation of the ESS DWH.

  • ESS Vision 2020 cross-cutting projects aim at the development of a set of key building blocks of the future architecture and infrastructure supporting the implementation of the ESS Vision 2020. Integration of the ESS DWH project with other cross-cutting projects is a key element of the approach, as these projects represent the various infrastructural facets of the same to-be state:
      • European Statistical Data Exchange Network (ESDEN): Will target the uplift of the ESS network infrastructure, ensuring continuity of systems and reduced migration costs. The mission of the network infrastructure is, and will remain, to provide value-added services to exchange statistical data and metadata among ESS partners at reasonable cost, with high agility, high security and continuity. The operation of an ESS DWH implies data and metadata exchange for which a network infrastructure constitutes the backbone.
      • Information Models and Standards (IMS): Will allow each activity of the statistical production process to be supported unambiguously by relevant metadata based on standards reaching a consensus at ESS and European Commission levels and aligned with statistical industry and other standards. Metadata information systems and their underlying models play a fundamental role in a DWH environment. The DWH project will provide requirements to the IMS project, notably related to the need for fully active structural and process metadata and for metadata reporting to end-users. Questions related to metadata-rich data exchange, interoperability of information systems and requirements for the dissemination of linked open data are also expected to find support in the ESS.VIP IMS cross-cutting project. Another point for which support from ESS.VIP IMS is expected concerns DWH-specific metadata, in particular the consistency and integration of statistical metadata and DWH metadata.
      • Shared Services (SERV): Will aim at putting in place a SOA and related governance allowing for sharing of services across processes and partners, and at issuing principles and guidelines for the development of services facilitating their integration in the future target architecture. A set of services will be developed and deployed in the context of the ESS.VIP DWH. This concerns DWH input/output services and more generally ETL services used for the integration of a wide range of data. ESS DWH services will be designed in the context of a SOA and will implement recommendations from the CSPA. The topic will be dealt with based on inputs from the ESS.VIP SERV.
      • Common Data Validation Policy (VALIDATION): Aims at deploying a coherent data and metadata validation policy in the different statistical domains and a standardised validation language. Integration of data from ESS members in a common DWH requires sufficient quality of data and metadata for proper functioning. Furthermore, if all ESS members are allowed to publish data to an ESS DWH, quality has to be monitored transparently for all published outputs. In this area the ESS.VIP Validation is expected to provide services allowing specific quality controls to be embedded in ETL flows and enabling additional quality controls that ensure proper functioning of the ESS DWH.
      • Big data and DWH 2.0[6]: Integration of big data in the ESS source-mix is a hot topic, and an extension of classical ETL solutions to provide big data extraction, transformation and loading between big data and traditional data management platforms is an issue that will need to be tackled in collaboration with the ESS Task Force on Big Data (TFBD) and other relevant initiatives. Integration solutions for the loading and conversion of structured and unstructured data, as well as the opportunities offered by data virtualization techniques (e.g. virtual DWH), are challenging the architectural paradigm of modern DWHs. Evolutions of the notion of data life cycle in the DWH, inclusion of less structured textual information and issues linked to data-metadata integration are at the heart of the concept of DWH 2.0. Although support for big data and implementation of DWH 2.0 approaches are clearly not the focus of this ESS.VIP DWH project, these trends cannot be ignored in the design of a modern DWH environment.
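The idea of embedding quality controls in ETL flows, mentioned above for the VALIDATION project, can be pictured with a small Python sketch. It is purely illustrative: the rule names, the record layout and the sample batch are hypothetical, and it does not represent the standardised validation language or any service to be delivered by ESS.VIP Validation.

```python
from typing import Callable

Record = dict
Rule = Callable[[Record], bool]

# Hypothetical validation rules; each returns True when the record passes.
RULES: dict[str, Rule] = {
    "geo_present": lambda r: bool(r.get("geo")),
    "value_is_number": lambda r: isinstance(r.get("value"), (int, float)),
    "value_non_negative": lambda r: isinstance(r.get("value"), (int, float))
    and r["value"] >= 0,
}


def validate(records: list[Record], rules: dict[str, Rule] = RULES):
    """Split a batch into accepted records and a report of failed rules.

    The report lists (record index, names of failed rules) so that quality
    can be monitored transparently for every published batch.
    """
    accepted, report = [], []
    for index, record in enumerate(records):
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            report.append((index, failed))
        else:
            accepted.append(record)
    return accepted, report


# Made-up batch: one valid record, one without an area code, one negative value.
batch = [
    {"geo": "AA", "value": 1.0},
    {"geo": "", "value": 2.0},
    {"geo": "BB", "value": -1},
]
accepted, report = validate(batch)
```

Only accepted records would continue to the load step, while the report can feed process metrics of the kind referred to under quality and metadata management.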

3 Expected Outcomes

A DWH is the most appropriate solution for combining data extracted from various sources (including external sources) in a single architecture. It is an enabler for data analysis, making it possible to record and distinguish time-variant versions of the same data point or to select the appropriate level of data grain depending on the business needs. Common infrastructures might be the most efficient way of capturing, storing and sharing data in standardized forms.
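To make the notion of time-variant versions of the same data point concrete, here is a minimal Python sketch of a versioned store that answers "as of" queries. The class name, the key structure and the sample values are assumptions made for illustration; the sketch does not describe the temporal data management design of the ESS DWH.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date


class VersionedStore:
    """Keeps every recorded version of a data point instead of overwriting it."""

    def __init__(self):
        # key -> list of (valid_from, value), kept sorted by valid_from
        self._versions = defaultdict(list)

    def record(self, key, valid_from: date, value: float) -> None:
        """Add a new version of the data point identified by key."""
        versions = self._versions[key]
        versions.append((valid_from, value))
        versions.sort(key=lambda version: version[0])

    def as_of(self, key, when: date):
        """Return the value that was valid on the given date, or None."""
        versions = self._versions.get(key, [])
        dates = [valid_from for valid_from, _ in versions]
        position = bisect_right(dates, when)
        return versions[position - 1][1] if position else None


# Made-up data point: first release, then a later revision.
store = VersionedStore()
key = ("X1", "AA", "2014")
store.record(key, date(2015, 1, 10), 1.0)
store.record(key, date(2015, 3, 1), 1.5)
```

Querying `store.as_of(key, date(2015, 2, 1))` returns the first release, while a query dated on or after 1 March 2015 returns the revision, so both versions remain distinguishable.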

The release of an EU Data Pool to different user segments, including through standardized data access services that allow navigation, reuse and visualisation, is enabled by implementing an ESS DWH architecture that consolidates, integrates and serves EU data through a single access point. Such an approach will address the diversity of needs by offering all available statistical data, which means optimal (re-)use of statistical data from a user’s point of view. The ESS DWH will define and implement a modular reference data pooling platform and standard services that allow interoperability among different organizations, involving the exchange/sharing of data and metadata between their respective information systems and addressing specific issues of data warehousing including data storage, data integration and a common IT network for data exchange. This approach supports standardisation of software and methods, and enables sharing of services and interfaces for data and metadata management. In the medium term, more integrated access to European statistics should result in an improvement of national and international data/metadata management processes. Data will be easier to reuse (“collect once, use many times”) and more efficient governance and maintenance will be enabled. Over time, a reduction of data integration costs in European statistics is expected.