Project: XML-Publishing - Implementation Strategy
Contract: Specific Contract 17101.2006.001-2006.457
Prepared by: VBE, CBO Reviewed by: MFE
Version 2.0 / Date Updated: 15/12/2006
Status: Company Approved
STIS
Statistical Information Systems / Consortium
INTRASOFT INTERNATIONAL S.A.
and
AGILIS S.A.

European Commission – EUROSTAT/B3

Framework Contract 14200/2005/007-2005/699 - Lot 1

Specific Contract 17101.2006.001-2006.457

‘XML-Publishing’

Implementation Strategy of an XML-based publishing

in Eurostat

D1.1: Analysis of the publications programme, dissemination process and data life cycle

December 2006

Document Service Data

Type of Document / Project deliverable
Reference: / Analysis XML-Publishing v2.0.doc
Issue: / 1 / Revision: / 1 / Status: / Company Approved
Created by: / Victorio Bentivogli, Christian Boudot / Date: / 15/12/2006
Distribution: / EU-Eurostat, Intrasoft International S.A.
Contract Full Title: / XML-Publishing - Implementation Strategy
Service contract number: / Specific Contract 17101.2006.001-2006.457
For Internal Use Only
Reviewed by: / Mario Fendler
Approved by:

Document Change Record

Issue/Revision / Date / Change
1.0 / 08/12/2006 / Draft document
1.1 / 15/12/2006 / Feedback integration
2.0 / 15/12/2006 / Final version


Table of contents

1 Introduction

1.1 Purpose

1.2 Scope

1.3 Structure

1.4 References

2 Background information

2.1 The Eurostat publishing system

2.1.1 High level Requirements

2.1.2 Desired Functionality

3 Analysis of the Publications Programme

3.1 Scope of the analysis work

3.2 Publication samples

3.3 General comments on the provided publications

3.4 Layouts

3.5 Multilingual publications

3.6 Versions

3.7 Components of a publication and assembling them

3.8 Selection of a representative publication for a pilot

4 Analysis of the Dissemination Process

4.1 Scope of the analysis work

4.2 Target environments

4.3 Target formats

5 Analysis of the Data Life Cycle

5.1 Scope of the analysis work

5.2 Overview of the process

5.3 General comments and requirements

5.4 Content sources

5.5 Translation

5.5.1 Support for translations and different language versions

5.5.2 Management of language versions

5.6 Assembling a publication

5.6.1 Templates

5.6.2 "preview" of a publication

5.7 Collaboration environment for multiple authors

6 Recommendations

6.1 Facts and constraints/conclusions

6.2 A technical proposal

6.2.1 Initial preview of the proposed pilot’s architecture

Table of Figures

Figure 5-1 Overall process flow

Figure 6-1 Proposed pilot architecture

1  Introduction

1.1  Purpose

This document constitutes the deliverable “D1.1 Analysis Document” and is the outcome of Task 1 “Analysis of the publications programme, dissemination process and data life cycle”.

The main objectives of this work are:

·  Analysis of the entire publication process;

·  Focus on the content creation part: definition, creation and storage of the publications in XML, based on current content creation tools (MS-Office, Word-macros) and taking into account alternative solutions (e.g. web-based editors);

·  Explore the possibilities for content validation, proof-reading and multiple language versions (translations);

·  Make first recommendations for defining an XML-based format (or selecting an existing one) for all publications, with a focus on statistical publications, where the most common element types are: data tables and corresponding graphs, text, maps and graphics (pictures, schemas…).

1.2  Scope

The overall project scope covers the tasks and sub-tasks listed below.

The current document focuses on Task 1 – Implementation Strategy, and more specifically on sub-task 1.1, which analyses the current publications programme, the current dissemination process and the related data life cycle:

·  Task 1 – Implementation Strategy

    ·  Subtask 1.1 – Analysis (publications programme, dissemination process, data life cycle)

    ·  Subtask 1.2 – Design of an XML-based workflow

·  Task 2 – XML-Schema

    ·  Subtask 2.1 – Analysis & Evaluation of existing standards

    ·  Subtask 2.2 – Design of XML-Schema(s)

·  Task 3 – XML-Content

    ·  Subtask 3.1 – Analysis of existing data bases + data formats

    ·  Subtask 3.2 – Analysis of the XML content creation process + evaluation of solutions

1.3  Structure

The structure of this document is as follows:

Section 2 introduces the XML-based 'Eurostat publishing system' (EPS), highlighting its initial high level requirements and desired functionality.

In section 3, the Publications Programme is analysed. This analysis provides an insight into the diversity of the publications and the particular requirements for producing each type. It focuses on the generation and assembly of key subparts of publications (such as tables and charts), as well as on the degree of flexibility required to define specific layouts for different kinds of publications.

In section 4, the Dissemination Process is analysed. This analysis provides information about the data formats and the constraints imposed by the receiving parties.

Section 5 presents the lifecycle of ‘data’ or content assets, showing the diverse workflows that could be followed in order to assemble a publication.

Finally, in section 6, we present our recommendations based on the findings of sections 3 – 5.

1.4  References

This document references:

Reference / Document/Resource Name / Filename
[R1] / D0.1: Minutes of the Project Kick-Off Meeting on 13. Nov. 2006 / Kickoff Meeting 2006-11-13 v1.2.doc
[R2] / Eurostat publishing system (EPS) – Working document – 06/12/2006 / 2006 12 06 XML based publication system.doc

2  Background information

2.1  The Eurostat publishing system

The XML-based 'Eurostat publishing system' (EPS) aims to automate the production of Eurostat publications with the following initial objectives:

·  to reduce production times;

·  to allow easy dissemination in multiple output formats (paper, PDF, HTML);

·  to provide support for easy collaboration between different authors;

·  to help the authors to re-use the existing content without the need to re-create it from scratch;

·  to allow for format independent archiving of all publications;

·  to allow sharing publications between NSIs;

·  to improve the consistency in layout.

Currently the publication workflow is managed by an existing system (Statpub). One of the objectives of this analysis is to explore the possible integration/cooperation between the proposed publishing solution and this existing component.

There is also an initiative to put in place an object repository that will store all the business-related content assets (manuscripts, proofs, e-mails and other documents).

2.1.1  High level Requirements

The initial requirements of the EPS (they might be expanded by the results of this analysis) are:

·  It must not require authors to understand the underlying technology (e.g. XML);

·  It must not require extensive licence management;

·  It must be user-friendly, not requiring extensive training for basic use (although training may be needed for advanced users).

2.1.2  Desired Functionality

The EPS should offer the following functionality:

·  create content fragments

·  edit existing content fragments (with versioning support)

·  support re-use of existing content fragments and content collections

·  build publications (publication = collection of content fragments)

·  support multilingual publications

·  support cooperation with translation services

·  support collaboration of multiple authors working on the same content fragment or collection of content fragments

·  manage user access rights

·  publish finished and approved collections (e.g. create a PDF for print, upload to a web site, publish in HTML)

·  archive all created and edited content
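
To make the functionality listed above more concrete, the following minimal Python sketch models a content fragment with a simple version history and a publication as an ordered collection of fragments. All class, field and identifier names (`ContentFragment`, `Publication`, `fragment_id`, `build`, etc.) are illustrative assumptions and are not part of any EPS specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContentFragment:
    """A reusable piece of content (text, table, graph, ...) with a version history."""
    fragment_id: str
    kind: str                      # assumed values: "text", "table", "graph", "map"
    language: str                  # e.g. "en", "fr", "de"
    versions: List[str] = field(default_factory=list)

    def edit(self, new_body: str) -> int:
        """Store a new version and return its 1-based version number."""
        self.versions.append(new_body)
        return len(self.versions)

    def latest(self) -> str:
        """Return the most recent version of the fragment body."""
        return self.versions[-1]


@dataclass
class Publication:
    """A publication is an ordered collection of content fragment identifiers."""
    title: str
    fragment_ids: List[str] = field(default_factory=list)

    def build(self, repository: Dict[str, ContentFragment]) -> List[str]:
        """Resolve fragment identifiers to their latest bodies, in order."""
        return [repository[fid].latest() for fid in self.fragment_ids]


# Hypothetical usage: one text fragment re-used by two publications.
intro = ContentFragment("intro-1", "text", "en")
intro.edit("Draft introduction.")
intro.edit("Revised introduction.")
repository = {intro.fragment_id: intro}

pocketbook = Publication("Pocketbook", ["intro-1"])
focus = Publication("Statistics in focus", ["intro-1"])
```

Because both publications only hold the fragment identifier, an edit to the fragment is picked up by each of them the next time they are built, which is the re-use behaviour the list above asks for.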

3  Analysis of the Publications Programme

3.1  Scope of the analysis work

In this section we review our findings on the representative samples of publications provided by Eurostat (which are part of the Publications Programme). In addition, the suitability of these samples for an XML-based production process is analysed and a set of publications is proposed for the future pilot project.

3.2  Publication samples

The representative set of samples provided by Eurostat comprises:

·  Detailed tables

·  Panorama of the EU

·  Pocketbooks

·  Statistics in focus

·  Working papers and studies

Some of the documents also include a supporting CD with electronic versions of the document and related data (other documents in PDF format, statistical data stored in spreadsheets, HTML pages, etc.).

The sample set of provided documents is completed with a DVD (Eurostat electronic library) containing electronic versions of these publications and:

·  Eurostat news

·  Methods and nomenclatures

·  Research in official statistics

The sample set of documents also includes a section of ‘Catalogues’ with promotional material.

From 1 January 2007, the set of Eurostat collections will be simplified. The collections 'Detailed tables' and 'Panorama of the EU' will be replaced by a single collection 'Statistical books'. 'Working papers and studies' and 'Methods and nomenclatures' will be replaced by a new collection 'Methodologies and working papers'. The collection 'Statistics in focus' will be complemented by 'Data in focus', offering only tables with methodological notes. The collections 'Eurostat news' and 'Research in official statistics' will be abandoned.

3.3  General comments on the provided publications

Most of the publications include a varying combination of:

·  Text

·  Tables

·  Graphs

·  Maps

·  Images

At this stage it is important to understand the technical requirements to compose and assemble the different parts of these publications in order to select tools (editors, visualisers) that could provide adequate functionality.

After an initial evaluation, we can observe that:

·  Typesetting requirements can be covered by a standard word processor (probably with the exception of covers);

·  Most of the information (tables, charts) is generated with data extracted from databases;

·  The tables found in the publications sometimes have “calculated cells”;

·  Based on the material provided, the volume of information processed is significant (production scale);

·  Each publication family follows a consistent layout. Nevertheless, there are exceptions, since the main objective is to present information in an understandable way, prioritizing readability over style; further harmonization of the layout is acceptable and desirable. A new Eurostat style guide is currently under preparation. Its aim is to define basic layout rules that should ensure a consistent look-and-feel across all Eurostat publications;

·  It is important to facilitate the assembly of publications from diverse sources since, besides tables and charts, custom graphics (for example maps) will be imported.

3.4  Layouts

The technology used for the EPS might impose some restrictions on the layout of the publications. If necessary, a modification of the default layout for a publication type may have to be considered.

3.5  Multilingual publications

As a general rule, Eurostat currently publishes only in English, French and German, either as three different language versions or as a trilingual (3A) publication. Extension to further languages should be possible in the future.

Since one of the objectives of this project is to propose an XML-based process that streamlines and facilitates the publication process, this could indirectly lead to an increase in the number of language versions produced.

3.6  Versions

The management of versions and the number of versions kept are still to be discussed. In particular, if all versions are kept, the implications and side effects should be analyzed, especially for searching within the content fragments (how will an author find the content fragment he/she is looking for?). Two possibilities are: to discard working versions once the final version is validated and approved; or to keep all versions but allow searches to be limited to final versions only.
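
The option of keeping all versions while limiting searches to final versions can be sketched as follows. This is a minimal illustration under assumed names (`FragmentVersion`, `status`, `search`) and an assumed two-state life cycle ("working" and "final"); the actual version model remains to be discussed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FragmentVersion:
    fragment_id: str
    number: int
    status: str   # assumed values: "working" or "final"
    body: str


def search(versions: List[FragmentVersion], term: str,
           final_only: bool = True) -> List[FragmentVersion]:
    """Case-insensitive substring search, limited to final versions by default."""
    term = term.lower()
    return [
        v for v in versions
        if term in v.body.lower() and (not final_only or v.status == "final")
    ]


# Hypothetical repository content.
history = [
    FragmentVersion("intro", 1, "working", "Draft text on employment"),
    FragmentVersion("intro", 2, "final", "Approved text on employment"),
    FragmentVersion("notes", 1, "working", "Notes on employment data"),
]
```

With the default `final_only=True`, authors only see approved content; passing `final_only=False` exposes the full history, for example for audit purposes.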

3.7  Components of a publication and assembling them

The content is organized into content fragments and stored in a repository (Comment from Eurostat: ‘most likely object repository developed as a part of Statpub 3.4.2’). Search and browsing in the content fragments will be supported. The fragments could be organized into structured collections.

There is an open issue regarding how to define the flow of the text in a publication composed of different content fragments. For example, if a chapter of a book consists of text, tables and graphs, the way the tables/graphs are embedded in the text needs to be defined. Furthermore, there might be cases where a reference to a specific table/graph appears in the text.
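
One way to approach this open issue is to let document order define the flow and to make in-text references explicit, so that they can be checked against the embedded fragments. The sketch below builds such a chapter with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`chapter`, `para`, `ref`, `include`, `target`, `fragment`, `kind`) are invented for illustration and do not represent a proposed schema.

```python
import xml.etree.ElementTree as ET

# Build a chapter whose flow is the document order of its children.
chapter = ET.Element("chapter", {"id": "ch1"})

para = ET.SubElement(chapter, "para")
para.text = "Quarterly figures are shown in "
ref = ET.SubElement(para, "ref", {"target": "tab-1"})   # in-text cross-reference
ref.tail = "."

# Embedded fragments are pulled in by identifier at assembly time.
ET.SubElement(chapter, "include", {"fragment": "tab-1", "kind": "table"})
ET.SubElement(chapter, "include", {"fragment": "gr-1", "kind": "graph"})


def unresolved_refs(root: ET.Element) -> list:
    """Return cross-reference targets that no included fragment satisfies."""
    included = {inc.get("fragment") for inc in root.iter("include")}
    return [r.get("target") for r in root.iter("ref")
            if r.get("target") not in included]


xml_text = ET.tostring(chapter, encoding="unicode")
```

A check such as `unresolved_refs` could run before rendering, so that a text referring to a table that is not part of the publication is detected early.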

3.8  Selection of a representative publication for a pilot

Most of the analyzed publications exhibit a similar degree of complexity and diversity of input sources. All of them also seem to be suitable in principle for XML ‘treatment’. In order to be able to reproduce the complete process in a pilot, we establish the following criteria:

·  Most of the content must come from data sources (i.e. a document whose content is mostly tables and charts); this will facilitate generation, since less custom text is required;

·  The (paper) size of the publication must conform to a standard easily handled by office printers (for example ISO A4).

Following these criteria, we propose to use a set of ‘Detailed tables’ documents (for example the Eurostatistics – Data for short-term economic analysis) in order to test the proposed solution(s).

In order to evaluate the content production in context, the pilot will take into account that the data is requested dynamically from the database at the moment of the final compilation of each particular issue of Eurostatistics (to ensure that the most recent data is published) and that charts are generated from this ‘fresh’ data by the publishing system.
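
The requirement for ‘fresh’ data implies that a table is bound to a query rather than to stored values, and that the query is executed only when the issue is compiled. The following is a minimal sketch, assuming a hypothetical `TableFragment` that holds a callable data source; a plain dictionary stands in for the database, whose real interface is not specified in this document.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass
class TableFragment:
    """A table defined by a query callable instead of stored values."""
    title: str
    fetch: Callable[[], List[Row]]   # executed only when the issue is compiled


def compile_issue(tables: List[TableFragment]) -> Dict[str, List[Row]]:
    """Resolve every table against its data source at compilation time."""
    return {t.title: t.fetch() for t in tables}


# Stand-in for a database: values change between definition and compilation.
database = {"indicator_a": 1.0}
table = TableFragment("Indicator A", lambda: [dict(database)])

database["indicator_a"] = 2.0          # data is updated before compilation
issue = compile_issue([table])
```

The compiled issue contains the value present at compilation time, not the one that existed when the table was defined; charts would be generated from the same resolved rows.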

4  Analysis of the Dissemination Process

4.1  Scope of the analysis work

In this section we review the Dissemination Process. Our findings will be used in a later step to select the technical components required to:

·  render the XML-based publications into the desired target formats;

·  propagate the output files to the selected channels in a timely manner.

4.2  Target environments

The dissemination process is quite straightforward. The publications can be disseminated to:

·  The web (Eurostat as well as other sites like Europa);

·  A printer organisation in order to create paper copies.

4.3  Target formats

The publications are disseminated in the following target formats:

·  HTML for internet publications on web sites;

·  Web optimized PDF for web downloads of the publications;

·  Print optimized PDF for the printer organization.

The current assumption is that the PDF for printing will be created directly by the printer. However, the process design will be extended to cover 'print-optimized PDFs' along with 'web-optimized PDFs'. The print- and web-optimized PDFs will usually differ (e.g. in the usage of colours: RGB versus CMYK).
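
The target formats can be thought of as rendering profiles selected at publication time. The sketch below is illustrative only: the profile names and fields are assumptions, and only the three formats and the RGB/CMYK distinction are taken from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderProfile:
    name: str
    output_format: str   # "html" or "pdf"
    colour_space: str    # "RGB" for screen, "CMYK" for print


# Hypothetical profile registry, keyed by "<channel>-<format>".
PROFILES = {
    "web-html": RenderProfile("web-html", "html", "RGB"),
    "web-pdf": RenderProfile("web-pdf", "pdf", "RGB"),
    "print-pdf": RenderProfile("print-pdf", "pdf", "CMYK"),
}


def profiles_for(channel: str) -> list:
    """Return the profiles for a dissemination channel ("web" or "print")."""
    prefix = channel + "-"
    return [p for key, p in PROFILES.items() if key.startswith(prefix)]
```

Keeping the profiles separate from the content means the same XML source can be rendered once per profile without any change to the publication itself.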