Doc. Eurostat/ITDG/October 2005/4.1a

IT Directors Group

24 and 25 October 2005

BECH Building, 5, rue Alphonse Weicker, Luxembourg-Kirchberg

Room AMPERE

9.30 a.m. - 5.30 p.m.


Web based tools for Raw Data collection
a) Raw Data Collection on the Web with eSurvey
Swiss Federal Statistical Office


Item 4.1a of the agenda


Table of Contents:

1. Abstract
2. Purpose and Scope
3. Architecture
3.1. Actors and Roles
3.1.1. Statistician
3.1.2. Respondent
3.2. Application Layer
3.2.1. Questionnaire Manager
3.2.2. Survey Manager
3.2.3. Distribution Server
3.2.4. Web Server
3.3. Technological Layer
4. Development
5. Business View
5.1. Survey Profiles
5.2. Experiences
6. Benefits
7. Conclusion
8. Glossary
9. References


List of Figures:

Figure 1: SFSO IT Strategy
Figure 2: Application Layer (Process View)
Figure 3: Application Layer (Subsystems)
Figure 4: Technological Layer
Figure 5: Input Data Channels 2001-2003


List of Tables:

Table 1: Technological Layer - Software
Table 2: Ideal eSurvey Profile
Table 3: eSurvey Project Overview


Raw Data Collection on the Web with eSurvey

Nicki Thomas Spöcker, John Cunningham, Bertrand Loison
Swiss Federal Statistical Office
CH-2010 Neuchâtel, Switzerland
{ nicki.spoecker, john.cunningham, bertrand.loison }@bfs.admin.ch


Michel Meyer, Michael Körsgen
Federal Department of Home Affairs, IT-services center
CH-3003 Bern, Switzerland
{ michel.meyer, Michael.koersgen }@idzedi.admin.ch

1.  Abstract

Data collection by means of internet technology has become increasingly important for statisticians. In 2004, about 68% of citizens and a large majority of businesses and governmental institutions had the appropriate infrastructure in place.

It is believed that by strengthening this input channel, statistical agencies can obtain better data quality and save money; traditional paper-based surveys, for example, require scanning and recognition infrastructure and services. It should be noted, however, that not every survey profile is well suited to the online data channel.

Economy measures have forced the SFSO to improve its IT strategy. The eSurvey project implements the online data collection process using web-based forms and is an important element of the office's IT strategy. The resulting application is designed according to the principles of reusability and maximum independence from IT staff.

This paper describes the eSurvey platform and the experience of the Swiss Federal Statistical Office (SFSO).

2.  Purpose and Scope

Relieving the burden on respondents, improving the quality of incoming statistical information, and economic constraints all shape the office's business strategy. Outsourcing of IT services and operations increases dependency on IT vendors and service providers in the field of statistical computing. To align business needs with IT [SEAM], a new application architecture must focus on a combination of software suites (packaged software) and software modules designed for reusability, parameterization and independence from IT.

The primary goal of the SFSO IT strategy [SIP] is to transform its IT landscape from individual statistical applications ("stove pipes") to independent, reusable software modules. Statistical production can be expressed as a sequence of process segments: collection, transformation, analysis, upload and diffusion. The software modules provide functions covering the requirements of these process segments.

The eSurvey platform is a software module designed to support online data collection using web-based forms; it is an element within the process segment data collection (electronic survey).

The figure below illustrates the transformation from stove pipes to horizontal applications.

Figure 1: SFSO IT Strategy

3.  Architecture

This chapter highlights the architectural concept of the eSurvey platform. The platform has been built according to the principles of the SFSO IT strategy.

3.1.  Actors and Roles

Two actors can be distinguished:

  • The statistician is a skilled person from the statistical department. He creates the questionnaire, including its rules and layout definition. During operation, he is responsible for the administration and monitoring of the survey.
  • The respondent (or end user) is the person who fills out the questionnaire.

3.1.1.  Statistician

The statistician needs an application to design the form, define rules (online plausibility functions) and describe import and export transformations without having to program a single line of code.

The statistician should be free to put a survey online and to determine replication and export frequency without help from IT staff.

3.1.2.  Respondent

The end user must be able to fill out the survey forms without having to install any additional tool on his computer. The form must be simple and fast, and it must be possible to interrupt a session without losing any data already entered.

A standard web browser is all he needs. No third-party tool, viewer or plug-in is required, except Adobe Acrobat Reader if a PDF file of the questionnaire is wanted for printout.

3.2.  Application Layer

The eSurvey platform is composed of three areas separated by network boundaries. The statistician develops the eSurvey questionnaire using the eSurvey applications Survey Manager and Questionnaire Manager, then uploads the eSurvey metadata (data model, questionnaire layout, rule definitions, export/import metadata) and respondent data (authentication data and, optionally, additional data) to the distribution environment. The authentication information (user identification, password, internet address) is then sent to the respondents (to enable logon, by postal service) and the eSurvey is published to the secure web server environment. The responses can then be retransmitted in a two-step process (replicate and export).

Finally, the statistician uploads the export files to the production system for further processing (transformation, analysis, inquiries).

The figure below illustrates the most basic process steps; the numbers indicate the sequence of operations.

Figure 2: Application Layer (Process View)

From another perspective, the eSurvey platform can be described in terms of application domains (Figure 3). The statistician works within the intranet (subsystems 1 and 2). During operation of the eSurvey, raw data is continually exported to the production system, where the final statistical output is generated. The associated application domains offer the appropriate services.

The web hosting platform is represented by subsystem 3.

Finally, the internet can be seen as the respondent's application domain. A standard browser client and the internet protocol are the respondent's tools for filling in the web-based survey form; the data are protected by SSL encryption (HTTPS).

Figure 3: Application Layer (Subsystems)


The following subchapters show the basic functionality of the most important applications and servers of the eSurvey platform.

3.2.1.  Questionnaire Manager

The Questionnaire Manager is a client/server application that enables statisticians to create their specific online survey without any help from IT staff. It is a WYSIWYG graphical application that shows the questionnaire as it will be displayed on the web. The whole configuration is saved as a metadata model in XML files. This metadata describes the layout, rules, import/export definitions, validations and resources such as classifications, multilingual text, images and reports.
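The actual SFSO schema is not published in this paper, but the idea of a questionnaire described entirely by metadata can be sketched as follows. All element and attribute names below are invented for illustration; only the principle (layout, rules and multilingual resources expressed as XML, then read by generic software) corresponds to the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical eSurvey metadata fragment (invented names, not the real
# SFSO schema): one question with multilingual labels and a validation
# rule, as the Questionnaire Manager might store it.
METADATA = """
<questionnaire id="enterprise-2005" version="1.2">
  <question name="employees" type="integer">
    <label lang="de">Anzahl Beschaeftigte</label>
    <label lang="fr">Nombre d'employes</label>
    <rule expr="value &gt;= 0" message="must not be negative"/>
  </question>
</questionnaire>
"""

def load_questions(xml_text):
    """Return a list of (name, type, labels, rules) tuples."""
    root = ET.fromstring(xml_text)
    questions = []
    for q in root.findall("question"):
        labels = {l.get("lang"): l.text for l in q.findall("label")}
        rules = [r.get("expr") for r in q.findall("rule")]
        questions.append((q.get("name"), q.get("type"), labels, rules))
    return questions

questions = load_questions(METADATA)
print(questions[0][0])  # employees
```

A generic rendering engine can then build the web form from such a model alone, which is what makes the platform reusable across surveys without programming.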

The tool has a simulation function with which the online completion of the whole survey form, as well as the export of the entered data, can be tested extensively before installation on the web servers.

3.2.2.  Survey Manager

The Survey Manager, another client/server application, allows storage and retrieval of the metadata models. The XML files are maintained in a relational database as character large objects (CLOBs). This approach supports the reusability of eSurvey projects by combining the metadata model with configuration management utilities (search and retrieve, check-in, check-out, copy/paste, versioning, source and access control).

With the Survey Manager, statisticians can reuse older versions of specific eSurvey projects as blueprints for new eSurveys.
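The storage idea — XML models as CLOBs in a relational table, with check-in/check-out style versioning — can be sketched in a few lines. Table and column names here are invented; SQLite stands in for the actual relational database.

```python
import sqlite3

# Minimal sketch of the Survey Manager storage idea: XML metadata
# models kept as character large objects in a relational table, with a
# version column for check-in/check-out configuration management.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE metadata_model (
    survey_id TEXT, version INTEGER, xml_clob TEXT,
    PRIMARY KEY (survey_id, version))""")

def check_in(survey_id, xml_text):
    """Store a new version of a survey's metadata model."""
    cur = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM metadata_model "
        "WHERE survey_id = ?", (survey_id,))
    next_version = cur.fetchone()[0] + 1
    conn.execute("INSERT INTO metadata_model VALUES (?, ?, ?)",
                 (survey_id, next_version, xml_text))
    return next_version

def check_out(survey_id, version=None):
    """Retrieve a given (or the latest) version, e.g. as a blueprint."""
    if version is None:
        row = conn.execute(
            "SELECT xml_clob FROM metadata_model WHERE survey_id = ? "
            "ORDER BY version DESC LIMIT 1", (survey_id,)).fetchone()
    else:
        row = conn.execute(
            "SELECT xml_clob FROM metadata_model "
            "WHERE survey_id = ? AND version = ?",
            (survey_id, version)).fetchone()
    return row[0] if row else None

check_in("health-2004", "<questionnaire version='1'/>")
check_in("health-2004", "<questionnaire version='2'/>")
```

Checking out an older version of an existing project as the starting point for a new eSurvey is exactly the blueprint reuse described above.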

3.2.3.  Distribution Server

The Distribution Server is an intranet web application, secured by browser-based client certificates, that steers the online survey. First it imports the metadata from the Survey Manager, then the respondent credentials (username and password) as well as each respondent's initial data (from the production system). The platform then sends the imported data to the web through web services over HTTPS-secured communication. The Distribution Server also periodically queries the web environment for completed forms and retrieves the data (replication). Once the data has been replicated, the Distribution Server exports it (and can even generate images identical to scanned paper forms) and transfers it back to the production system via SFTP.

Each survey has its own configuration environment with a dedicated database.

A task scheduler, running independently for each survey, performs the replication and export tasks.
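The two-step retransmission (replicate, then export) can be sketched as below. All names are invented, and the real transports (web services over HTTPS for replication, SFTP for the export transfer) are abstracted behind plain functions; only the control flow illustrates the paper's description.

```python
import csv, io

def replicate(fetch_completed, local_store):
    """Step 1: pull completed responses from the web environment into
    the survey's dedicated local database (here: a dict)."""
    for response in fetch_completed():
        local_store[response["respondent"]] = response["answers"]
    return len(local_store)

def export(local_store):
    """Step 2: write the replicated data as a CSV export file for the
    production system."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["respondent", "employees"])
    for respondent, answers in sorted(local_store.items()):
        writer.writerow([respondent, answers["employees"]])
    return buf.getvalue()

# Simulated web-side data source for one replication cycle.
def fake_fetch():
    return [{"respondent": "R001", "answers": {"employees": 42}}]

store = {}
replicate(fake_fetch, store)
export_file = export(store)
```

In the real platform a per-survey task scheduler triggers both steps at the configured frequency; separating them means a failed transfer can be retried without re-querying the web servers.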

3.2.4.  Web Server

For stability and flexibility, each survey has its own web space and its own database. The respondent is first authenticated; the application then shows the list of respondents for which he has been delegated, and then the list of forms that can be filled out. The form is rendered according to the questionnaire, the role and the initial data of the specific respondent. After each page, the content of the form is validated and stored in the database, so the respondent can stop at any time and continue filling out the questionnaire later. At the end, a summary of all the pages is displayed, which can be printed out or saved as a PDF file to a local drive.
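The page-by-page validate-and-persist cycle, which is what makes session interruption safe, can be sketched as follows (rules and names invented for illustration):

```python
# Sketch of the web server's page handling: after each page the content
# is validated against the metadata-defined rules and persisted, so the
# respondent can interrupt and resume without losing data.
RULES = {"employees": lambda v: isinstance(v, int) and v >= 0}

def submit_page(session_store, respondent, page_data):
    """Validate one page; persist only if every field passes its rule.
    Returns the list of failing field names (empty on success)."""
    errors = [field for field, value in page_data.items()
              if field in RULES and not RULES[field](value)]
    if errors:
        return errors
    session_store.setdefault(respondent, {}).update(page_data)
    return []

store = {}
first_try = submit_page(store, "R001", {"employees": -3})   # rejected
second_try = submit_page(store, "R001", {"employees": 12})  # persisted
```

Because every accepted page is already in the database, "resuming" simply means re-rendering the form from the stored state; nothing special has to happen at interruption time.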

The respondent can use a standard web browser without any special plug-in to fill out the form.

3.3.  Technological Layer

In Figure 4 the hardware (servers, firewalls), database systems, network protocols and network boundaries are shown, along with actors and data flow.

Figure 4: Technological Layer

The main goal of the architecture was to build a flexible, scalable and loosely coupled system. This was achieved by splitting the system into three subsystems (Figure 4). Each subsystem runs independently, and each eSurvey application runs independently of all the other eSurveys.

Table 1 gives an overview of the underlying software technology.

Name / Technology / Repository
Quest. Mgr. / C/S Microsoft .NET / File System
Survey Mgr. / C/S Microsoft .NET / Relational DB, repository for layout, rules, resource, import/export definition
Production System / Any / Any, Export / Import as File (Flat, CSV, XML, Excel)
Distribution Server / Microsoft .NET Web Service / Relational DB
Web Admin / Microsoft .NET Web Application / None
Transfer Server / Microsoft .NET Web Service / Relational DB
Web Server / Microsoft .NET Web Application / Relational DB
Respondent / Web Browser / None

Table 1: Technological Layer - Software

4.  Development

The core project team at the beginning was composed of four people coming from business and IT. The business people brought considerable statistical experience together with in-depth knowledge of questionnaires and their construction, and they formulated the basic business requirements. The IT members all had several years' experience building client/server solutions and developing web applications, and they proposed technological solutions based on their practical experience. This marriage of business and IT expertise worked extremely well within the small, motivated group.

The key idea at the beginning was to concentrate on meta-modelling concepts. The team was persuaded that the only way to create a flexible, extensible application for building web-based surveys was to construct generic modules that are highly configurable through meta models. The first weeks were therefore used to identify all the aspects of an internet-based survey that could be made flexible; these were described in conceptual class models (UML). They were then combined with meta-model information covering the data schema of a questionnaire, layout, validation rules, navigational aspects, respondent data, import and export rules, transformation rules, classifications and their code values, and resource information such as images, in order to design the basics of the application.

The meta models became fairly complex, and the project team feared that the key requirement of enabling a skilled statistical end user to define all these rules through a software application could be difficult to achieve. A first prototype was therefore built by the IT side, inspired by the workbench approach used by software development environments such as Delphi and Visual Studio. The result was a visually attractive survey designer workbench called the Questionnaire Manager. This prototype was shown very early to some end users, and the feedback was so positive that it confirmed the soundness of the team's approach. In addition, the user feedback helped the team stabilise and complete the meta models.

The meta model was not realized as a concrete relational or object-oriented database design. Because of the flexibility required, the team chose to store this information in XML data files with corresponding XML schema definitions; change management was handled more easily this way. The software applications were then specified, with the accent placed on defining the interfaces between them, most of the time wrapped in XML schema definitions.
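The role of an XML schema as an interface contract between components can be illustrated with a simple structural check. A real deployment would validate against an actual XSD with a schema processor; the hand-rolled check and all names below are invented stand-ins.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for an XML-schema-based interface contract:
# verify that an exchanged export document has the required root
# element, attributes and row elements before it is accepted.
CONTRACT = {"root": "export",
            "required_attrs": ["survey", "period"],
            "row_element": "record"}

def conforms(xml_text, contract):
    root = ET.fromstring(xml_text)
    if root.tag != contract["root"]:
        return False
    if any(a not in root.attrib for a in contract["required_attrs"]):
        return False
    return all(child.tag == contract["row_element"] for child in root)

good = '<export survey="health-2004" period="2004Q4"><record/></export>'
bad = '<export survey="health-2004"><record/></export>'  # missing period
```

Agreeing on such contracts early is what allowed the components to be developed in parallel later on: each team could test against the schema instead of against the other components.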

After two months of this conceptual and prototypical work, the different application components and their interfaces became clear and were fairly well specified. A security expert joined the team, and the IT development team was brought up to six people for the realization phase. At this point it was possible to develop the components in parallel in order to stay on schedule. The business people helped a lot during realization by continually using intermediate releases to configure the first concrete projects; their feedback and experience were used to further enhance and streamline the application. This method amounted to a continuous approximation of the final solution through iterative and incremental steps. The generic software modules were validated incrementally, and the whole project team gained confidence that it was on the right path.

The team finished the first version in exactly six months, the originally planned schedule; the positive pressure of the first surveys that had to use this application toolset was essential in achieving this goal.

In summary, the main success factors were:

  • Bringing business and IT people with experience in the domain of internet surveys together in one small team.
  • Flexible and extensible meta modelling with UML.
  • A prototypical approach to master complexity from the very beginning.
  • An initial focus on specific business surveys, with specific statistical projects as “clients”, which helped to establish concrete real-world requirements.
  • Consistent incremental and iterative development, in which the solution grew continually through small steps and approximations to a first productive version.
  • Continuation of this incremental approach after the first version, so that the system was continuously delivered in intermediate releases triggered by the requirements of the increasing number of participating surveys.

The prototypical approach based on conceptual models, together with the incremental and iterative development and the immediate quality checks from the business users, had a tremendous impact on the motivation of everyone involved in this project, because they could see things evolving in the right direction. The system has now been in production for over a year, new features have been introduced, and the maturity of the environment has been proven in the 10 surveys realized with it.