ARCAM: reengineering of Admin Data acquisition

Guido Drovandi [1], Paolo Giacomi,

Maura Giacummo, Eleonora Sibilio

Keywords: Admin Data Collection, Registries, Data Transmission Monitoring

1. Introduction

Admin Data management at the Italian National Institute of Statistics (Istat) has been innovating rapidly in recent years, thanks to the progressive modification of the statistical production system.

The overall Istat modernisation process, launched in recent months, focuses on the new System of Integrated Registries, which is mainly based on administrative data [1,2]. This new process also has technical consequences: administrative data, for example, must be acquired under a security policy and must arrive in time to permit the linkage and dissemination processes. Hence the need emerged to design engineered processes that automate and manage the acquisition and distribution of administrative data. ARCAM provides an interface for the Admin Data holder that guarantees the security of the data transmission process from public and private institutions, through the Istat standard technologies. The strengths of the engineered Admin Data transfer process are:

  • different ways of data transmission are possible to meet the suppliers' needs;
  • a centralized repository is provided to better ensure compliance with the legislation on data treatment;
  • the ARCAM database allows the monitoring of the acquisition process;
  • timeliness and efficiency in acquiring Admin Data are optimized.

Suppliers can send data using a web application (HTTPS in the following), SFTP or a web service, according to their own needs. Usually HTTPS is used for small or medium data sets, while suppliers of large volumes of data prefer the SFTP channel. Admin Data remain in a temporary storage area so that the supplier can still modify them. Once a data set is promoted into the permanent repository, changes are no longer allowed, but the internal stakeholder is able to retrieve it (Figure 1).

Figure 1

ARCAM is one of the three IT tools that support Admin Data management; it covers the data acquisition step. After this step, administrative data sets can enter the standardized loading process called System of Integrated Microdata (SIM) [3] or can be disseminated to internal users through a web interface.

2. Methods

In this section, the data collection process (acquisition and monitoring) is introduced with a focus on the HTTPS channel and system features.

2.1. Data transmission through the HTTPS channel

The system offers suppliers the possibility to transmit Admin Data using a simple web application. This application allows a user to send files regardless of their size, in multiple working sessions, ensuring confidentiality and integrity.

Before the actual data transmission, the user must fill in some information in the ARCAM data set uploading form; then the user can start the transmission of any number of files of two different types: data files and documentation files. From a technological point of view, these two types are handled in the same way.

When a user selects a file to be transmitted, the system splits it into chunks of fixed size and sends each chunk to the server sequentially (from the first to the last). When an exception occurs (e.g. network fault, browser crash) or the user decides to close the session, the chunks that have fully arrived are already stored in the temporary storage area, so the transmission can be continued from the last stored chunk.
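The chunked, resumable transfer described above can be sketched as follows. This is a minimal illustration in Python rather than the browser code actually used by ARCAM; the chunk size, the `TemporaryStorage` class and the `fail_at` parameter are hypothetical names introduced only for the example.

```python
from typing import Iterator, List, Optional

CHUNK_SIZE = 4  # bytes; hypothetical value, kept tiny for illustration


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield fixed-size chunks; only the last one may be shorter."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


class TemporaryStorage:
    """Hypothetical stand-in for the server-side temporary storage area."""

    def __init__(self) -> None:
        self.chunks: List[bytes] = []

    def store(self, index: int, chunk: bytes) -> None:
        # Chunks are accepted strictly in order, from the first to the last.
        if index != len(self.chunks):
            raise ValueError(f"expected chunk {len(self.chunks)}, got {index}")
        self.chunks.append(chunk)


def upload(data: bytes, storage: TemporaryStorage,
           fail_at: Optional[int] = None) -> None:
    """Send the chunks that are not stored yet; optionally simulate a fault."""
    start = len(storage.chunks)  # resume right after the last stored chunk
    for index, chunk in enumerate(split_into_chunks(data)):
        if index < start:
            continue  # already on the server from a previous session
        if fail_at is not None and index == fail_at:
            raise ConnectionError("simulated network fault")
        storage.store(index, chunk)


# First session is interrupted, second session resumes from the stored chunks.
payload = b"administrative-data"
storage = TemporaryStorage()
try:
    upload(payload, storage, fail_at=2)
except ConnectionError:
    pass
upload(payload, storage)
```

Because the server only accepts the next expected chunk, an interrupted session never leaves a gap: whatever is stored is always a prefix of the file.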

To continue the transmission, a user must select the same partially uploaded file (the system suggests its name); then a coherence test is performed: if n chunks have already been uploaded, a hash of the first n chunks of the selected file is computed in the browser and sent to the server, which checks whether it matches the hash computed server side. If the check succeeds, the system resumes sending chunks after the last fully uploaded one. At the end of a file transmission, the hash of the whole file is checked again to ensure its integrity.
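A minimal sketch of the coherence test, assuming the hash is taken over the concatenation of the first n chunks. The function names and the chunk size are hypothetical; in ARCAM the client-side hash is computed in the browser, while here both sides are simulated in Python.

```python
import hashlib
from typing import List

CHUNK_SIZE = 4  # bytes; hypothetical value, kept tiny for illustration


def prefix_hash(data: bytes, n_chunks: int, chunk_size: int = CHUNK_SIZE) -> str:
    """SHA-1 hex digest of the first n_chunks chunks of the selected file."""
    return hashlib.sha1(data[:n_chunks * chunk_size]).hexdigest()


def can_resume(selected_file: bytes, stored_chunks: List[bytes],
               chunk_size: int = CHUNK_SIZE) -> bool:
    """Coherence test: does the selected file start with the stored chunks?"""
    n = len(stored_chunks)
    client_hash = prefix_hash(selected_file, n, chunk_size)  # client side
    server_hash = hashlib.sha1(b"".join(stored_chunks)).hexdigest()  # server side
    return client_hash == server_hash


def is_complete(selected_file: bytes, stored_chunks: List[bytes]) -> bool:
    """Final integrity check on the hash of the whole file."""
    received = hashlib.sha1(b"".join(stored_chunks)).hexdigest()
    return hashlib.sha1(selected_file).hexdigest() == received
```

Only the digest travels to the server, so the check costs one small request regardless of how many chunks were already uploaded.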

To make it possible to manage huge files over the HTTPS channel, the web application is developed using the browser feature called XMLHttpRequest [4,5]. This feature is not included in some legacy versions of the most popular web browsers; in that case, a courtesy page is shown when a user tries to access the HTTPS transmission form.

The SHA-1 algorithm is used to compute the hash. It does not provide the same security level as other algorithms (e.g. SHA-2); however, it provides a reasonable security level and, from a performance perspective, it is more efficient [6].

Data confidentiality is guaranteed during the transmission (the channel is encrypted using the HTTPS protocol), and only users of the same supplier can operate on their data sets. They can collaborate to upload the same data set (in different working sessions, not simultaneously), but they are not allowed to download a fully or partially uploaded data set.
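The access rules for supplier users can be summarised in a few lines. The sketch below is only an assumption about how such a check might look; the `User` and `DataSet` classes, the `active_uploader` field and the action names are hypothetical and do not come from the ARCAM code base.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    name: str
    supplier: str


@dataclass
class DataSet:
    supplier: str
    # Name of the user currently holding the upload session, if any.
    active_uploader: Optional[str] = None


def is_allowed(user: User, data_set: DataSet, action: str) -> bool:
    """Return True if a supplier user may perform the action on the data set."""
    if user.supplier != data_set.supplier:
        return False  # only users of the same supplier can operate on it
    if action == "upload":
        # Collaboration is allowed, but not simultaneously.
        return data_set.active_uploader in (None, user.name)
    # Downloading an uploaded data set (fully or partially) is never allowed.
    return False
```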

2.2. Monitoring and managing of data transmissions

The system gives administrators access to a set of functionalities for real-time monitoring and management of data transmissions (Figure 2). Among the most relevant functionalities, administrators can:

  • check data transmissions status reports;
  • send reminders to users not complying with the assigned deadlines;
  • assign a transmission channel (HTTPS, SFTP or Web Service) to a data set;
  • enable or disable data transmission for providers, i.e. external authorities or institutions sending data files through the system;
  • enable or disable Istat requesting users as end points of specific data transmissions;
  • reject a specific data transmission due to errors or inconsistencies;
  • set up the new annual data transmission timetable using previous timetables as templates.
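Two of the functionalities listed above, the timetable template and the deadline reminders, lend themselves to a short sketch. The `ScheduledTransmission` record and the rule of shifting every deadline by exactly one year are assumptions made for illustration, not a description of the actual ARCAM logic.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List


@dataclass(frozen=True)
class ScheduledTransmission:
    data_set: str
    supplier: str
    deadline: date
    received: bool = False


def shift_year(d: date) -> date:
    """Move a date one year forward; 29 February falls back to 28 February."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def next_year_timetable(previous: List[ScheduledTransmission]) -> List[ScheduledTransmission]:
    """Use the previous timetable as a template for the new annual one."""
    return [replace(entry, deadline=shift_year(entry.deadline), received=False)
            for entry in previous]


def overdue(timetable: List[ScheduledTransmission], today: date) -> List[ScheduledTransmission]:
    """Entries whose deadline has passed without a transmission: reminder candidates."""
    return [entry for entry in timetable
            if not entry.received and entry.deadline < today]
```

An administrator tool could call `overdue` periodically and send one reminder per returned entry.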

Figure 2

2.3. Database

The database model meets the need to manage both descriptive information about a dataset (e.g. name, supplying institution, reference period of the data contents) and information relating to data acquisition (e.g. number of files, beginning and end of file uploads). To accomplish this goal, the database joins information supplied by users during the transmission with preloaded metadata of the dataset. Login to ARCAM is performed using Istat LDAP [7]; for this reason no login details are stored in the database.
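As an illustration of how preloaded metadata and transmission information can be joined, the following sketch uses an in-memory SQLite database. The table and column names are hypothetical and deliberately simplified; they are not the ARCAM schema shown in the figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (              -- preloaded descriptive metadata
    id               INTEGER PRIMARY KEY,
    name             TEXT NOT NULL,
    supplier         TEXT NOT NULL,
    reference_period TEXT NOT NULL
);
CREATE TABLE file_upload (          -- filled in during the transmission
    id           INTEGER PRIMARY KEY,
    dataset_id   INTEGER NOT NULL REFERENCES dataset(id),
    file_name    TEXT NOT NULL,
    upload_start TEXT NOT NULL,
    upload_end   TEXT               -- NULL while the upload is in progress
);
""")
conn.execute(
    "INSERT INTO dataset VALUES (1, 'example_registry', 'example_supplier', '2016')"
)
conn.executemany(
    "INSERT INTO file_upload (dataset_id, file_name, upload_start, upload_end) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "data.csv", "2016-05-02T09:00", "2016-05-02T09:40"),
        (1, "documentation.pdf", "2016-05-02T09:45", None),
    ],
)


def acquisition_summary(connection: sqlite3.Connection) -> list:
    """Join descriptive metadata with acquisition information per dataset."""
    return connection.execute("""
        SELECT d.name, d.supplier, d.reference_period,
               COUNT(f.id)         AS n_files,
               MIN(f.upload_start) AS first_upload_start,
               MAX(f.upload_end)   AS last_upload_end
        FROM dataset AS d
        LEFT JOIN file_upload AS f ON f.dataset_id = d.id
        GROUP BY d.id
    """).fetchall()
```

Note that no credential table appears in the sketch, in line with the delegation of login to LDAP.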

This information is updated through functionalities of the web application. Users have access according to their role and can be classified as:

  • administrator users: Istat users who monitor and update descriptive information. Moreover, they manage transmission errors or irregularities and grant access to dataset operations;
  • requesting users: authorized Istat users performing operations on datasets;
  • provider users: supplier users performing file transmissions to ARCAM.

All information relating to users and their profiles is stored in tables whose schema is shown in Figure 3.

Figure 3

Figure 4 shows the relations that bind data transmissions, administrative data sets and supplier users. This information is necessary for administrators to perform data monitoring and planning for the following years.

Figure 4

Figure 5 shows the relations that bind all the information for the monitoring of data acquisition.

Figure 5

Finally, Figure 6 is the schema that describes the association between users, data transmissions and the operations that can be performed on them.

Figure 6

Database information is also used to promote datasets from the temporary storage area to the permanent repository, which is the area accessible to requesting users.
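The life cycle of a dataset across the two storage areas can be modelled as a small state machine. The sketch below keeps everything in memory and uses hypothetical names (`Area`, `StoredDataSet`); it only illustrates the rules stated in the text: suppliers may modify a dataset while it is in the temporary area, and after promotion it becomes read-only and available to requesting users.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Area(Enum):
    TEMPORARY = "temporary"
    PERMANENT = "permanent"


@dataclass
class StoredDataSet:
    name: str
    files: Dict[str, bytes] = field(default_factory=dict)
    area: Area = Area.TEMPORARY

    def put_file(self, file_name: str, content: bytes) -> None:
        """Supplier side: add or replace a file while still in the temporary area."""
        if self.area is Area.PERMANENT:
            raise PermissionError("promoted data sets cannot be changed")
        self.files[file_name] = content

    def promote(self) -> None:
        """Move the data set to the permanent repository."""
        self.area = Area.PERMANENT

    def read_file(self, file_name: str) -> bytes:
        """Requesting-user side: only promoted data sets are accessible."""
        if self.area is not Area.PERMANENT:
            raise PermissionError("data set has not been promoted yet")
        return self.files[file_name]
```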

3. Results

In the first year of activity (2016), many legacy transmission channels (e.g. DVD, CD-ROM) were brought into the unified ARCAM process, adding the possibility of monitoring and more security to the transmissions. More than 50 suppliers have uploaded approximately 400 files belonging to 400 administrative data sets, for a total of 50 GB.

4. Conclusions

The main purpose of ARCAM is to replace acquisitions of Admin Data that do not comply with the provisions of the Authority regarding the processing and security of personal data [8].

The new architecture ensures the integrity of the supplied data and compliance with privacy regulations. ARCAM will also be used for receiving National Resident Population, through the use of the Porta di Dominio [9] (a security protocol used by Italian public administration institutions for data transfer).

Future development of ARCAM foresees integration with the Unitary System of Metadata (SUM) and the Programma Statistico Nazionale (PSN, the National Statistics Program, which classifies all statistical surveys that could have public relevance).

References

[1] Modernisation at Istat: an operational model for both production process and organisational aspects. Piero Demetrio Falorsi and Nadia Mignolli (INS, Bucharest, Romania, March 17th, 2016)

[2] Modernisation in Istat. Nadia Mignolli (UNECE 2014)

[3] Reversing the flow: from an integrated system of administrative microdata to an infrastructure for the users. Simone Ambroselli and Giuseppe Garofalo (NTTS 2015)

[4] New Tricks in XMLHttpRequest2

[5] XMLHttpRequest Living Standard

[6] Crypto++ 5.6.0 Benchmarks.

[7] Network Working Group, RFC 4511. IETF.org. 2006-06-01. Retrieved 2014-04-04

[8] Agenzia per l’Italia Digitale, Presidenza del Consiglio dei Ministri, Linee guida sulla conservazione dei documenti informatici

[9] DigitPA, Sistema pubblico di cooperazione: Porta di Dominio

[1] Italian National Institute of Statistics (Istat)