PREPARATION OF THE PORTUGUESE DATA WAREHOUSE FOR EXCHANGE OF MICRO-DATA ON INTRA-EU TRADE BETWEEN MEMBER STATES (SIMSTAT PILOT)
Contents
PREPARATION OF THE PORTUGUESE DATA WAREHOUSE FOR EXCHANGE OF MICRO-DATA ON INTRA-EU TRADE BETWEEN MEMBER STATES (SIMSTAT PILOT)
Introduction
Data Warehouse and ETL Processes
Business Intelligence for Data Quality Evaluation
Introduction
This paper provides a brief explanation of how the Portuguese NSI developed an informational infrastructure to receive micro-data (enterprise-level data) on intra-EU exports in goods exchanged between Member States. The monthly processes that enable corrections, updates and the incorporation of new transactions are presented.
The data mart built to enable data comparison and to allow data quality evaluation is also described.
Data Warehouse and ETL Processes
During the experiment in April 2014, the data was verified and loaded directly into the database without going through any particular processes, because it was the first version of the 2012 and 2013 data ever received. From April 2015 onwards, however, we will receive from the other Member States (MSs) both data which is new, and thus inserted, and data which should be removed. The process described below was tested on data received from Germany, which already contained records with record action D.
All records received from a Member State are loaded into a table in the staging area (SA) following the naming convention XX_simstat_AAAAMM_vN, where (a parsing sketch follows the list):
-XX designates the country; codes follow ISO alpha-2.
-AAAA indicates the year in which the data is being received.
-MM indicates the month in which the data is being received; codes go from 01 to 12.
-N specifies the version; codes can be:
- 0 for chunks of data;
- 1 for data reported 20 days after the reference period;
- 2 for data reported 25 days after the reference period;
- 3 for data reported 30 days after the reference period;
- 4 for data reported 35 days after the reference period.
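As an illustration of this convention, the following minimal Python sketch parses a staging-area table name into its components; the regular expression is an assumption for illustration, not the production validator:

import re

NAME_RE = re.compile(r"^(?P<country>[A-Z]{2})_simstat_"
                     r"(?P<year>\d{4})(?P<month>0[1-9]|1[0-2])_"
                     r"v(?P<version>[0-4])$")

def parse_staging_name(table_name):
    # Split a staging-area table name into country, year, month, version.
    m = NAME_RE.match(table_name)
    if m is None:
        raise ValueError("not a valid SA table name: %r" % table_name)
    return m.groupdict()

# Example: data received from Germany in April 2015, version 1
# (reported 20 days after the reference period).
print(parse_staging_name("DE_simstat_201504_v1"))
# -> {'country': 'DE', 'year': '2015', 'month': '04', 'version': '1'}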
In the SA the data is cleansed: extra spaces are removed, the timestamp format is enforced, and some minor verifications concerning the validity of the commodity codes are performed.
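A minimal sketch of this cleansing step follows. The field names (time_production, commodity_code), the timestamp format and the 8-digit commodity-code rule are assumptions for illustration, since the paper does not specify the exchanged file layout:

from datetime import datetime

def cleanse_record(rec):
    # Trim extra spaces from every string field.
    out = {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}
    # Enforce a single timestamp format (the format string is an assumption).
    out["time_production"] = datetime.strptime(
        out["time_production"], "%Y-%m-%d %H:%M:%S")
    # Minor validity check: CN commodity codes are 8-digit numeric strings.
    cc = out["commodity_code"]
    if not (len(cc) == 8 and cc.isdigit()):
        raise ValueError("invalid commodity code: %r" % cc)
    return out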
The data is then divided into new data to be inserted (table simstat_i) and data to be removed (table tf_simstat_d), according to its record action. These new tables still follow the design of the exchanged file but are moved to the operational data store (ODS).
After this separation, records which indicate a remove action are inspected for their reference period. The current policy in Portugal is to revise only the last two years, so any record which implies deletions on older records is not executed but is reported to the analysts’ team for further inspection. Following the team’s decision, these records can be deleted later on.
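A sketch of this policy filter, assuming reference periods are YYYYMM strings and using hypothetical field names:

def split_deletes_by_policy(delete_records, current_period):
    # Only the last two years are revised; older deletions are referred
    # to the analysts' team instead of being executed.
    cutoff_year = int(current_period[:4]) - 2
    executable, referred = [], []
    for rec in delete_records:
        if int(rec["reference_period"][:4]) >= cutoff_year:
            executable.append(rec)
        else:
            referred.append(rec)   # reported, not executed
    return executable, referred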
In order to perform the delete action, which can act upon data already in the data warehouse or data received in the current file, a temporary table (tf_simstat_i) is assembled in the ODS with all the data from the current MS for the last two years; data from the current insert table is added. It is not clear whether, after a record is deleted, its unique record identifier (URI) will be reused, which could originate multiple delete statements over the same URI with different production timestamps. The system was built to be able to deal with this problem should it arise. A selection grouping all the records by URI is performed on the table with the delete statements to determine the number of times the deletion process should be executed upon the data, iterations_number.
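The grouping selection can be sketched as follows, assuming the delete records carry a uri field:

from collections import Counter

def iterations_number(delete_records):
    # A URI may be reused, so several delete statements can share one URI;
    # the deletion process must run once per "layer" of deletes.
    counts = Counter(rec["uri"] for rec in delete_records)
    return max(counts.values(), default=0)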
The deletion process begins by building a table to store all the removed information. This table in the ODS will, at the end of the process, be directed to a data mart to enable the analysts to verify all the operations. The table XX_simstat_removal_<E> will have as many instances as there are executions and will be consolidated at the end of the process. In this table, all the information from the original record is stored side by side with information from the record that issued its deletion. This operation is performed not only to enable the team to be sure about the deletions performed but also to allow updates to be executed. More on this subject will follow after the deletion process explanation.
In each erase iteration, the records to be deleted from the target table (tf_simstat_i) are picked not only by their URI but also by selecting the records with the earliest production timestamp, to prevent inserts with the same URI performed after the deletion from being mistakenly removed.
Table XX_simstat_removal_<E>:
 data from record with RA = 'I': URI / time_production / sending_ms / …
 data from record with RA = 'D': del_URI / del_time_production / del_sending_ms / del…
These records are then entered in the control table XX_simstat_removal_<E> in the leftmost columns. The records which issued the removal are likewise picked from the deletion table tf_simstat_d by their URI and earliest production timestamp, since a record whose URI is reused can be inserted and deleted several times over. Their information is then entered in the control table in the rightmost columns.
Several checks are then performed to ensure that:
-Time_Production is earlier than Del_Time_Production;
-All the other fields have the same value.
If one of these tests fails, the record’s remove_flag is set to N (not to be removed) and the analysts’ team is alerted to the fact.
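The matching and checking logic of one erase iteration can be sketched as follows, assuming both working tables are lists of dicts; the field names and the flag code 'Y' for a cleared record are assumptions (the paper only fixes the code N):

def earliest_per_uri(records):
    # For each URI keep only the record with the earliest production
    # timestamp, as described for both tf_simstat_i and tf_simstat_d.
    best = {}
    for rec in records:
        uri = rec["uri"]
        if uri not in best or rec["time_production"] < best[uri]["time_production"]:
            best[uri] = rec
    return best

def check_pair(ins, dele):
    # Time_Production must be earlier than Del_Time_Production.
    if not ins["time_production"] < dele["time_production"]:
        return "N"            # failed: keep the record, alert the team
    # All the other fields must carry the same value.
    for field in set(ins) - {"time_production", "record_action"}:
        if ins[field] != dele.get(field):
            return "N"
    return "Y"                # assumed code for "cleared for removal"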
The erase process continues by marking for deletion all the records in the insert table (tf_simstat_i) which appear in the control table XX_simstat_removal_<E> with remove_flag different from N; they are then physically removed from the table.
In an analogous process, all the records in the delete table (tf_simstat_d) which appear in the control table XX_simstat_removal_<E> with remove_flag different from N are marked and physically removed from table tf_simstat_d.
When iterations_number is greater than 1, there will still be records in the delete table tf_simstat_d after the first pass, and the whole process is performed again. The process is executed recursively until the number of records in table tf_simstat_d reaches 0 and the number of executions reaches iterations_number.
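The outer loop can be sketched as below. Here erase_iteration stands in for the matching, checking and physical-removal pass described above and is injected as a parameter, since its exact implementation depends on the database layer:

def run_deletion_process(tf_simstat_i, tf_simstat_d, erase_iteration, n_iter):
    # erase_iteration(tf_simstat_i, tf_simstat_d) must perform one pass of
    # matching, checking and physical removal, and return the control rows
    # (one XX_simstat_removal_<E> instance) for that execution.
    removal_tables = []
    executions = 0
    while tf_simstat_d and executions < n_iter:
        removal_tables.append(erase_iteration(tf_simstat_i, tf_simstat_d))
        executions += 1
    # The per-execution control tables are consolidated at the end.
    return [row for table in removal_tables for row in table]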
The records in the target insert table tf_simstat_i are then placed in the data warehouse, in the mirror dispatch data mart, replacing the original piece used to build the table in the ODS.
Update Processes
During the examination of the records in the control table XX_simstat_removal_<E> where the fields of the deletion record do not match those previously on the insertion record, the analysts may decide to perform an update.
The update is done by removing the previous record with record action I from the target table tf_simstat_i and inserting into this table the record with record action D, which is then changed to I. The field remove_flag in the control table XX_simstat_removal_<E> is then altered to U to indicate an update.
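A minimal sketch of this update, reusing the control-row shape from the deletion sketches above (the 'ins'/'del' keys and field names are assumptions):

def apply_update(control_row, tf_simstat_i):
    # The previous RA = 'I' record is removed from the target table ...
    old = control_row["ins"]
    tf_simstat_i.remove(old)
    # ... the RA = 'D' record takes its place, with its action flipped to I ...
    new = dict(control_row["del"])
    new["record_action"] = "I"
    tf_simstat_i.append(new)
    # ... and the control table marks the operation as an update.
    control_row["remove_flag"] = "U"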
Business Intelligence for Data Quality Evaluation
The data received is used to link the nationally collected arrival data with the corresponding mirror dispatch data exchanged, and to analyse the degree of match between the two data sources.
To give a first assessment of the usability of the micro-data as a possible alternative source for nationally collected arrival data, a data mart was built with the data received from the other MSs and information already available in the Portuguese Data Warehouse.
The mirror dispatch data mart refreshment processes were already described in the previous section, and a Business Intelligence (BI) model was designed on top of them to allow the necessary quality studies.
The analysis focuses on how well the mirror dispatch micro-data matches the national data, and several reports were designed to produce the tables proposed by the Quality Evaluation Plan.
The nationally collected arrival data in the data warehouse was copied to a mirror data mart, to sit alongside the information received from the MSs, and transformed accordingly, for example by concatenating year and month to obtain the Reference Period.
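For example, the Reference Period key can be built as in this small sketch (assuming a YYYYMM layout):

def reference_period(year, month):
    # Concatenate year and month into the linking key,
    # e.g. reference_period(2013, 4) -> '201304'.
    return "%04d%02d" % (year, month)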
In the BI the information that comes from the nationally collected arrival data mart is always preceded by SLVG to indicate the provider.
Having the data with the same structure linked within the data mart allows for value comparisons, but also enables us to discriminate the values for which the PSIs were identified in both the receiving and the sending databases, or in just one of them (Quality Report Table 1).
The difference in value between the national arrivals and the mirror dispatches (Quality Report Tables 2 and 7) is calculated in the same fashion.
The BI allows the construction of variables such as the real percentage difference, derived from other variables in the report, and also the building of classes for match level (Quality Report Tables 3 and 8).
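As an illustration of such derived variables, the sketch below computes a percentage difference and assigns a match-level class; both the formula and the class boundaries are assumptions, not those fixed by the Quality Evaluation Plan:

def real_percentage_difference(national_value, mirror_value):
    # Mirror dispatch value against the national arrival value;
    # a zero national value would need special handling.
    return 100.0 * (mirror_value - national_value) / national_value

def match_level(pct_diff):
    # Illustrative class boundaries only.
    d = abs(pct_diff)
    if d <= 5:
        return "close match"
    if d <= 20:
        return "moderate match"
    return "poor match"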
Furthermore, the commodity code was subdivided into 2, 4 and 6 digits to enable these intermediate aggregations (Quality Report Tables 4, 5 and 10), and line counters were introduced in the dispatch data received (Quality Report Table 7).
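The subdivision itself is a simple prefix operation, sketched here for an 8-digit code:

def commodity_aggregates(cn8):
    # 2-, 4- and 6-digit prefixes of the 8-digit commodity code,
    # used for the intermediate aggregations.
    return {"cn2": cn8[:2], "cn4": cn8[:4], "cn6": cn8[:6], "cn8": cn8}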
Although the BI tool enables report reuse and document scheduling, the analyses across the different countries may have varying conditions which have to be manually accounted for.