Guidelines for Phase 1 of the Project Micro Data Linking of Structural Business Statistics


Guidelines for phase 1 of the project “Micro Data Linking of structural business statistics and other business statistics”

Table of Contents

1 MDL database in general

2 Phases of the Micro Data Linking project

3 Database structure

4 Design of database

4.1 Core population of the MDL database

4.2 Entities in the MDL database

5 Building the MDL database

5.1 Dataset of core population

6 Data sources and variables

6.1 Data source 1, National Business Register

6.2 Data source 2, Structural Business Statistics

6.3 Data source 3, International Trade in Goods Statistics

6.4 Data source 4, International Trade in Services Statistics

6.5 Data source 5, Community Innovation Survey

6.6 Data source 6, ICT Usage and e-Commerce in enterprises Survey

6.7 Data source 7, Outward Foreign Affiliates Statistics

6.8 Data source 8, Inward Foreign Affiliates Statistics

6.9 Data source 9, Business Demography Statistics

6.10 Data source 10, International Organization and sourcing survey

6.11 No-match populations

7 Time table and deliverables

7.1 Timetable

7.2 Deliverables

8 Description of and guidance to the delivered syntax

8.1 Modification of syntax

Annex A: Variable lists

Annex B: Description of GVC dataset

This document contains detailed guidelines for the Eurostat project on micro data linking of structural business statistics and other business statistics (Topic 1).

The project will establish national databases containing the most central structural business statistics, with information available at enterprise level. This will be used to conduct micro-level economic analyses of essential issues such as differences in economic performance in exporting and non-exporting enterprises or innovative enterprises.

Beyond this, the project will provide the basis for further analyses in the future, based on the national databases established in the project.

1 MDL database in general

To the extent possible, the new MDL database (MDL 2014) will be structured using input data for the reference period of 2008-2012 from Structural Business Statistics, International Trade in Goods Statistics, International Trade in Services Statistics, Community Innovation Survey, ICT usage and e-Commerce in enterprises Survey, Foreign Affiliate Statistics (Inward and Outward), Business Demography Statistics, International Organization and Sourcing Survey and the national Business Register.

Information from the National Business Register will be central for establishment of the database as the issue of identity over time is essential for longitudinal micro-level analysis.

The aim of this project is to establish datasets with micro data for consolidated longitudinal panels for the period 2008-2012, controlled for demographic events such as mergers, acquisitions, enterprise group relations, and births and deaths of enterprises. Separate guidelines for this part of the project will be provided during phase 2 (starting in June 2014). The three phases of the project are presented briefly in the following section.

2 Phases of the Micro Data Linking project

Three project phases

The micro data linking project is divided into three phases:

  1. Matching and adjustment of data and structuring of database

(Data sources, variables and reference periods)

  2. Validation controls and calculation of weights for the control group population(s) where necessary

(Demographic events and relations, and weights for the control group according to the composition of the target population, which will be defined for a given analysis, e.g. innovative enterprises)

  3. Production of standardized output

(Descriptive statistics, longitudinal and regression analysis)

The micro data linking project and its phases are illustrated in figure 1.

Figure 1: The phases of the MDL 2014-2015 project

Phases 1, 2 and 3 will to some extent run simultaneously in 2014, while in 2015 the project will focus on phase 3 and on summarizing the MDL methodology established in the project.

This document contains guidelines for phase 1, matching and adjustment of variables from different data sources and structuring of datasets.

3 Database structure

Annual dataset for each data source

On the basis of the 10[1] data sources with 5 reference years each (besides the GVC survey), the database will hold a total of 46 datasets.

9 data sources will each be stored in 5 annual datasets with information aggregated at enterprise level (45 datasets). Furthermore, there will be one dataset for the International Organization and Sourcing Survey that covers the whole period from 2009 to 2011. Finally, the database will hold one dataset with key information about the coverage of the database (see section 4.1).
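The count of 46 source datasets can be checked with a short sketch. This is only an illustration in Python: the source abbreviations follow section 5.1, while the naming scheme "<SOURCE>_<YEAR>" and the name "GVC_2009_2011" are hypothetical and not prescribed by these guidelines.

```python
# Abbreviations as listed in section 5.1; dataset names are hypothetical.
ANNUAL_SOURCES = ["BR", "SBS", "ITGS", "ITS", "CIS", "EC", "OFATS", "IFATS", "BD"]
YEARS = range(2008, 2013)  # reference years 2008-2012

# One dataset per annual source and reference year.
annual_datasets = [f"{src}_{year}" for src in ANNUAL_SOURCES for year in YEARS]

# The GVC survey contributes a single dataset covering 2009-2011.
source_datasets = annual_datasets + ["GVC_2009_2011"]

print(len(annual_datasets))  # 45
print(len(source_datasets))  # 46
```

The dataset with key information about the coverage of the database comes in addition to these 46 source datasets.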

Enterprise ID as the primary key

Each of the datasets will hold selected variables from the data source (see the detailed sections for each data source, as well as Annex A). The primary key of the database will be the enterprise ID (ENT_ID). Information from the different sources will be extracted for the whole population of the given data source at enterprise level (ENT_ID).

In addition, the enterprise group ID (ENTgr_ID) will be stored together with the enterprise ID in the annual datasets with information from the Business Register, and it will be used as a secondary key in the database. This information will be used to link OFATS data, to validate data, and possibly to aggregate data for analysis.
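As a minimal sketch of how the two keys might be used together, the Python example below links a source dataset to the Business Register on ENT_ID and then aggregates to ENTgr_ID. The records and the variable name TURNOVER are hypothetical; the standardized syntax delivered with the project may work differently.

```python
from collections import defaultdict

# Hypothetical records: BR supplies the ENT_ID -> ENTgr_ID link,
# SBS supplies a turnover value per enterprise.
br_2010 = [
    {"ENT_ID": "A", "ENTgr_ID": "G1"},
    {"ENT_ID": "B", "ENTgr_ID": "G1"},
    {"ENT_ID": "C", "ENTgr_ID": "G2"},
]
sbs_2010 = [
    {"ENT_ID": "A", "TURNOVER": 100.0},
    {"ENT_ID": "B", "TURNOVER": 50.0},
    {"ENT_ID": "C", "TURNOVER": 70.0},
]

# Primary key: look up each enterprise by ENT_ID.
group_of = {row["ENT_ID"]: row["ENTgr_ID"] for row in br_2010}

# Secondary key: sum enterprise values within each enterprise group.
group_turnover = defaultdict(float)
for row in sbs_2010:
    group_id = group_of.get(row["ENT_ID"])
    if group_id is not None:  # enterprises without a group link are skipped here
        group_turnover[group_id] += row["TURNOVER"]

print(dict(group_turnover))  # {'G1': 150.0, 'G2': 70.0}
```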

4 Design of database

4.1 Core population of the MDL database

All unique enterprise IDs in 2008-2012 will be the core population

Initially, the core population of the database was planned to cover all active[2] enterprises according to the SBS statistics for each of the years 2008-2012. However, in order to keep as much information as possible in the MDL database, the core population will be constructed from the unique enterprise IDs of all enterprises from all sources (not restricted to the non-financial business economy) for all reference periods between 2008 and 2012.

Thus, the core population will hold ENT_IDs that might not exist in each of the reference years in one or more of the data sources, but that have appeared at least once in one or more of the statistical sources (besides the Business Register) during the period 2008-2012.

For the purpose of structuring data and documentation, one dataset with key information will be stored in the database. This dataset will hold all unique enterprise IDs and express which data sources hold information on the given ENT_ID.

Table 1: Example of dataset with core population of the MDL database

Table 1 illustrates the layout of the dataset with the core population of the MDL database. For each unique ENT_ID, the dataset stores variables indicating in which reference periods of the different data sources the enterprise ID is present (1 = ENT_ID is present in the dataset, 0 = ENT_ID is not present in the dataset).

In the example above, the first record shows that ENT_ID A is present in all the listed sources for all reference years. The second row shows that ENT_ID B is not found in SBS 2010 and SBS 2012. The third row shows that ENT_ID C is not present in SBS 2008, SBS 2010, SBS 2012, ITGS 2009 and ITS 2009.

The core population of the database is expected to exceed the number of unique enterprise IDs in each of the data sources, because it includes enterprise IDs that are present in only one or a few data sources. This approach will be helpful during the validation and analytical phases of the project.
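The construction of the core population dataset can be sketched as follows. This is a minimal Python illustration, not part of the delivered syntax; the dataset names and ENT_ID values are hypothetical, and the Business Register is left out of the union because it is not used to define the core population.

```python
# Hypothetical sets of ENT_IDs per source and reference year.
source_ids = {
    "SBS_2008": {"A", "B"},
    "SBS_2009": {"A", "B", "C"},
    "SBS_2010": {"A"},
    "ITGS_2009": {"A", "B"},
}

# Core population: every ENT_ID that occurs at least once in any source.
core_population = sorted(set().union(*source_ids.values()))

# One row per ENT_ID with a 0/1 presence flag for each source dataset.
core_dataset = [
    {"ENT_ID": ent_id,
     **{name: int(ent_id in ids) for name, ids in source_ids.items()}}
    for ent_id in core_population
]

for row in core_dataset:
    print(row)
```

In this sketch the core population has three IDs, which is at least as many as any single source dataset holds, and each ENT_ID appears in exactly one row.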

No total compliance between data sources

Various data sources in the database do not cover the whole core population as described above. Methods for imputation and repetitive re-weighting to correct for selective samples will be considered and implemented where necessary at a later stage of the project.

4.2 Entities in the MDL database

Enterprise levelEnterprise group level

As some type of information related to for instance foreign affiliates, R&D activities or exports might be more correctly linked and interpreted at the level of the enterprise group, it is the intention – if the methodological analysis proves the feasibility of this – to aggregate, disaggregate and analyze data from the different data sources at both enterprise and enterprise group level.

Figure 2: Structure of the MDL 2014-2015 database

5 Building the MDL database

5.1 Dataset of core population

Establish core population

As mentioned earlier, the MDL database will take as its starting point the core population of enterprise identification numbers that exist in one or more years (2008-2012) in one or more of the following data sources: SBS, ITGS, ITS, CIS, EC, OFATS, IFATS, BD and GVC.

All the enterprise IDs from the core population will be stored in each of the National Business Register datasets for 2008, 2009, 2010, 2011 and 2012 with the relevant background information. Thus, even IDs of enterprises that are inactive in a given reference year might be in the dataset; in that case all values for the particular reference year will be missing (.).

Each enterprise ID will occur in the core population of the MDL database only once.
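The way an annual Business Register dataset is filled for the whole core population might look like the sketch below. It is written in Python for illustration only; the variable name LEGAL_FORM, the example values and the use of None for the missing value (.) are assumptions.

```python
core_population = ["A", "B", "C"]

# Hypothetical BR extract for 2010: enterprise C was not active in that year.
br_2010_raw = {
    "A": {"ENTgr_ID": "G1", "LEGAL_FORM": "LTD"},
    "B": {"ENTgr_ID": "G1", "LEGAL_FORM": "PLC"},
}
BR_VARIABLES = ["ENTgr_ID", "LEGAL_FORM"]

# Keep one row per core-population ID, even when BR has no record for the year.
br_2010 = []
for ent_id in core_population:
    values = br_2010_raw.get(ent_id, {})
    row = {"ENT_ID": ent_id}
    for var in BR_VARIABLES:
        row[var] = values.get(var)  # None stands in for the missing value (.)
    br_2010.append(row)

for row in br_2010:
    print(row)
```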

Standardized syntax will use EU-variable names to build the database files

In order to make construction of the database less time consuming, the standard syntax provided along with this guideline will help to rename and format the EU variables according to the MDL standard.
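The renaming step can be pictured with the sketch below. It is not the delivered syntax: it is a Python illustration in which the input names (V12110, V16130) and the MDL names (TURNOVER, EMPLOYEES) are placeholders chosen for the example.

```python
# Hypothetical mapping from input variable names to MDL standard names.
RENAME_MAP = {"V12110": "TURNOVER", "V16130": "EMPLOYEES"}


def to_mdl_standard(record, rename_map=RENAME_MAP):
    """Return a copy of the record with mapped names; other names are kept."""
    return {rename_map.get(name, name): value for name, value in record.items()}


raw = {"ENT_ID": "A", "V12110": 1200.0, "V16130": 4}
print(to_mdl_standard(raw))
# {'ENT_ID': 'A', 'TURNOVER': 1200.0, 'EMPLOYEES': 4}
```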

Each of the datasets described in the next section will hold selected variables from the relevant data source and the primary key to connect the datasets will be the enterprise ID (ENT_ID).

Information on the link between ENT_ID and enterprise group ID (ENTgr_ID) will be stored in the dataset with information from the national Business Register and will be used during the validation phase and for production of standardized output.

Output will be produced using standard syntax

Analytical output will be produced at a later stage of the project, using standardized syntax that will combine the relevant datasets and extract the required information in a harmonized manner in each participating NSI.

6 Data sources and variables

Reference period for all sources: 2008-2012

The period of measurement to be included in this project is 2008-2012. All data sources should be available covering the whole period at the latest in autumn 2014. The International Organization and Sourcing Survey data covers the period of 2009-2011 only.

Data sources for the MDL 2014-2015 database are:

  • BR, National Business Register, 2008-2012
  • SBS, Structural Business Statistics, 2008-2012
  • ITGS, International Trade in Goods Statistics, 2008-2012
  • ITS, International Trade in Services Statistics, 2008-2012
  • CIS, Community Innovation Survey, 2008-2012
  • ECICTeC, ICT usage and e-Commerce in Enterprises Survey, 2008-2012
  • IFATS/OFATS, Foreign Affiliate Statistics, 2008-2012
  • BD, Business Demography Statistics, 2008-2012
  • GVC, International Organization and Sourcing Survey, 2009-2011

If not available initially, data for the reference period 2012 should be added to the database when available. All guidelines will initially assume availability of this reference period in each NSI.

The variables to be included in the database have been selected on the basis of the metadata reports from the participating countries. It has been decided to exclude a few variables from the project. The decisions were taken primarily because of low availability, but also after an assessment of each variable's contribution to the database.

The following variables have been excluded:

- BR: END_NACE_M (End date for the main activity)

- BR: END_NACE_S (End date for the secondary activity)

- CIS: MKTMET (New or significantly changed sales or distribution methods)

- ECICTeC: EMPINTRAPCT (% of workers with access to intranet)

Moreover, NACE codes have been excluded from all data sources except BR.

Each of the data sources for the database and the contents to be included in the analysis are presented in the next section.

6.1 Data source 1, National Business Register

Data 1: BR

The datasets with information from the national Business Register should hold all enterprise IDs from the core population in each of the annual Business Register datasets. Thus, even IDs of inactive enterprises in a given reference year might be in the dataset – in that case all values for the particular reference year will be missing (.).

Therefore, the datasets with information from the national Business Register will exceed the number of records in the annual SBS datasets. Including information on all enterprise IDs during the whole period helps data mining, as longitudinal panels can be established without searching for data in sources outside the MDL database.

The datasets based on the national Business Register will first and foremost hold information on the link between the unique enterprise ID and unique enterprise group ID[3].

The following information will be included in these datasets where possible:

Information from BR

  • Enterprise ID
  • Enterprise Group ID
  • Administrative ID
  • Start date for the enterprise ID
  • End date for the enterprise ID
  • Start date for the enterprise Group ID
  • End date for the enterprise Group ID
  • Legal form of the enterprise ID
  • Main activity of the enterprise
  • Secondary activity of the enterprise
  • Ownership of the enterprise (private/public)
  • Start date for the main activity
  • Start date for the secondary activity
  • Ownership relation with associated direct ownership indicated as percentages for each enterprise ID
  • Information on demographic relations (mergers and acquisitions etc.)

6.2 Data source 2, Structural Business Statistics

Data 2: SBS

The population of enterprises from Structural Business Statistics (all size classes) with unique enterprise ID numbers will be the base of SBS datasets in the MDL database.

Thus, the annual SBS datasets will hold the enterprise IDs of all active enterprises in the given reference year. As a result, the MDL database will cover all ENT_IDs that are present at least once in this data source.

Later in phase 2 of the project, information from the national Business Register will be used to examine population(s) of ENT_IDs that do not exist in all of the reference years between 2008 and 2012 in order to investigate and if necessary correct the population for demographic events.

Information from SBS to be included in the database is:

Information from SBS

  • Turnover
  • Value added at factor cost
  • Gross operating surplus
  • Total purchases of goods and services
  • Personnel costs
  • Wages and salaries
  • Number of employees
  • Number of employees in full-time equivalents

Derived variables based on these, i.e. turnover, value added, personnel costs, total purchases of goods and services, and gross operating surplus per employee, will be calculated automatically at a later stage of the project, using a standard syntax provided by the project coordinators.
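The per-employee variables are simple ratios. The Python sketch below only illustrates the calculation and is not the syntax that the project coordinators will provide; the variable names and the rule of returning a missing value when the number of employees is zero or missing are assumptions.

```python
def per_employee(value, employees):
    """Return value / employees, or None when the ratio is undefined."""
    if value is None or not employees:  # missing value, or zero/missing employees
        return None
    return value / employees


# Hypothetical SBS record for one enterprise and reference year.
record = {"ENT_ID": "A", "TURNOVER": 1200.0, "VALUE_ADDED": 300.0, "EMPLOYEES": 4}

derived = {
    f"{name}_PER_EMPLOYEE": per_employee(record[name], record["EMPLOYEES"])
    for name in ("TURNOVER", "VALUE_ADDED")
}
print(derived)
# {'TURNOVER_PER_EMPLOYEE': 300.0, 'VALUE_ADDED_PER_EMPLOYEE': 75.0}
```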

6.3 Data source 3, International Trade in Goods Statistics

Data 3: ITGS

All enterprises from the annual International Trade in Goods statistics will be included in the MDL database. This data source might have multiple records for each unique enterprise ID. Therefore, these records will be aggregated during production of standardized output according to the need for information to be extracted.

Annual information from ITGS to be included is:

Information from ITGS

  • Import amount by country of origin and product
  • Export amount by country of destination and product
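Because one enterprise can have many trade records, the aggregation during output production could be sketched as below. The example is in Python with hypothetical field names (FLOW, PARTNER, PRODUCT, VALUE) and made-up values; the level of aggregation would depend on the information needed for a given output.

```python
from collections import defaultdict

# Hypothetical ITGS records: several rows per ENT_ID.
itgs_2010 = [
    {"ENT_ID": "A", "FLOW": "EXPORT", "PARTNER": "DE", "PRODUCT": "84", "VALUE": 10.0},
    {"ENT_ID": "A", "FLOW": "EXPORT", "PARTNER": "SE", "PRODUCT": "84", "VALUE": 5.0},
    {"ENT_ID": "A", "FLOW": "IMPORT", "PARTNER": "CN", "PRODUCT": "85", "VALUE": 7.0},
    {"ENT_ID": "B", "FLOW": "IMPORT", "PARTNER": "DE", "PRODUCT": "30", "VALUE": 2.0},
]


def aggregate(records, keys):
    """Sum VALUE over all records that share the same combination of keys."""
    totals = defaultdict(float)
    for row in records:
        totals[tuple(row[key] for key in keys)] += row["VALUE"]
    return dict(totals)


# Total imports and exports per enterprise.
by_enterprise_flow = aggregate(itgs_2010, ("ENT_ID", "FLOW"))
print(by_enterprise_flow)
# {('A', 'EXPORT'): 15.0, ('A', 'IMPORT'): 7.0, ('B', 'IMPORT'): 2.0}
```

Keeping the detailed records in the database and aggregating only at output time means the same data can later be summed by partner country or product instead.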

6.4 Data source 4, International Trade in Services Statistics

Data 4: ITS

The population of enterprise IDs in the International Trade in Services statistics for the period 2008-2012 will be included in the database.

International Trade in Services statistics might have multiple records for each unique enterprise ID. Thus, these records will be aggregated during production of standardized output according to the need for information to be extracted.

Information from ITS to be included is:

Information from ITS

  • Import amount by country of origin and EBOPS[4]
  • Export amount by country of destination and EBOPS

6.5 Data source 5, Community Innovation Survey

Data 5: CIS

All enterprise IDs and selected variables from the Community Innovation Survey for the period of 2008-2012 will be included in the database.

Data from CIS is expected to be available at enterprise level, so each ENT_ID will occur only once per reference year in this dataset. However, some enterprises might be reporting on behalf of several enterprises. Corrections for such cases will be handled during validation of the database in phase 2 of the project.
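The expectation of one record per ENT_ID and reference year can be checked before loading the data. The following Python sketch shows one possible check on a hypothetical list of IDs; it is an illustration and not a prescribed validation step.

```python
from collections import Counter

# Hypothetical list of ENT_IDs in the CIS dataset for one reference year.
cis_2010_ids = ["A", "B", "C", "B"]

# Any ENT_ID that occurs more than once violates the expected uniqueness.
duplicates = sorted(
    ent_id for ent_id, count in Counter(cis_2010_ids).items() if count > 1
)
print(duplicates)  # ['B']
```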

Types of information from CIS to be included in the database are:

Information from CIS

  • Product innovation by technology level
  • Process innovation by technology level
  • Organizational innovation by technology level
  • Marketing innovation by technology level
  • Engagement in R&D
  • R&D costs
  • R&D employment

6.6 Data source 6, ICT Usage and e-Commerce in enterprises Survey

Data 6: ECICTeC

The population of enterprises from the ICT Usage survey (ECICTeC) for the period 2008-2012 and the associated variables will also be included in the MDL database.

Types of information from ECICTeC to be included in the database are:

Information from ECICTeC

  • ICT and ICT security and ICT-systems in enterprise
  • Internet access and access ways, homepage and use of internet
  • Amount of e-sales
  • Amount of internet purchases
  • Amount of electronic data exchange with external ICT system

6.7 Data source 7, Outward Foreign Affiliates Statistics

Data 7: OFATS

All available enterprise IDs and enterprise group IDs from Outward Foreign Affiliates Statistics (OFATS) for the period 2008-2012 will be included in the database.

Data on foreign affiliates might contain multiple records for each unique enterprise ID number. In order to keep as much information as possible in the database for later use, these records will also be aggregated during production of standardized output according to the need for information to be extracted.

Both enterprises that are ultimately owned by the compiling country (UCI = compiling country) and enterprises that are ultimately owned from abroad (where available) should be included. Both intra-EU and extra-EU affiliates will be covered.

Types of information from OFATS to be included in the database are:

Information from OFATS

  • Number of foreign affiliates by NACE and host country
  • Number of persons employed in foreign affiliates by NACE and host country
  • Turnover of foreign affiliates by NACE and host country

6.8 Data source 8, Inward Foreign Affiliates Statistics

Data 8: IFATS

Information on all ENT_IDs from the Inward FATS for each of the years 2008-2012 will also be included in the database.