Session No. 5

Paper No.3

Country: Italy

The problem of links between legal units: statistical techniques for the enterprise identification and the analysis of continuity

Giuseppe Garofalo and Caterina Viviano

Italian National Statistical Institute

The authors alone are responsible for any remaining mistakes; the opinions expressed do not involve the Italian National Statistical Institute.

Enquiries and correspondence should be addressed to: or

Key words: statistical business register, enterprise, legal unit, enterprise demography, continuity, probabilistic linkage

1. Introduction

Since 1995 the Italian National Statistical Institute has been developing a complex project for setting up a statistical business register harmonised with the European Community regulation[1].

The first complete version was set up in 1997, with identification characters (enterprise name, address, etc.) updated to 1996 and stratification characters (economic activity code, size, etc.) updated to 1995. At present the 1997 intermediate Census, designed as a sample survey, is being used as a check. The results will allow us to verify the rules and methodologies adopted for the ASIA register and, where necessary, to modify them.

To be economically feasible, the planning and setting-up of a complete and up-to-date statistical register has to exploit, in a synergic process, all the information on enterprises already stored by institutions or public administrations. The Italian statistical business register (called ASIA) has therefore been built by integrating data held in administrative sources[2] and treating them with statistical methodologies.

The use of administrative sources brings advantages in terms of lower costs, faster data availability and a reduced burden on enterprises. On the other hand it raises methodological problems, because the legal population represented by the administrative sources does not correspond to the real population that statistics must represent. Unit definitions differ, as do the criteria for the registration and cessation of units, which are frequently driven by motives (e.g. tax evasion and avoidance) not strictly connected to economic reality.

How can legal units be linked to statistical ones? How can the real birth and death of an enterprise be recognized, avoiding duplication of units and false demographic flows of enterprises? These questions are not easy to answer.

The EU regulation[3] and the Eurostat recommendation[4] provide a clear and exhaustive framework of concepts and definitions. In particular, the statistical definition of the enterprise as “the smallest combination of legal units that is an organisational unit…” and the concept of statistical continuity clearly indicate that the statistical universe corresponds to a subset of the legal one, and that it results from combinations of legal units complying with the criterion of the enterprise as a single organisational structure.

In order to produce unambiguous information, statisticians, and public statisticians first of all, need clear guidelines and ad hoc statistical techniques (feasible in terms of costs and time) for combining legal units (the observed units) with one another, so that the statistical unit and its characteristics, above all its demographic attributes, can be identified.

After briefly presenting some statistical techniques useful for reconstructing an enterprise and identifying non-demographic flows (spurious demography), the paper develops a method based on the “similarity of attributes”. It then shows the results of an experimental application to a file of 154,000 records referring to the Region “Marche” and taken from the fiscal register.

2. Some statistical techniques

For archives such as ASIA that are partially fed by administrative sources, defining statistical continuity requires identifying the dynamic relationships between administrative entities that appear to be different. Where the static connections between administrative and statistical units are not resolved a priori (for instance through surveys), the relationships identified among several legal units allow the statistical unit “enterprise” to be reconstructed.

Since our aim is to build and update a statistical register, techniques based on estimation criteria are not considered; we deal only with techniques that permit the exact identification of relationships between legal units.

Exact relationships can be identified either by direct surveys or by linkage techniques based on observable attributes of legal units. Direct surveys present the usual problems of cost and burden on enterprises. Where information on the connections between legal units has instead been taken from administrative files, it has produced very partial results (see §3.).

Linkage techniques differ according to the available data. They rest on the rule that only units sharing identification characters or other relevant attributes are linked to one another. Three instruments can be used:

1) The identification of links between legal units through employee flows. This technique analyses the flows of employees between two (or more) units; it can be used especially in the analysis of mergers and demergers. It rests on the consideration that if enterprise “b” takes over enterprise “a”, an immediate flow of (almost) all employees from “a” to “b” will be observed. The key methodological element is the discrimination between “physiological” movements, produced by employees’ choices, and “spurious” ones caused by transitions between enterprises. The technique needs individual longitudinal data on employees and cross-section data on the relationships between employees and enterprises. In Italy this information is available only in the Social Security register. An experiment was carried out as part of the feasibility plan for the ASIA register.

2) The identification of links between legal units through the simultaneous presence of the same owners. This solution is based on the information available from the Register of the Chamber of Commerce, which records the natural persons who are owners or company partners. The underlying consideration is that the same natural persons, present in more than one legal unit, either identify a single statistical unit or are responsible for spurious openings/closures of activities.

3) The identification of links between legal units through the analysis of the similarity of attributes. With this technique, links between legal units are reconstructed on the basis of the similarity of attributes such as enterprise name, location, economic activity, size (fiscal turnover or employment) and juridical status.
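The first technique can be illustrated with a minimal sketch that computes, for each pair of units, the share of the employees of unit “a” at one date who are found in unit “b” at the next date. The record layout (an employee id mapped to a unit id), the function names and the 0.8 cut-off are assumptions made for this example only; they are not parameters of the ASIA experiment.

```python
from collections import Counter, defaultdict

def employee_flow_shares(before, after):
    """Return {(unit_a, unit_b): share of unit_a's employees found in unit_b}.

    `before` and `after` map an employee id to the unit employing that
    person at two successive dates (hypothetical layout).
    """
    size_before = Counter(before.values())
    flows = defaultdict(int)
    for employee, unit_a in before.items():
        unit_b = after.get(employee)
        if unit_b is not None and unit_b != unit_a:
            flows[(unit_a, unit_b)] += 1
    return {pair: n / size_before[pair[0]] for pair, n in flows.items()}

def candidate_takeovers(before, after, min_share=0.8):
    """Pairs where (almost) all employees of unit_a moved to unit_b.

    min_share is an illustrative cut-off separating "spurious" flows
    from "physiological" individual job changes.
    """
    shares = employee_flow_shares(before, after)
    return sorted(pair for pair, share in shares.items() if share >= min_share)

# Toy data: all four employees of "a" move to "b"; one of two leaves "c".
before = {"e1": "a", "e2": "a", "e3": "a", "e4": "a", "e5": "c", "e6": "c"}
after = {"e1": "b", "e2": "b", "e3": "b", "e4": "b", "e5": "c", "e6": "d"}
print(candidate_takeovers(before, after))  # [('a', 'b')]
```

Unit “c” loses only half of its staff to “d”, so that movement stays below the cut-off and is read as a physiological one.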

The choice among the above techniques depends on purposes and data availability as well as on the costs and time of data processing. The technique based on employee flows has given good results in the analysis of spurious demography for medium-large enterprises, but it cannot be used for smaller enterprises and, in general, for enterprises without employees, which make up the largest part of the ASIA population.

The other techniques, even if more discretionary, are also better suited to identifying links between small enterprises.

3. The applied methodology

3.1 The data base used

The data base used for the experiment is the Fiscal Register. It records subjects (both natural and juridical) having any kind of relationship with the fiscal authority for direct and indirect tax payments.

This fiscal register is the basis for building the business register. The “statistical” concept of enterprise is tied to the existence of one or more legal units (an enterprise does not exist unless it refers to at least one legal unit), and a unit is considered legal when it is recognized by the Government and performs certain duties towards the State administration. The starting hypothesis is that the first act a legal unit carries out for its activity is the acquisition of the Fiscal Code at the tax register, and that a legal unit cannot exist without a fiscal code. The operative rule adopted in the ASIA register is therefore: the tax register is the basic universe of legal units.

But while registration in the fiscal register is a necessary condition for carrying out an economic activity, it is not a sufficient one. The real production activity may start many months after registration; the closure may be recorded late because some matters are still outstanding with the fiscal administration; a transformation of juridical status causes a closure and a new registration; and an economic entity with a single production organisation may correspond to several fiscal subjects in order to obtain accounting advantages.

In a few cases the fiscal register records links between legal units: mergers and demergers, inheritance successions and transformations of juridical status. The coverage of these registrations is limited, and relying on them alone does not capture as many of the relationships between legal units as possible. This information has nevertheless been very useful for estimating some parameters of the probabilistic linkage model applied below.

3.2 Background

The linkage procedures discussed in this paper are intended to perform exact matching between two or more records containing information that refers to the same unit. The main purpose of the matching procedure is to link legal units whose relationship is such that they represent the same enterprise. On the basis of the information available in the data, only some of these relationships can be identified. More complex relationships identifying enterprise groups are not treated here.

Within a given set of (legal) units, some units are duplications even though they may carry different identification codes; for this reason linkage by codes is not useful for this purpose and a different approach has to be tried.

To identify a statistical unit we have chosen three characteristics (in accordance with the European recommendations on continuity) that are well covered by the data base: the unit name, its address and the economic activity it carries out. Before any comparison is made, each of them is processed with standardization techniques in order to obtain components that are easy to compare. The three components are:

1st component - the unit name, distinguishing between individual/family concerns and companies. Standardization techniques are applied differently to individuals and to companies.

2nd component - the address, including the street number, as the location where the activities are carried out.

3rd component - the economic activity carried out, i.e. the 5-digit activity code (Nace Rev.1 plus the fifth digit).
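A minimal sketch of how the three components might be standardized and compared is given below. The standardization rules shown (removal of accents and punctuation, dropping of legal-form tokens for companies, sorting of name tokens for individuals), the record layout and the example values are all assumptions chosen for illustration; the rules actually applied in the experiment are not reproduced here.

```python
import re
import unicodedata

LEGAL_FORMS = {"SRL", "SPA", "SNC", "SAS"}  # illustrative, not exhaustive

def standardize(text):
    """Upper-case, strip accents and punctuation, collapse blanks."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.upper().replace(".", "")
    text = re.sub(r"[^A-Z0-9 ]", " ", text)
    return " ".join(text.split())

def standardize_name(name, is_company):
    """Apply different (assumed) rules to companies and to individuals."""
    tokens = standardize(name).split()
    if is_company:
        # Drop the legal form so that only the distinctive part is compared.
        tokens = [t for t in tokens if t not in LEGAL_FORMS]
    else:
        # Make "surname name" and "name surname" compare equal.
        tokens = sorted(tokens)
    return " ".join(tokens)

def compare(rec_a, rec_b):
    """Return the comparison vector (name, address, activity) coded 1/0."""
    name_a = standardize_name(rec_a["name"], rec_a["is_company"])
    name_b = standardize_name(rec_b["name"], rec_b["is_company"])
    return (
        int(name_a == name_b),
        int(standardize(rec_a["address"]) == standardize(rec_b["address"])),
        int(rec_a["activity"] == rec_b["activity"]),
    )

# Invented records: same person and address, different 5-digit activity code.
a = {"name": "Rossi Mario", "is_company": False,
     "address": "Via Roma, 12", "activity": "52111"}
b = {"name": "MARIO ROSSI", "is_company": False,
     "address": "via roma 12", "activity": "52112"}
print(compare(a, b))  # (1, 1, 0)
```

Exact equality after standardization keeps the outcome binary (agree/disagree), which is the coding the comparison vector of § 3.3 relies on.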

The present procedure is structured in two main steps. The first applies some of the concepts developed by Fellegi-Sunter[5] in the theory of probabilistic record linkage. More specifically, the method, to which a number of modifications have been made, uses the relative frequencies of the strings being compared. For instance, a unit name that is relatively rare among the pairs of records has more distinguishing power than a common one.
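The idea that rare strings carry more distinguishing power can be sketched as follows. The sketch approximates the probability of a chance agreement on a given value by that value's relative frequency in the file, and uses a single assumed probability `m_agree` of agreement among true matches; both are simplifications introduced here for illustration, and the parameter calculations actually used are those of Appendix 1.

```python
import math
from collections import Counter

def frequency_weights(values, m_agree=0.9):
    """Agreement weight log(m / u) for each distinct value.

    u is approximated by the relative frequency of the value in the
    file; m_agree is a hypothetical constant, not an estimated one.
    """
    counts = Counter(values)
    n = len(values)
    return {v: math.log(m_agree / (c / n)) for v, c in counts.items()}

# Invented standardized names with relative frequencies 0.6, 0.3 and 0.1.
names = ["ROSSI"] * 6 + ["BIANCHI"] * 3 + ["ZANARDELLI"]
weights = frequency_weights(names)
for name in sorted(weights, key=weights.get, reverse=True):
    print(name, round(weights[name], 3))
```

Agreement on the rarest name receives the highest weight, and agreement on the most common one the lowest.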

In the second step, the results of the probabilistic method are further processed: a set of rules, identified through empirical checks, is applied to the linked pairs of records in order to retain only some of them. The need for these corrections arises both from methodological problems and from empirical, economically based considerations. As to the former, we think the three variables do not discriminate very well; moreover, at the chosen error level and when the number of units is high (more than 1,000), the method by construction produces an increase in the number of false links. In general, specific combinations of the characteristics call for an ad hoc evaluation in terms of economic interpretation. Are two individuals (with the same surname), located at the same address (a building in the city centre) and carrying out the same economic activity (in financial intermediation), two different enterprises, or one enterprise with two legal units?

3.3 The probabilistic model: some elements

Let $L_A$ be the set of records $a$ pertaining to a population $A$ with elements $a_i$. We assume that certain pairs $(a, a')$ relate to the same unit, so that the space $A \times A = \{(a, a')\}$ is the union of two sets, Matched and Nonmatched. Given a pair of records, we want to decide whether they match or not, and this decision is based on the observed comparison of three attribute items. If $Y$ denotes the random variable given by the $2^3$ possible outcomes, coded 1/0 according to whether each attribute agrees/disagrees, we do not know a priori its probability distribution under the hypothesis $H_0$ (the distribution for matches) or $H_1$ (the distribution for nonmatches). Taking the three identification components in the order described above, let $\gamma = (\gamma_1, \gamma_2, \gamma_3)$ denote the result of the comparison, whose components are the coded agreement or disagreement on each component. The set of all realizations of $\gamma$ defines the comparison space $\Gamma$ over $A \times A$. Let $m(\gamma)$ denote the conditional probability of $\gamma$ given that $(a, a') \in$ Matched, and $u(\gamma)$ the conditional probability of $\gamma$ given that $(a, a') \in$ Nonmatched; a total weight $w(\gamma) = \log[m(\gamma)/u(\gamma)]$ is associated with each comparison pair.

The threshold values against which the total weight is compared are found by assigning a linkage rule: on observing the comparison vector associated with a pair $(a, a')$, a decision designates the pair as a link, a possible link or a nonlink. These decisions are subject to errors, and the probabilities of these errors are defined as $\mu$ and $\lambda$:

$\mu = \sum_{\gamma \in \Gamma} u(\gamma)\, P(\text{Link} \mid \gamma)$

$\lambda = \sum_{\gamma \in \Gamma} m(\gamma)\, P(\text{nonLink} \mid \gamma)$

For fixed error levels, the linkage rule minimizes the probability of failing to classify a pair as a link or a nonlink. Choosing an admissible pair $(\mu, \lambda)$ and cumulating $u(\gamma)$ and $m(\gamma)$ until the fixed error level is reached, two cut-off points (Upper and Lower) are determined and the linkage rule takes the form:

If $w(\gamma) >$ Upper, $(a, a')$ is a link

If Lower $\le w(\gamma) \le$ Upper, $(a, a')$ is a possible link

If $w(\gamma) <$ Lower, $(a, a')$ is a nonlink
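The rule above can be sketched in a few lines. The example assumes, purely for illustration, that the three components agree independently of one another, and it uses invented agreement probabilities and error levels; ties between weights and randomized decisions at the cut-offs are ignored. The parameters actually used in the experiment are those of Appendix 1.

```python
import math
from itertools import product

def pattern_probs(p):
    """Probability of each of the 2**3 agreement patterns, ASSUMING the
    three components agree independently with probabilities p[0..2]."""
    return {
        gamma: math.prod(pk if g else 1 - pk for g, pk in zip(gamma, p))
        for gamma in product((0, 1), repeat=3)
    }

def first_exceeding(order, prob, level, w):
    """Weight of the first pattern, in the given order, at which the
    cumulated probability exceeds the fixed error level."""
    cum = 0.0
    for gamma in order:
        cum += prob[gamma]
        if cum > level:
            return w[gamma]
    return w[order[-1]]

def linkage_rule(m, u, mu, lam):
    """Return (weights, Lower, Upper) for the error levels mu and lam."""
    w = {g: math.log(m[g] / u[g]) for g in m}
    descending = sorted(w, key=w.get, reverse=True)
    upper = first_exceeding(descending, u, mu, w)         # cumulate u from the top
    lower = first_exceeding(descending[::-1], m, lam, w)  # cumulate m from the bottom
    return w, lower, upper

def classify(gamma, w, lower, upper):
    if w[gamma] > upper:
        return "link"
    if w[gamma] < lower:
        return "nonlink"
    return "possible link"

m = pattern_probs((0.9, 0.8, 0.9))    # hypothetical agreement rates among matches
u = pattern_probs((0.01, 0.05, 0.1))  # hypothetical agreement rates among nonmatches
w, lower, upper = linkage_rule(m, u, mu=0.002, lam=0.05)
for gamma in sorted(w, key=w.get, reverse=True):
    print(gamma, round(w[gamma], 2), classify(gamma, w, lower, upper))
```

With these invented values the patterns (1,1,1), (1,0,1) and (1,1,0) are classified as links, (0,1,1) as a possible link and the remaining four as nonlinks.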

Full details of the parameter calculations are given in Appendix 1.

3.4 The application

The application has been carried out for one Region of Italy, whose information is contained in a file $L_A$ of 154,000 records. Both because of the particular nature of this analysis and for technical reasons, the probabilistic linkage method is applied to a subset of the complete space (AxA), i.e. to those pairs agreeing on a specific string that defines the blocking criterion: the geographical zone of the ‘Comune’, within a Province of the Region.
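Blocking can be sketched as follows: records are grouped by the blocking key and pairs are formed only within each group. The record layout and the function name are assumptions made for the example. As an order of magnitude, a file of 34,095 records (the size of the Pesaro subset discussed below) would give 34,095 × 34,094 / 2 = 581,217,465 unordered pairs without blocking.

```python
import math
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, key="comune"):
    """Yield unordered pairs of record ids that share the blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec["id"])
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)

# Toy file with two blocks (hypothetical record layout).
records = [
    {"id": 1, "comune": "Pesaro"},
    {"id": 2, "comune": "Pesaro"},
    {"id": 3, "comune": "Fano"},
    {"id": 4, "comune": "Pesaro"},
    {"id": 5, "comune": "Fano"},
]
pairs = sorted(blocked_pairs(records))
print(pairs)                        # [(1, 2), (1, 4), (2, 4), (3, 5)]
print(math.comb(len(records), 2))   # 10 pairs in the unblocked space
```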

The following results come from applying a sequence of logical steps to a subset of the experimental data file. The subset contains 34,095 records of units in the Province of Pesaro, which comprises 66 delimited geographical zones.

Step 1 – The comparison space. The probabilistic linkage procedure is applied on the basis of the frequency model. Since it is not computationally practicable to base the counts on the complete product space, the procedure is applied to the set of pairs agreeing on the chosen blocking criterion. Moreover, pairs whose comparison vectors (y) disagree on every component (0,0,0) or agree on the economic activity only (0,0,1) have been excluded from the subsequent analysis. The number of record pairs subject to a decision thus decreases to 122,353.

Step 2 – The probabilistic test procedure. The set of pairs selected by the probabilistic linkage procedure consists of 12,679 links. This result is obtained with the threshold values calculated as described in § 3.3 at an error level of 0.001. We then decided, at our own discretion, to adjust these threshold values by looking at the distribution of the total weights[6]. We expect a threshold to be more efficient when placed at the lowest point of the U-shaped curve, where the matched and nonmatched distributions are supposed to separate. These corrections slightly modify the threshold values, so that all record pairs with comparison vector (0,1,0) or (1,0,0) are excluded from the subsequent analysis. The number of links drops to 7,010.
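Placing the threshold at the lowest point of a U-shaped distribution can be mimicked with a simple histogram search, sketched below. The adjustment described above was made by looking at the distribution, so this automated version, its function name, the number of bins and the synthetic weights are all assumptions introduced for illustration.

```python
def valley_threshold(weights, n_bins=10):
    """Return the centre of the emptiest bin lying between the two
    peaks of the histogram of total weights."""
    lo, hi = min(weights), max(weights)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for w in weights:
        i = min(int((w - lo) / width), n_bins - 1)
        counts[i] += 1
    mid = n_bins // 2
    # One peak is sought in each half of the range (nonmatches left, matches right).
    left_peak = max(range(mid), key=counts.__getitem__)
    right_peak = max(range(mid, n_bins), key=counts.__getitem__)
    valley = min(range(left_peak, right_peak + 1), key=counts.__getitem__)
    return lo + (valley + 0.5) * width

# Synthetic U-shaped distribution: many low and many high weights, few in between.
sizes = [50, 30, 10, 2, 1, 3, 8, 20, 40, 30]
weights = [k + 0.5 for k, n in enumerate(sizes) for _ in range(n)]
threshold = valley_threshold(weights)
print(round(threshold, 2))  # 4.55
```

The sketch assumes one peak in each half of the weight range; a real distribution may need smoothing or a different search window before the minimum is meaningful.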

Step 3 – From links to clusters of links. Sometimes the same unit is classified in more than one link; in these cases a complex link involving all the units directly or indirectly linked to it is reconstructed and is called a cluster. A cluster can be composed of one or more linked record pairs; some links that are part of a cluster can be duplicated in another cluster. Choosing all or only some of the links in a cluster is a logical step needed to avoid duplications (the same unit linked many times to others). The number of clusters amounts to 4,155 (with 7,667 pairs).
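One common way of grouping directly and indirectly linked units is to take the connected components of the graph whose edges are the links, as in the union-find sketch below. This is an assumed reading given for illustration: in the procedure described above a link may appear in more than one cluster, so the construction actually used may differ from strict connected components.

```python
from collections import defaultdict

def clusters_from_links(links):
    """Group links into clusters of directly or indirectly linked units."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for a, b in links:
        groups[find(a)].append((a, b))
    return sorted(groups.values())

# u1-u2 and u2-u3 share unit u2, so they fall into the same cluster.
links = [("u1", "u2"), ("u2", "u3"), ("u4", "u5")]
clusters = clusters_from_links(links)
print(clusters)  # [[('u1', 'u2'), ('u2', 'u3')], [('u4', 'u5')]]
```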

Step 4 – Some logical considerations on multiple links: a reduction of the links space. Since the results of the theoretical model were not satisfactory with regard to cluster composition, this step consists of a reasoned evaluation based on the distribution of the total weights associated with the pairs inside the same cluster: the higher the weight, the stronger the link. For this reason, when there is more than one link and a dominant link exists, this best link (Dominant) is kept and the remaining ones are dropped. A dominant link may fail to be clearly identified in two situations: in large clusters where several links have high weights (Multiple Dominants), and when no dominant link emerges from the distribution of weights. In the latter case the aim is to avoid including links with a very low weight in comparison with the others; to this end, links falling in the first quartile of the weight distribution have been eliminated. The links reduction process, with the numbers involved, is summarised in Picture a below. Detailed information on the distribution of the number of links, clusters and reduced clusters by Comune is presented in Appendix 2.
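The reduction inside a single cluster might be sketched as follows. No operational definition of a dominant link is given above, so the sketch assumes one: the best link is dominant when its weight exceeds the second best by at least a fixed margin. That margin, the function name and the example weights are hypothetical, and the Multiple Dominants case is not handled.

```python
import statistics

def reduce_cluster(links, margin=2.0):
    """Reduce the links of one cluster, given as (unit_a, unit_b, weight).

    Assumed rule: keep only the best link if it is dominant; otherwise
    drop the links falling in the first quartile of the weights.
    """
    ranked = sorted(links, key=lambda link: link[2], reverse=True)
    if len(ranked) < 2:
        return ranked
    if ranked[0][2] - ranked[1][2] >= margin:
        return ranked[:1]  # a dominant link exists
    q1 = statistics.quantiles([link[2] for link in ranked], n=4)[0]
    return [link for link in ranked if link[2] > q1]

dominant = [("u1", "u2", 12.0), ("u1", "u3", 5.0), ("u1", "u4", 4.0)]
no_dominant = [("v1", "v2", 9.0), ("v1", "v3", 8.5),
               ("v1", "v4", 8.0), ("v1", "v5", 2.0)]
print(reduce_cluster(dominant))     # [('u1', 'u2', 12.0)]
print(reduce_cluster(no_dominant))  # the link with weight 2.0 is dropped
```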

Picture a

LINKs: 7,667

E      MD   D      ND     UQ
1,360  38   3,765  1,467  1,037

Manual Evaluation / Remaining Links
5,232

Number of Clusters: 4,060

Where:

E: Eliminated
MD: Multiple Dominant
D: Dominant
ND: No Dominant
UQ: Under Quartile
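The figures of Picture a can be cross-checked with simple arithmetic. The five categories add up to the 7,667 links entering Step 4, and the figure of 5,232 coincides with the sum of the Dominant and No Dominant links; reading 5,232 as D + ND is an inference from that coincidence, not something stated in the picture.

```python
picture_a = {"E": 1360, "MD": 38, "D": 3765, "ND": 1467, "UQ": 1037}

total_links = sum(picture_a.values())
dominant_plus_no_dominant = picture_a["D"] + picture_a["ND"]

print(total_links)                # 7667
print(dominant_plus_no_dominant)  # 5232
```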

Step 5 – Application of coherence rules. Even if the links inside each cluster are as correct as possible from a theoretical point of view, they clearly need an empirical check. By construction a link can be classified as correct, yet an inspection of its characteristics may show that it should be dropped. For example, two subjects carrying out a similar economic activity (e.g. consultancy or legal activities, software houses, etc.) may be located at the same address (the same building); the methodology correctly identifies them as a link (especially if they have a high weight) although no real association exists. In order to assess with a certain degree of confidence the correctness of the clusters found, a check has been carried out on 3,388 clusters (for the whole Region) identified by the previous steps. The results have been used to define coherence rules to be applied to the set of links, which allow the correct identification either of non-real demographic events (spurious demography) or of several different legal units referring to the same economic entity, the same enterprise. The check was carried out by reading the information, not organised in a standardized way, from the administrative register of the Chamber of Commerce, consulted on line. The relevant information concerned the ownership composition of the enterprise, changes of address, deaths of owners and changes of juridical status. In general these rules aim to choose among the pairs already defined as links by analysing the whole cluster (only for those clusters with more than 1 to 3 pairs). A detailed description of the coherence rules is given in Appendix 3. At the end of the application of the rules, a new unit, consisting of a single pair (a link) in a cluster, of some pairs inside a cluster, or of the whole cluster, represents the legal units joined together.
A detailed distribution by Comune and cluster composition is presented in Appendix 4.