On Data Integration
Work Package 4–Case studies
23-05-2011
First Steps in Profiling Italian Patenting Enterprises
Daniela Ichim, Giulio Perani, Giovanni Seri
{ichim,perani,seri}@istat.it
Version 1.0
ISTAT, Italian National Statistical Institute
Via C. Balbo, 16
00184, Rome
Italy
Deliverable No: DI WP4-IT
Abstract
The paper describes the record linkage scheme followed at the Italian national statistical institute to match micro-data on patent applications from the international database PATSTAT with the data available from the Italian Official Business Register (ASIA).
The target data in PATSTAT are the applicants based in Italy who registered one or more patents in the period 1985-2010. Patent applicants can be ‘individuals’ or ‘establishments’. Within the latter category we aim at identifying the business enterprises that were active (as recorded in ASIA) in the period 1998-2008. The desired output of the linkage process is, for each patenting enterprise, a pair consisting of the ‘applicant identification code in PATSTAT’ and the ‘enterprise identification number in ASIA’. The latter gives access to the repositories of official statistical data and therefore allows economic data to be linked to patenting enterprises. Statistical analyses, such as identifying the determinants of patenting propensity or evaluating the impact of patenting on enterprise profitability, can then be performed.
On the methodological side, the linkage of patent data has to rely on the applicants’ names. Consequently, considerable effort was put into the pre-processing phase in order to standardise the applicant/enterprise names and to extract the ‘legal form’ from the name string. During the linkage process, two practical problems were faced: the reduced number of comparison variables and the huge size, in terms of number of records, of the Italian Business Register. These issues were addressed within a rule-based deterministic record linkage approach. In this paper, together with the results obtained, we illustrate the main features of the sequential searching and linkage methodology we adopted.
Table of contents
Introduction
1. A general description of patenting administrative flows
2. Data sources
3. Data pre-processing and standardisation
4. The record linkage process
5. Preliminary results
6. Conclusions and future plans
References
Introduction
In this report we describe a preliminary stage of an Istat project aiming, mainly, at monitoring and profiling Italian patenting enterprises. A complete characterisation of such enterprises might allow, for example, the survey frame list of potential Research and Development (R&D) performers to be updated, and might facilitate the investigation of specific subpopulations such as biotech enterprises. Moreover, from a statistical analysis point of view, the identification of patenting enterprises enables them to be linked to their structural characteristics. Thus, the factors influencing the patenting propensity of enterprises might be studied, as well as the economic impact of patenting activity.
The preliminary stage we are concerned with in this report is the design of a strategy aiming at the unambiguous identification of Italian patenting enterprises.
This document is divided into six sections. Section 1 gives a general description of patenting administrative flows. Section 2 discusses the selection of the databases to work with and briefly describes them. Section 3 reports details on the applied standardisation procedure. The record linkage methodology, as applied to these particular datasets, is illustrated in section 4; due to the reduced number of comparison variables and to the huge amount of data we had to deal with, emphasis is put there on search space reduction methods. Finally, section 5 presents the obtained results. Some conclusions and ideas for further implementations, analyses and research are given in the last section.
1. A general description of patenting administrative flows
A patent is an exclusive right granted for an invention, which is a product or a process that provides, in general, a new way of doing something, or offers a new technical solution to a problem. In order to be patentable, the invention must fulfill certain conditions. Namely, it must be of practical use; it must show an element of novelty, that is, some new characteristic which is not known in the body of existing knowledge in its technical field. This body of existing knowledge is called "prior art". The invention must show an inventive step which could not be deduced by a person with average knowledge of the technical field. Finally, its subject matter must be accepted as "patentable" under law.
A patent is granted by a national patent office or by a regional office that does the work for a number of countries, such as the European Patent Office. Under such regional systems, an applicant requests protection for the invention in one or more countries, and each country decides as to whether to offer patent protection within its borders. A patent provides protection for the invention to the owner of the patent. The protection is granted for a limited period.
A patent owner has the right to decide who may - or may not - use the patented invention for the period in which the invention is protected. The patent owner may give permission to, or license, other parties to use the invention on mutually agreed terms. The owner may also sell the right to the invention to someone else, who will then become the new owner of the patent. Once a patent expires, the protection ends, and an invention enters the public domain, that is, the owner no longer holds exclusive rights to the invention, which becomes available to commercial exploitation by others.
The first step in securing a patent is the filing of a patent application. The patent application generally contains the title of the invention, as well as an indication of its technical field; it must include the background and a description of the invention.
There are 3 main actors in any administrative patenting flow: the inventor, the owner and the applicant. A special feature of a patenting process is that the inventor, the owner and the applicant might be different subjects (each referring to one or more entities).
A special case of the inventor-owner-applicant relationship is provided by patents whose original idea was ‘born’ in enterprises where the general manager (head of the company) is also the owner of the enterprise. Sometimes the manager is the patent owner, while in other cases the patent owner is the enterprise itself. In both cases, the inventor might be a completely different person (for example a researcher employed by the enterprise), and so might the applicant (for example a notary’s office offering patenting services).
2. Data sources
To our knowledge, the most complete and up-to-date database on patents is the European Patent Office (EPO) “Worldwide Patent Statistical Database”, called PATSTAT. Much of the raw data in PATSTAT is extracted from the EPO's master bibliographic database DOCDB, also known as the EPO Patent Information Resource. PATSTAT is updated twice a year (April and October). It is a relational database containing 20 tables with more than 70 million records (63 million patent applications) from over 80 countries. Other sources on patents either concern only regional applications (like Ufficio Italiano Brevetti e Marchi) or offer only data extraction and analysis services.
PATSTAT mainly registers information on patent applications. To reach our goal (identification of Italian patenting enterprises), in this work we concentrated only on the two tables depicted in Figure 1. The link between them is given by the unique values of the field Application Number or, alternatively, Publication Number. The Application Number also contains the year of registration of the patent. The time period covered by the database is 1985-2010. There is no explicit database field concerning the legal form of the inventor, owner or applicant. PATSTAT registers both the inventor and the applicant name; only the latter was used in this work. The legal form, if any, has to be extracted from those names. For the applicant, PATSTAT also registers the address (street, city, postal code) and the country code. Only applicants based in Italy, i.e. COUNTRY_CODE = “IT”, were selected from the PATSTAT tables. At this stage of the work, the postal code was used as geographical location, assuming it has the same accuracy as the address. We plan to rely on the detailed address to assess the linkage quality or to elucidate special cases like those in which the applicant is the manager (or owner) of an enterprise. These aspects are not reported in this document.
For each patent, PATSTAT registers its IPC (International Patent Classification) code and its application and publication numbers. It is worth noting that a patent may be assigned more than one IPC code. Indeed, while in the second table of Figure 1 each record corresponds to a unique Application Number (about 70000), in the first table the number of records is around 300000. Moreover, it should be stressed that there is no formal, well-defined relationship between IPC codes and the principal economic activity classification (NACE).
Figure 1. Used database tables from PATSTAT; COUNTRY_CODE = “IT”.
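The selection and join described above can be illustrated with a minimal sketch. The table contents and field names below are hypothetical and do not reproduce the actual PATSTAT schema; the sketch only shows how applicants with COUNTRY_CODE = “IT” might be selected and how several IPC codes attach to one Application Number.

```python
# Hypothetical miniature of the two tables of Figure 1.
# Field names and values are illustrative assumptions, not the PATSTAT schema.
ipc_table = [
    {"application_number": "APP0001", "ipc": "A61K"},
    {"application_number": "APP0001", "ipc": "C07D"},
    {"application_number": "APP0002", "ipc": "B60R"},
    {"application_number": "APP0003", "ipc": "H04L"},
]
applicant_table = [
    {"application_number": "APP0001", "applicant_name": "ROSSI FARMACEUTICI S.P.A.",
     "zip_code": "00184", "country_code": "IT"},
    {"application_number": "APP0002", "applicant_name": "MUELLER GMBH",
     "zip_code": "80331", "country_code": "DE"},
    {"application_number": "APP0003", "applicant_name": "BIANCHI S.R.L.",
     "zip_code": "20121", "country_code": "IT"},
]


def select_applications(applicants, ipc_rows, country="IT"):
    """Keep applicants of one country and attach all IPC codes of each application."""
    # Group the IPC codes by Application Number (one application, many codes).
    ipc_by_application = {}
    for row in ipc_rows:
        ipc_by_application.setdefault(row["application_number"], []).append(row["ipc"])
    selected = []
    for row in applicants:
        if row["country_code"] == country:
            codes = ipc_by_application.get(row["application_number"], [])
            selected.append(dict(row, ipc_codes=codes))
    return selected


italian_applications = select_applications(applicant_table, ipc_table)
```

In this toy example two of the three applications are retained, and the first one carries two IPC codes, which is why the first table has many more records than the second.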
The most recent versions of PATSTAT also include a standardized version of the applicant name[1]. In our case study, this standardization was ignored because it is not fully suited to Italian enterprise names and because the standardized name still includes the legal form. Moreover, as our goal is to link the patent applications to an enterprise register, the same standardization process has to be applied to the selected enterprise register too.
Applicants may be classified as individuals or establishments. The latter, according to the Frascati manual, see OECD (2002), could be: business enterprises, public institutions, non-profit institutions and private or public universities. In this work, the aim is the identification of patenting enterprises. The complete classification of applications will be performed in later stages. The distinction between business enterprises and natural persons could be supported by a catalogue of Italian first names. Istat provides such a list, stemming from surveys on the population register. Alternatively, lists of Italian first names are available for download online. According to preliminary investigations, other data sources potentially helpful in profiling the Italian patent applications might concern the general managers of large enterprises and academic researchers.
Additional details on PATSTAT may be found at
Many registers on enterprises might be available in Italy, with different degrees of accessibility. From the Istat point of view, the most important is surely ASIA (Archivio Statistico delle Imprese Attive). ASIA is developed, updated and maintained through the statistical integration of different administrative sources covering the entire population of enterprises in industry and services (Tax Register, Register of Enterprises and Local Units, Social Security Register, Work Accident Insurance Register, Register of the Electric Power Board), of other minor archives covering particular sectors, and of the structural business statistics currently produced by Istat. ASIA is a business register used at many stages of business surveys, e.g. as a sampling frame and for post-stratification and calibration.
Among the variables included in ASIA, one may specify:
a) Enterprise Identification Number (an Istat internal identification code allowing linkage to any economic information collected by Istat on the same unit); this identification code is unique for each enterprise
b) Enterprise Name
c) Zip Code
d) NACE code
e) Geographical information (address, municipality, province, region)
f) Legal form
Other variables that could be useful are the Fiscal Code, Number of employees and Turnover.
It should be observed that only Enterprise Name and Zip Code overlap with the information contained in PATSTAT.
Given the reliability and availability of ASIA, only enterprises that were active in the period 1998-2008 were taken into consideration. Consequently, in this work, it was assumed that an enterprise was active during the year in which it applied for a patent. In this report we refer only to the selection of ASIA corresponding exclusively to active enterprises.
To give an idea of the numerical complexity of the problem, Table 1 reports the number of active enterprises for each year. The last column gives the percentage of active enterprises with more than 1 employee, which is almost constant at around 40%. As may be observed, the union of the different yearly versions of ASIA contains more than 47 million records.
YEAR / Thousands of active enterprises / % of active enterprises with more than 1 employee
1998 / 3871 / 40
1999 / 3950 / 40
2000 / 4223 / 40
2001 / 3992 / 40
2002 / 4323 / 40
2003 / 4327 / 40
2004 / 4367 / 40
2005 / 4458 / 40
2006 / 4484 / 40
2007 / 4554 / 40
2008 / 4577 / 40
Table 1. Statistics on the number of active enterprises in Italy, period 1998-2008.
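As a quick check of the order of magnitude, the yearly figures of Table 1 can simply be added up. The sketch below assumes that the "union of versions" means stacking the yearly files, so that an enterprise active in several years is counted once per year.

```python
# Thousands of active enterprises per year, copied from Table 1.
active_enterprises_thousands = {
    1998: 3871, 1999: 3950, 2000: 4223, 2001: 3992, 2002: 4323, 2003: 4327,
    2004: 4367, 2005: 4458, 2006: 4484, 2007: 4554, 2008: 4577,
}

# Stacking the eleven yearly files gives the total number of records.
total_thousands = sum(active_enterprises_thousands.values())
total_records = total_thousands * 1000
print(total_thousands)  # 47126, i.e. a little over 47 million records
```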
Given that our goal is to identify the patent applicants, it was considered useful and efficient to concentrate first on lists of enterprises showing a high research and innovation propensity. To this aim, we took into consideration the survey frame of the Research and Development (R & D) survey, which is conducted yearly by Istat. Only the 2006 and 2007 waves were available in a standardized form. The 2006 R & D data file contains 26237 records, while the 2007 data file contains 16730 records. Since ASIA is the sampling frame for any business survey conducted at Istat, the information included in the R & D survey frames is similar to that contained in ASIA. When using the R & D survey frames, we only assumed that, due to the innovation propensity of the enterprises they list, the linkage probability would be higher for them than for the entire business register ASIA.
3. Data pre-processing and standardisation
PATSTAT counts 299769 applications identified by an Application Number and a Publication Number; the latter is redundant information and was therefore ignored in this work. The number of Italian applications reduces to 72037. Each Application Number is assigned an applicant name (and id code) and a Zip Code. Additional information may be derived from these variables: year of application, year of first/last application by applicant, number of patent applications filed by each applicant, and region of residence of the applicants.
The variable Applicant Name was subject to the following standardisation operations:
1. extraction of the application year, recorded in a new variable “ApplicationYear”
2. transformation of all letters into upper case
3. removal of:
- punctuation
- accents
- symbols and special characters (e.g. ‘$’, ‘%’, ‘/’, ‘*’)
- double spaces (replaced by a single space)
- dots (e.g. L.T.D. transformed into LTD)
4. standardisation of known abbreviations to a unique value, typically an Italian word (e.g. we found about 150 ways of writing “in short”)
5. standardisation of the most frequent words using a deterministic record linkage procedure in Relais, see Istat (2011):
a) input files: a file of words (sequences of characters separated by a blank character) with frequencies greater than 1000 was compared against a file of words with frequencies greater than 100 but smaller than 1000;
b) parameters: comparison function = “Edit distance”; threshold = 0.8; greedy algorithm to perform the one-to-one assignment;
c) output check: the word pairs declared “match” were subject to a clerical review;
d) standardization: the 122 pairs declared equivalent were standardized in the same way; they generally concerned singular – plural or Italian – English versions of the same words;
e) examples: TERMOIDRAULICA – TERMOIDRAULICI; SOLUTION – SOLUTIONS; MULTISERVICE – MULTISERVIZI
6. removal of duplicated words within the same name (every repeated occurrence of a word was removed), so that each word appears only once in a name, e.g. AAA BB AAA CCCCC was transformed into AAA BB CCCCC
7. ordering of the words in alphabetical order, e.g. CC BB AA was transformed into AA BB CC
8. identification and removal of the legal form, if any. Information on the legal form was stored in a standardized manner in another variable, called Legal Form. About 80 ways of expressing 6 main standardized legal forms were identified. The 6 main legal form categories are ‘SPA’, ‘SRL’, ‘SAS’, ‘SNC’, ‘COOP’ and ‘NONE’.
The resulting variable is called Standardized Name.
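A minimal sketch of part of this standardisation is given below. It is not the procedure actually implemented at Istat: the legal-form dictionary is a small hypothetical subset of the roughly 80 variants mentioned above, and the extraction of the year, the abbreviation table and the Relais-based standardisation of frequent words are omitted. The legal form is separated before sorting, which yields the same result as removing it afterwards.

```python
import re
import unicodedata

# Hypothetical subset of the legal-form spellings (after removal of dots).
LEGAL_FORM_TOKENS = {
    "SPA": "SPA", "SRL": "SRL", "SAS": "SAS", "SNC": "SNC",
    "COOP": "COOP", "COOPERATIVA": "COOP",
}


def standardise_name(raw_name):
    """Return (standardised_name, legal_form) for a raw applicant name."""
    name = raw_name.upper()
    # Remove accents: decompose the characters and drop the combining marks.
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    # Dots are deleted, so that "S.P.A." becomes "SPA".
    name = name.replace(".", "")
    # Remaining punctuation, symbols and special characters become blanks.
    name = re.sub(r"[^A-Z0-9 ]", " ", name)
    # split() also collapses repeated blanks.
    words = name.split()
    # Keep only the first occurrence of each word.
    unique_words = list(dict.fromkeys(words))
    # Separate the legal form (first one found) from the remaining words.
    legal_form = "NONE"
    kept = []
    for word in unique_words:
        if word in LEGAL_FORM_TOKENS:
            if legal_form == "NONE":
                legal_form = LEGAL_FORM_TOKENS[word]
        else:
            kept.append(word)
    # Alphabetical ordering of the remaining words.
    return " ".join(sorted(kept)), legal_form


print(standardise_name("Officine Meccaniche  Rossi & C. S.p.A."))
# ('C MECCANICHE OFFICINE ROSSI', 'SPA')
```

Sorting the words makes the comparison insensitive to word order, at the price of losing the original reading of the name.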
Then, some additional variables were derived from the standardized Applicant Name:
a) the Standardised Name, without abbreviations, duplications, etc.
b) the standardised Applicant Name, without some very common words, e.g. ITALIA
c) acronyms and abbreviations
d) number of characters
e) longest and shortest words
f) length of the longest/shortest words
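Some of these derived variables could be computed as in the following sketch. The list of common words, the definition of the acronym (initial letters of the words) and the choice of counting blanks in the number of characters are assumptions made for illustration; the report does not specify them.

```python
# Illustrative list: the report only gives ITALIA as an example of a common word.
COMMON_WORDS = {"ITALIA"}


def derive_variables(standardised_name):
    """Compute a few auxiliary variables from a standardised name."""
    words = standardised_name.split()
    informative = [w for w in words if w not in COMMON_WORDS]
    longest = max(words, key=len) if words else ""
    shortest = min(words, key=len) if words else ""
    return {
        "name_without_common_words": " ".join(informative),
        # Assumption: the acronym is built from the initial letters.
        "acronym": "".join(w[0] for w in words),
        # Assumption: blanks are counted in the length of the name.
        "n_characters": len(standardised_name),
        "n_words": len(words),
        "longest_word": longest,
        "shortest_word": shortest,
        "longest_word_length": len(longest),
        "shortest_word_length": len(shortest),
    }


variables = derive_variables("FRENI ITALIA MECCANICA")
```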
Since, in this stage, only enterprises should be subject to any linkage process, universities and known public administrations were eliminated from the file (those records were identified as names containing words like “UNIVERSITY”, “POLITECNICO”, etc.).
Except for standardisation operations 1 and 8, the same pre-processing was applied to ASIA. Operations 1 and 8 are not necessary since ASIA already contains information on year and legal form of enterprises. Additionally, the same unique standard values identified when performing operation 5 on PATSTAT were used also for ASIA.
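The standardisation of frequent words (edit distance, threshold 0.8, greedy one-to-one assignment) was carried out in Relais. The sketch below only illustrates the idea in plain Python: the normalisation of the edit distance by the length of the longer word is an assumption, and Relais may define its similarity differently.

```python
def levenshtein(a, b):
    """Classical edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1 minus the distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def greedy_one_to_one(frequent, less_frequent, threshold=0.8):
    """Pair words greedily by decreasing similarity, using each word once."""
    candidates = [(similarity(x, y), x, y) for x in frequent for y in less_frequent]
    candidates = [c for c in candidates if c[0] >= threshold]
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    used_frequent, used_less_frequent, pairs = set(), set(), []
    for _, x, y in candidates:
        if x not in used_frequent and y not in used_less_frequent:
            pairs.append((x, y))
            used_frequent.add(x)
            used_less_frequent.add(y)
    return pairs


pairs = greedy_one_to_one(
    ["TERMOIDRAULICA", "SOLUTIONS", "MULTISERVICE", "ITALIA"],
    ["TERMOIDRAULICI", "SOLUTION", "MULTISERVIZI", "MECCANICA"],
)
```

With this normalisation the three example pairs quoted in the report exceed the 0.8 threshold, while ITALIA and MECCANICA remain unpaired. In the actual procedure the candidate pairs were then reviewed clerically.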
In this linkage stage, the comparison variables are the only three variables shared by PATSTAT and ASIA: Standardised Name, Zip Code and Legal Form (for PATSTAT, derived from the Applicant Name).
Finally, the PATSTAT data file was deduplicated by treating as duplicates those records having the same values for all three comparison variables mentioned above. Thus, the number of records reduced from 72037 to 23833. It should be noted that records in ASIA are supposed to be unique, each enterprise being assigned a unique identification code, i.e. a key number. This unique identification number allows the enterprise to be traced in any business survey conducted by Istat.
In figure 1, histograms of the length of the Standardized Name and of the number of words (sequences of characters separated by a blank) in the standardized PATSTAT database are shown. It may be observed that, on average, the Standardized Name has a length of 15 characters, while the mean number of words in a name is 2.2.
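The deduplication rule can be sketched as follows. The field names are hypothetical, and keeping the first record of each group is an assumption, since the report does not say which record is retained.

```python
def deduplicate(records, keys=("standardised_name", "zip_code", "legal_form")):
    """Keep one record per combination of the three comparison variables."""
    first_seen = {}
    for record in records:
        key = tuple(record[k] for k in keys)
        # Assumption: the first record met for each key is the one retained.
        first_seen.setdefault(key, record)
    return list(first_seen.values())


# Hypothetical records: the first two share all three comparison variables.
applications = [
    {"application_number": "APP0001", "standardised_name": "FRENI ROSSI",
     "zip_code": "00184", "legal_form": "SPA"},
    {"application_number": "APP0007", "standardised_name": "FRENI ROSSI",
     "zip_code": "00184", "legal_form": "SPA"},
    {"application_number": "APP0009", "standardised_name": "FRENI ROSSI",
     "zip_code": "20121", "legal_form": "SPA"},
]
unique_applicants = deduplicate(applications)
```

Records that differ only in the Zip Code are kept apart, so the same enterprise may still appear more than once if it filed applications from different locations.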
Figure 1. Distribution of both length of Standardized Name and number of words in a name, PATSTAT database.
In table 2, the distribution of the variable Legal Form is shown. It may be observed that for almost 40% of the records no legal form was identified, while the majority of the records, about 56%, is concentrated in the categories “SPA” and “SRL”.