The Standardized ILO October Inquiry 1983-2003

Remco H. Oostendorp

Free University Amsterdam

Tinbergen Institute

Amsterdam Institute for International Development

September 12, 2005

Introduction

This document describes the standardization procedure for the 1983-2003 ILO October Inquiry data. This procedure is an improved version of the procedure that has been applied to the 1983-1998 ILO October Inquiry as described in Freeman and Oostendorp (2000). The main improvements are twofold, namely an improved cleaning procedure and the use of country-specific data type correction factors. First we discuss the main characteristics of the 1983-2003 ILO October Inquiry database, and next we give a detailed account of each of the steps of the standardization procedure.

The 1983-2003 ILO October Inquiry data

Table 1 reports the number of countries that report pay data for at least one of the 161 occupations, for each year and cumulatively over all years.[1] The number of countries that report pay data for at least one occupation varies between 42 and 76 in the years 1983-2002. The number of countries reporting for 2003 is lower, at 26, because the data are preliminary and some countries are expected to report later. Even leaving 2003 aside, the average number of countries reporting in the periods 1983-1989, 1990-1999, and 2000-2002 is 67.7, 65.8 and 47.0 respectively; in other words, reporting compliance appears to be falling.

Table 1. Number of countries reporting in the 1983-2003 ILO October Inquiry

Number of Countries Reporting

Year / In Given Year / Cumulative Reporting
1983 / 58 / 58
1984 / 60 / 75
1985 / 71 / 96
1986 / 67 / 105
1987 / 76 / 115
1988 / 73 / 121
1989 / 69 / 124
1990 / 75 / 129
1991 / 69 / 130
1992 / 61 / 132
1993 / 67 / 140
1994 / 62 / 145
1995 / 76 / 151
1996 / 65 / 157
1997 / 67 / 158
1998 / 56 / 158
1999 / 60 / 158
2000 / 56 / 159
2001 / 42 / 160
2002 / 43 / 163
2003 / 26 / 163
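
The period averages quoted in the text can be reproduced directly from the "In Given Year" column of Table 1. The following is a minimal sketch in Python; the dictionary and function names are illustrative and are not part of the October Inquiry files.

```python
# Countries reporting in each year, copied from Table 1.
reporting = {
    1983: 58, 1984: 60, 1985: 71, 1986: 67, 1987: 76, 1988: 73, 1989: 69,
    1990: 75, 1991: 69, 1992: 61, 1993: 67, 1994: 62, 1995: 76, 1996: 65,
    1997: 67, 1998: 56, 1999: 60, 2000: 56, 2001: 42, 2002: 43, 2003: 26,
}


def period_average(first_year, last_year):
    """Mean number of reporting countries over an inclusive span of years."""
    counts = [reporting[year] for year in range(first_year, last_year + 1)]
    return sum(counts) / len(counts)


for first, last in [(1983, 1989), (1990, 1999), (2000, 2002)]:
    print(f"{first}-{last}: {period_average(first, last):.1f}")
# prints 67.7, 65.8 and 47.0 for the three periods
```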

Table 2 gives a detailed description of the information in the 1983-2003 ILO October Inquiry. It should be noted that the reported numbers are only for observations from sources that the ILO classifies as of acceptable/good or excellent quality (see Freeman and Oostendorp 2001). The numbers also reflect only those observations that were retained after extensive cleaning (the data cleaning procedure is described below).

Panel A gives information on the size of the sample. It shows the maximum conceivable number of observations that the Inquiry would contain if each country reported a single wage statistic for each occupation in every year: over 400,000 pieces of data.[2] The actual number of observations is smaller, largely because most countries do not report statistics in many years. On average, countries report wages for 8.5 of the 21 possible years. Even if we ignore 2003, which has preliminary data for only 26 countries, more than half of the country x year observations are empty. In addition, countries do not report data for every occupation in the years when they do report. The bottom line is that there are 90,772 country x year x occupation cells with wage data in the 1983-2003 file.

However, many countries report more than one wage for a single occupation. Some give hourly wage rates and average earnings. Others give wages for men and wages for women. Others give wages for one gender and for both genders. Nearly half of the observations (47.4%) contain multiple wage figures. While this will help us to calibrate the data into a standardized format, it makes the raw data difficult to use in cross-country comparisons, particularly since different countries report pay differently. Including multiple wages, there are 157,021 pieces of data.
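
The sample-size accounting in Panel A of Table 2 can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in Table 2; the variable names are illustrative.

```python
COUNTRIES = 137     # countries with sources of acceptable quality or better
YEARS = 21          # 1983 through 2003
OCCUPATIONS = 161

# One wage statistic per country, year and occupation.
max_cells = COUNTRIES * YEARS * OCCUPATIONS

missing_country_year = 275_954   # country did not report in the given year
missing_occupation = 96_471      # occupation missing in a year with a report
actual_cells = max_cells - missing_country_year - missing_occupation

cells_with_multiples = 42_988    # cells holding more than one wage figure
extra_figures = 66_249           # additional figures in those cells
total_figures = actual_cells + extra_figures

# Each missing country x year pair removes one cell per occupation.
reported_country_years = COUNTRIES * YEARS - missing_country_year / OCCUPATIONS
avg_years_reported = reported_country_years / COUNTRIES
share_multiple = cells_with_multiples / actual_cells

print(max_cells, actual_cells, total_figures)   # 463197 90772 157021
print(f"{avg_years_reported:.1f} years, {share_multiple:.1%} with multiples")
```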

Panel B shows the frequency distribution of countries by the number of occupations they report; and the frequency distribution of occupations by the number of countries that report statistics on them. The distribution of countries by number of occupations shows that in most countries there are sufficient occupations with wage data to get a good measure of the overall wage structure. It also shows, however, that different countries report on different numbers of occupations, which creates problems in comparing wage structures across countries. The distribution of occupations by country shows that many occupations have wage data for large numbers of countries, which will allow us to contrast labor costs and living standards for workers in the same occupation around the world.

Panel C shows the diverse way in which countries report wages. Most countries report wage rates, presumably from employer surveys or collective bargaining contracts or

Table 2. Types of observations contained in the October Inquiry computer files, 1983-2003

Number
A. SAMPLE SIZE
Maximum conceivable observations / 463,197
Observations missing because country did not report in given year / 275,954
Observations missing because occupation missing in year country reported / 96,471
Actual year/country/occupation observation / 90,772
Observations with multiple figures / 42,988
Multiple figures / 66,249
Total, including all multiple observations / 157,021
B. COUNTRIES AND OCCUPATIONS WITH AT LEAST ONE REPORTED WAGE STATISTIC
Countries with reported wage statistic for different numbers of occupations
No. of occupations / No. of countries (totaling 137)
<30 / 7
30-59 / 16
60-79 / 15
80-99 / 13
100-119 / 30
120-139 / 22
140+ / 34
Occupations with one reported wage statistic for different numbers of countries
No. of countries reporting on occupations / No. of occupations (totaling 161)
<59 / 18
60-79 / 35
80-99 / 45
100-119 / 52
120+ / 11
C. ACTUAL OBSERVATIONS
Pay concept
Wage rates (142 countries) / 99,050
Earnings (95 countries) / 57,971
Averaging concept
Mean / 110,298
Minimum / 33,012
Maximum / 5,300
Average of min-max / 192
Prevailing / 4,654
Median / 2,961
Other / 15
Missing / 589
Period concept
Monthly / 102,202
Hourly (see note) / 27,756
Weekly / 15,462
Daily / 8,386
Annual / 2,737
Fortnight / 457
Other / 20
Missing / 1
Sex
Male workers / 63,434
Male and female workers / 59,354
Female workers / 34,233
Coverage
Whole country / 131,872
Part of country / 25,149

Source: Tabulated from ILO October Inquiry computer files, 1983-2003. The total number of countries is 137 (compare with the 163 countries reporting in table 1) because the reported numbers are only for observations from sources that the ILO classifies as of acceptable/good or excellent quality (see Freeman and Oostendorp 2001). Note. The hourly figures include a small number of observations which concern hours paid for, and another small number which concern wages relating to hours worked.

legislated pay schedules. However, many report earnings, which may come from household surveys. Most give statistics in the form of averages[3] but 21 percent report minimum wages, some from collective bargaining contracts. Some countries report maximum wages. Others give prevailing wages. The US reports median weekly earnings for most occupations (from individual reports on the Current Population Survey). The time period to which the pay refers also varies. The most common period is the month, followed by the hour, but some countries report weekly pay, others give daily rates for some occupations, and so on. There is also variation by gender. Forty percent of the observations relate to male workers, 38 percent to all workers, and 22 percent to female workers.

Combining all of these different variants, the vast majority of the Inquiry statistics are non-comparable. Just 7.8 percent relate simultaneously to the most common pay concept (wage rates), use the most common averaging concept (mean), cover the most common time span (monthly), relate to the most common gender (male) and cover the whole country.[4]

Standardization procedure for the 1983-2003 ILO October Inquiry data

Because of the nonstandard nature of the database we use a standardization procedure to make the data comparable across occupations, countries and time. This procedure is an improved version of the standardization procedure developed and described by Freeman and Oostendorp (2000, 2001), with an improved cleaning procedure and the application of country-specific data type correction factors. The remainder of this section provides a detailed account of each of the steps in the standardization procedure.

Step 1. Data cleaning

Data cleaning was done through the inspection of data plots, with wages (in local currency units) on the vertical axis and year of reporting on the horizontal axis for each country x occupation pair.[5] Although we cleaned all data, here we describe the outcome of the cleaning procedure only for observations from sources that the ILO classifies as of acceptable/good or excellent quality. The following figure shows one example of a plot for the occupation “Quarryman” for Belgium.

The plot clearly shows that the reported wage observation for 1985 is an outlier. Further inspection of the raw data showed that this problem was not caused by variation in the averaging concept (e.g. maximum wage reported instead of minimum wage), (obvious) miscoding of the period concept (e.g. annual wages reported instead of monthly wages), gender wage differences, or differences in location (i.e. regions within the country) from which wages were reported. In this case it was therefore decided to drop the outlier in order to preserve the obvious time pattern in the data.

The following figure shows a similar plot for the occupation “Deep-sea fisherman” for Suriname:

Because wages were reported with different averaging concepts, minimum wages are plotted as squares, maximum wages as pluses, and average wages as triangles. The plot clearly shows that the reported wages for 1990 are outliers, and further inspection of the raw data showed that this was due to an obvious miscoding of the period concept for 1990 (monthly wages were reported instead of weekly wages). Therefore the period concept was recoded to monthly wages for 1990 and the obvious time pattern was restored.

In total 14,402 country x occupation pairs were inspected. For a few country x occupation pairs it was obvious that there was no logical time pattern in the reported data and all the observations of the country x occupation pair were deleted from the database. The following plot for “Motor bus driver” for Puerto Rico illustrates this case. Here only minimum wages are reported for the period 1985-1995, but both minimum wages (marked by squares) and maximum wages (marked by pluses) are reported for 1996-2002. However, there is no logical time pattern in the reported minimum wage, although one may speculate that the reported wages for the period 1985-1995 are actually maximum wages. This is not obvious, however, and it was therefore decided to drop all the observations from this country x occupation pair.

After inspecting wages across years for each country x occupation pair, two further data checks were performed. First, wages were inspected across occupations for each country x year pair. If an occupational wage was smaller or larger than the average occupational wage within a country x year pair by a factor of ten or more, the observation was checked further and corrected if necessary. Second, the average wage within a country x year pair was compared to GDP per capita for that country and year. If the ratio of the average wage to GDP per capita was very low or very high (in the lower or upper 1% of the distribution), these wage observations were checked further and corrected if necessary.
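
The two cross-checks can be sketched as simple flagging rules. The Python below is an illustration only and not the code used for the cleaning: the function names and the dictionary layout are hypothetical, and flagging at least one pair in each tail is an assumption made so that small samples still return a result. Flagged observations are candidates for manual inspection, not automatic deletion.

```python
from statistics import mean


def flag_tenfold(wages_by_occupation, factor=10.0):
    """Occupations whose wage differs from the mean occupational wage of one
    country x year pair by a factor of `factor` or more, in either direction."""
    avg = mean(wages_by_occupation.values())
    return sorted(
        occ for occ, wage in wages_by_occupation.items()
        if wage >= factor * avg or wage <= avg / factor
    )


def flag_gdp_ratio(ratio_by_pair, tail=0.01):
    """Country x year pairs whose ratio of average wage to GDP per capita lies
    in the lowest or highest `tail` share of all pairs."""
    ranked = sorted(ratio_by_pair, key=ratio_by_pair.get)
    k = max(1, int(len(ranked) * tail))  # assumption: flag at least one per tail
    return set(ranked[:k]) | set(ranked[-k:])


# Made-up wages for one country x year pair: twenty ordinary occupations
# plus one suspiciously high and one suspiciously low figure.
wages = {f"occupation_{i}": 100.0 for i in range(20)}
wages["suspect_high"] = 5000.0
wages["suspect_low"] = 5.0
print(flag_tenfold(wages))  # ['suspect_high', 'suspect_low']
```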

In total 1,082 wage observations were removed and 6,210 wage observations were recoded, or 0.69% and 3.95% of the total respectively. Hence 4.64% of all wage observations were affected by the cleaning procedure.

Step 2. Data construction

Before estimating the data type correction factors (see Freeman and Oostendorp 2000, 2001 for a discussion of the methodology for estimating data correction factors), a number of steps were taken to standardize the data as much as possible.

Recoding of footnotes

Many wage observations are reported with footnotes, such as “Average per hour”, “Auckland”, “Both sexes” or “Large hotels”. Each of these footnotes has been coded as far as possible, using the variables listed in the following table. The part of the footnote that could not be coded within y0-y10 was retained in the string variable ftn.

Table 3. Variables in raw data set
Name / Description
y0 / year
y1 / country code
y2 / city or region code
y3 / industry code
y4 / occupation code
y6 / pay or hours of work concept code
y7 / sex code
y8 / range code
y9 / period concept code
y10 / averaging concept code
ftn / footnote
x / data

Range data

In a few cases the wages are reported in the form of ranges. We computed the midpoint of the range and used it as the wage for the category.

Minimum-maximum data

If a pair of wages has been reported as ‘minimum-maximum wage’, the smaller has been recoded as ‘minimum wage’ and the larger as ‘maximum wage’.
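
The range and minimum-maximum recodings amount to two small transformations, sketched below in Python. The function names and the dictionary keys are hypothetical and serve only to illustrate the rules stated above.

```python
def range_midpoint(low, high):
    """Wage assigned to an observation reported as a range."""
    return (low + high) / 2.0


def split_min_max(wage_a, wage_b):
    """Recode a 'minimum-maximum' pair: the smaller figure becomes the
    minimum wage and the larger figure the maximum wage."""
    return {"minimum": min(wage_a, wage_b), "maximum": max(wage_a, wage_b)}


print(range_midpoint(100, 140))   # 120.0
print(split_min_max(900, 650))    # {'minimum': 650, 'maximum': 900}
```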

Construction of monthly data

The wage observations have been recalculated on a monthly basis as far as possible, using dimensional analysis[6] and the reported hours of work. The number of hours was not always reported for the given occupation (y4), pay concept (y6), sex (y7), year (y0) and city/region (y2). In such cases the next-best available hours of work figure was assigned. The following table lists the alternative sources of hours of work data in the lexicographic order in which they were tried.

Table 4. Lexicographic assignment of hours of work
Lexicographic order / Hours of work assigned from
1 / same occupation (y4), pay concept (y6), sex (y7), year (y0) and city/region (y2)
2 / other sex (y7) for given occupation (y4), pay concept (y6), year (y0) and city/region (y2)
3 / other city/region (y2) for given occupation (y4), pay concept (y6), sex (y7) and year (y0)
4 / other year (closest) (y0) for given occupation (y4), pay concept (y6), sex (y7) and city/region (y2)
5 / other pay concept (y6) for given occupation (y4), sex (y7), year (y0) and city/region (y2)
6 / other occupation (average) (y4) for given pay concept (y6), sex (y7), year (y0) and city/region (y2)

The above lexicographic ordering has been chosen because the variation in hours of work can be attributed in increasing order of magnitude to variation in sex (y7), city/region (y2), year (y0), pay concept (y6) and occupation (y4). In the few remaining cases where the above lexicographic assignment did not yield an estimate of hours of work, the country-average or, if not available, the world-average of hours of work was used.
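
The lexicographic assignment of Table 4 can be read as a fallback search that relaxes one dimension at a time. The sketch below is an illustration under stated assumptions and not the original code: the record layout and names are hypothetical, the records are assumed to belong to a single country, and averaging over several matches at levels other than the year level is an assumption (Table 4 specifies an average only for the occupation level). The country-average and world-average fallbacks are left to the caller.

```python
from statistics import mean

KEYS = ("occupation", "pay_concept", "sex", "year", "region")

# Dimension allowed to differ at each level, in the order of Table 4.
FALLBACK_ORDER = (None, "sex", "region", "year", "pay_concept", "occupation")


def assign_hours(target, records):
    """Return (level, hours) from the first Table 4 level that has a match,
    or None when no level yields an hours-of-work figure."""
    for level, relaxed in enumerate(FALLBACK_ORDER, start=1):
        fixed = [key for key in KEYS if key != relaxed]
        matches = [r for r in records if all(r[k] == target[k] for k in fixed)]
        if not matches:
            continue
        if relaxed == "year":
            # Level 4: take the record from the closest available year.
            best = min(matches, key=lambda r: abs(r["year"] - target["year"]))
            return level, best["hours"]
        # Assumption: average over all matching records at this level.
        return level, mean(r["hours"] for r in matches)
    return None


records = [
    {"occupation": 1, "pay_concept": "wage", "sex": "F", "year": 1990,
     "region": "all", "hours": 40},
    {"occupation": 1, "pay_concept": "wage", "sex": "M", "year": 1988,
     "region": "all", "hours": 44},
]
target = {"occupation": 1, "pay_concept": "wage", "sex": "M", "year": 1990,
          "region": "all"}
print(assign_hours(target, records))  # (2, 40): hours taken from the other sex
```

In this example the hours for women in 1990 are preferred over the hours for men in 1988, which reflects the ordering described above: hours of work are taken to vary less across sexes than across years.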

Removing idiosyncratic data types

A number of wage observations were removed from the data set because their exact data type was unspecified or too idiosyncratic. For the period concept, wage observations with a missing or ‘other’ period concept (such as per shift or per piece) were dropped. For the averaging concept, wage observations with a missing or ‘other’ (unspecified) averaging concept were also dropped. We also removed the wage observations that were reported as the average of minimum and maximum wages because there are only a few of them (see table 2) and it is not a common averaging concept. We retained wage observations with the median averaging concept only for the US, because most of the reported median wages are from the US (about 75%) and inclusion of median wages for the other countries led to an implausible estimate of the data correction factor for median wages.[7]
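
The removal rules above can be summarized as a single filter. The following Python sketch is illustrative only; the string labels for the period and averaging concepts and the dictionary keys are hypothetical stand-ins for the coded variables y9 and y10.

```python
# Hypothetical labels for the coded period and averaging concepts.
DROPPED_PERIODS = {None, "other"}
DROPPED_AVERAGING = {None, "other", "average of min-max"}


def keep_observation(obs):
    """Return True if a wage observation survives the removal rules."""
    if obs["period"] in DROPPED_PERIODS:
        return False
    if obs["averaging"] in DROPPED_AVERAGING:
        return False
    # Median wages are retained for the US only.
    if obs["averaging"] == "median" and obs["country"] != "US":
        return False
    return True


sample = [
    {"country": "US", "period": "weekly", "averaging": "median"},
    {"country": "NZ", "period": "weekly", "averaging": "median"},
    {"country": "NZ", "period": "other", "averaging": "mean"},
    {"country": "NZ", "period": "hourly", "averaging": "mean"},
]
print([keep_observation(obs) for obs in sample])  # [True, False, False, True]
```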

As a result of the above data construction steps, we have two pay concepts (62.30% wage rates, 37.70% earnings), five averaging concepts (71.71% mean, 20.04% minimum, 3.19% maximum, 3.12% prevailing, 1.94% median), two period concepts (94.73% monthly, 5.27% daily), and three sex concepts (41.73% male, 36.38% male and female, 21.90% female). In the following section we discuss the procedure to standardize data reported under these different concepts.

Step 3. Estimation of data type correction factors

The next step is to estimate data correction factors for the 1983-2003 ILO October Inquiry data following the procedure discussed in Freeman and Oostendorp (2000). Note first that because data concepts can occur in combination with each other, this gives potentially 60 data correction factors: 2 types of pay concepts, 5 types of averaging concepts, 2 types of period concepts, and 3 types of sex concepts (2 x 5 x 2 x 3 = 60). Also the impact on wages of each of these (combinations of) data concepts could vary across countries (and even across regions within countries), occupations, and years. Hence, there are a large number of potential correction factors that need to be estimated.

This problem of heterogeneity of the data correction factors was discussed in Freeman and Oostendorp (2000). It was noted that the variation in the October Inquiry is too ‘thin’ to estimate all potential data correction factors for all data types and that it is necessary to simplify the procedure. Here too we assume that the different data types affect wages separately rather than interactively (reducing the number of data concepts to be considered from 60 combinations to 12 separate categories: 2 + 5 + 2 + 3 = 12). Nor do we estimate data correction factors that vary across time or occupation, assuming for instance that the gender wage gap is constant across time and occupations within a country.[8] The reason we make these simplifying assumptions is that we think that the largest source of variation in the data correction factors can be found across countries rather than across time or occupations.
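
The difference between the interactive and the separate treatment of data types is a matter of counting. The short Python sketch below enumerates the concepts retained after Step 2; the labels are illustrative.

```python
from itertools import product

concepts = {
    "pay": ["wage rate", "earnings"],
    "averaging": ["mean", "minimum", "maximum", "prevailing", "median"],
    "period": ["monthly", "daily"],
    "sex": ["male", "male and female", "female"],
}

# Interactive treatment: one case per combination of concepts.
interactive = list(product(*concepts.values()))

# Separate treatment: one category per value of each concept.
separate = sum(len(values) for values in concepts.values())

print(len(interactive), separate)  # 60 12
```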

In Freeman and Oostendorp (2000, 2001) data correction factors were estimated that varied by region or income rank of the country but not by country. Here we go one step further and estimate country-specific data correction factors as much as possible. Because data correction factors turn out to be highly variable across countries, the proposed standardization procedure can be seen as an important refinement of the earlier procedures. An additional refinement is that we also use country-specific occupational dummies to estimate the standard wage and hence we no longer assume that the occupational wage structure or ranking is similar across countries.

A number of issues need to be addressed when estimating country-specific data correction factors. First, some or all of the data correction factors may not be estimable because they are not identified for lack of variation in the data at the country level. If wages in one country are only reported as minimum wages, then it will not be possible to estimate the average wage in this country. Or if average wages are only reported for female workers, and prevailing wages only for male workers, then it is not possible to identify the data correction factors for the averaging and sex concepts separately. Second, there might be variation in the data but some of the data types may have been sparsely reported. For instance, in some countries wages are mostly reported as minimum wages and only in a few instances as average wages. Third, the estimated data correction factor may be implausible. If wages have been reported as daily wages in some instances, and if the estimated data correction factor for the period concept implies that daily wages are forty times lower than monthly wages, then this is implausible, as the number of working days per month is at most 31.[9]
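
The first and third issues lend themselves to simple diagnostic checks on a country's data, sketched below in Python. This is an illustration and not the procedure actually applied: the function names and record layout are hypothetical, the pairwise test detects only the case in which two concepts always change together (a sufficient but not necessary sign of non-identification), and the lower bound of one on the monthly-to-daily ratio is an added assumption.

```python
def varies(observations, concept):
    """True if the concept takes more than one value in a country's data."""
    return len({obs[concept] for obs in observations}) > 1


def confounded(observations, a, b):
    """True if concepts a and b always change together, so that their
    separate correction factors cannot be told apart."""
    pairs = {(obs[a], obs[b]) for obs in observations}
    values_a = {pair[0] for pair in pairs}
    values_b = {pair[1] for pair in pairs}
    return len(values_a) > 1 and len(pairs) == len(values_a) == len(values_b)


def daily_ratio_is_plausible(monthly_over_daily, max_days=31):
    """A monthly wage cannot exceed `max_days` daily wages; the lower bound
    of one is an assumption added here."""
    return 1.0 <= monthly_over_daily <= max_days


# The example from the text: average wages only for women and prevailing
# wages only for men.
country = [
    {"averaging": "mean", "sex": "female"},
    {"averaging": "prevailing", "sex": "male"},
]
print(confounded(country, "averaging", "sex"))  # True
print(daily_ratio_is_plausible(40))             # False
```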