
Y. Dikhanov, Decomposition of Inequality Based on Incomplete Information

Draft

Comments appreciated

Decomposition of Inequality Based on Incomplete Information

A contributed paper to the IARIW 24th General Conference

Lillehammer, Norway, August 18-24, 1996

Yuri Dikhanov

Statistical Advisory Services

International Economics Department, IECDD

The World Bank

1818 H Street, N.W.

Room N2-038

Washington, D.C. 20433 U.S.A.

phone:(202)458-2667

fax: (202)522-3669

e-mail:

Abstract

In this paper, the author examines five measures of inequality: the Gini coefficient, two entropy (Theil) indexes, normalized variance and decile ratio. It is shown how to decompose these indexes into intra-group and between-group inequalities. These indexes are used to study inequalities in the former Soviet republics in 1990. This study is based on incomplete information on income intervals (only income boundaries and population shares have been used). The robustness of the approximating procedure (piecewise polynomial interpolation of the cumulative distribution function) is discussed. Two alternative representations of the Gini coefficient are discussed as well.

The views presented in this paper are the author’s and do not necessarily represent those of the World Bank or its Board of Directors.

I. Introduction

Analysis of income or wealth distribution often includes decomposing inequality for the total population into between-group and within-group inequalities. Not all inequality measures are decomposable, and not all of the decomposable ones are decomposable in the same way. Theoretically, the second Theil measure (T2) probably has the best properties. The Gini index, however, is the most widely cited measure. In this paper we attempt to decompose the Gini index in a meaningful way (see Section IV).

The Gini index, along with the two Theil measures, normalized variance and decile coefficient, was then used to analyze income inequalities in the former Soviet Union and its Republics in 1990 (see Section II and Annex). We found that the share of inter-group inequality in total inequality was in the range of 7.7-15.8 percent, depending on the index. As inputs into this exercise, we used official data on intervals: seven interval boundaries and population shares within these boundaries for each of the former Soviet Republics.

To process these discrete data we used interpolation with a polynomial of order four on each interval. These polynomials are chosen to be twice continuously differentiable at all points of the distribution, which allows differential and integral operations with a distribution function and its derivatives in explicit form. Section III discusses the robustness of this procedure using two numerical examples: a “bad” one, a hypothetical mixture of two normal distributions with different means and variances, presented as five income intervals (quintiles); and a “good” one, a log-normal distribution, presented as ten intervals. As expected, in the “good” case the precision of the approximation is one to two orders of magnitude better than in the “bad” case (0.004-0.39 percent depending on the parameter versus 0.2-1.3 percent).

Section V discusses two alternative graphical and analytical representations of the Gini coefficients that are based on the original distribution function rather than on the Lorenz curve.

II. Decomposition of income inequalities in the former Soviet Union.

There were two major reasons we used the former Soviet Union data from 1990: first, the data were available (there were not many countries where regional inequality data were collected on a regular and comparable basis); and, second, since 1990 the former Soviet Republics have become independent countries, and as economies in transition, they attract the special attention of academics and policy makers.

Original information included boundaries and population shares for seven intervals (see Table 1 below). To process this information we used a version of our Gini ToolPak.

Table 1. Original data on income distribution shares in the former Soviet Union for 1990

The overall results can be assessed from Figure 2 of the Annex, which presents normalized values of the various inequality measures (inequality indexes normalized by their standard deviations). As we can see, the lowest inequality was observed in Belarus and Ukraine, followed by Estonia, Latvia and Lithuania. That the Baltics had higher inequality than Belarus and Ukraine has to be attributed to the fact that, although minimum wages were the same in all of these former republics, the means were higher in the Baltics. Russia had a higher income inequality than these economies, which is to be expected given her size. A factor that additionally increased the inequality for Russia was the relatively high prices (and hence wages) in Siberia. The highest inequality was registered for Azerbaijan and the Central Asian states (Uzbekistan, Kyrgyzstan, Turkmenistan and Tajikistan). The results for Azerbaijan are not obvious given the much lower numbers for neighboring Armenia and Georgia.

Another piece of information that Figure 2 provides relates to the correlation between the indexes. We can see that, in general, all the indexes for this set of countries produce highly consistent results. Table 2 of the Annex provides correlation coefficients. One of the highest values is observed for the Theil 1 - Theil 2 pair: $r^2$=0.9964. By absolute value, the difference between them is around 2 percent, which can be seen as a measure of the deviation of the actual distribution from the log-normal one (as we know, under the assumption of a log-normal distribution, the two Theil indexes coincide). For some economies the deviation between the two Theil indexes is insignificant, 0.1-0.2 percent, though a part of that can be attributed to the fact that the approximation errors go in opposite directions. The two Theil indexes and the Gini coefficient are even more tightly correlated: $r^2$=0.9979-0.9987. A very high correlation was also registered for the Theil 1 - decile ratio pair: $r^2$=0.9980. Tight correlation is also observed for the Theil 2 - decile ratio pair: $r^2$=0.9932. The lowest value of the correlation coefficient is registered for the variance - decile ratio pair: $r^2$=0.9908, which is still very high. The overall conclusion is that all these inequality measures produce coherent results.

Table 1 of the Annex provides the results of actual estimations. Shares of inter-group variance presented in the table are of special significance for this paper. The two Theil indexes and variance display similar results in the 12.9-15.8 percent range. The share of inter-group variances for the Gini coefficient, on the other hand, is only 7.7 percent, which is roughly half of those for other measures. One has to bear in mind, however, that the ways these indexes decompose are different, and, thus, are not directly comparable. The two Theil indexes, for example, produce identical results only under the assumption of “log-normality” of the distribution. However, shares of inter-group variances will still be different because they are aggregated with different weights (income and population shares, respectively). The inter-group results produced using the second Theil index (0.0170) can be compared to those estimated by H. Theil (1989, Development of International Inequality, Journal of Econometrics, Vol. 42, No. 1, North-Holland). For 1985, he found the inequality between the OECD countries (without Australia) to be 0.0859; for tropical America, 0.0580; for tropical Asia, 0.2003; and for tropical Africa, 0.1871. Figure 5 of the Annex provides a graphical representation of the Theil index for combined distribution versus the between-group Theil index.

Figure 3 of the Annex presents density functions of income distribution in the former republics. It is interesting to note that the Estonian distribution has slight irregularities in the upper part of the distribution. This might indicate urban/rural or Tallinn/rest of the country income differentials[1]. More likely, a factor that might have contributed to that situation was the advance of reforms in Estonia: in 1990 this country had the highest share of non-agricultural private sector in the former Soviet Union, which provided much higher salaries than the state sector.

Figure 4 of the Annex is a histogram on a logarithmic scale. It shows shares of population within proportional boundaries (the next boundary is in proportion to the previous one). It has to be noted that in this case the highest point would not be the mode as in a distribution density function, but the mean. Using this type of histogram requires, however, some compliance with the assumption of “log-normalness” of the distribution.

Table 2 of the Annex presents income shares by decile. That Azerbaijan had the highest inequality and Belarus and Ukraine had the lowest can be directly inferred from this table.

III. Robustness of the computational procedure

For this exercise the Gini ToolPak was used. In this section we briefly explore the robustness of the procedure, using two numerical examples: a “good” case (ten income intervals for a log-normal distribution) and a “bad” case (five intervals, i.e., quintile data, for a mixture of two normal distributions with different means and variances).

The essence of the procedure (polynomial interpolation) is the following:

Let’s assume that we are given only a set $\{F(Y_i)\}$ of M elements, which describes the values that the cumulative distribution function takes at the points $Y_i$. We need to approximate all other points of the distribution, i.e., to estimate $F(y)$ for $y \in [0, +\infty)$. Within each interval $[Y_i, Y_{i+1}]$, we will interpolate the distribution function by a polynomial of order 4 in the form:

$$P_i(y) = \sum_{k=0}^{4} a_{ik}\, y^k$$

At the boundaries the polynomials are exact, and are not interpolations: i.e., $P_i(Y_i) = F(Y_i)$ and $P_i(Y_{i+1}) = F(Y_{i+1})$.

These polynomials are chosen to be twice continuously differentiable across the boundaries. This is a very important property, because it allows differential and integral operations with F and its derivatives in explicit form. For example, the mean of the distribution would be calculated as follows: $\mu = \sum_{i=1}^{M} \int_{Y_i}^{Y_{i+1}} y\, P_i'(y)\, dy$, where M is the number of intervals. Other characteristics of the distribution function can be derived in a similar way.
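The explicit integration mentioned above can be sketched in a few lines. This is a minimal illustration, not the Gini ToolPak itself: it assumes each interval's polynomial is stored as a list of coefficients `coeffs[k]` multiplying `y**k` (a hypothetical representation), and it integrates `y * P'(y)` over the interval in closed form. The toy input is a uniform distribution on [0, 2], whose CDF `y/2` is trivially a polynomial.

```python
def mean_contribution(coeffs, lo, hi):
    """Closed-form integral of y * P'(y) over [lo, hi], P(y) = sum(coeffs[k] * y**k)."""
    total = 0.0
    for k, a in enumerate(coeffs):
        if k == 0:
            continue  # the constant term has a zero derivative
        # y * d/dy(a * y**k) = k * a * y**k, whose antiderivative is k*a*y**(k+1)/(k+1)
        total += a * k * (hi ** (k + 1) - lo ** (k + 1)) / (k + 1)
    return total


def mean_from_pieces(pieces):
    """Sum the per-interval contributions; pieces = [(coeffs, lo, hi), ...]."""
    return sum(mean_contribution(coeffs, lo, hi) for coeffs, lo, hi in pieces)


# Toy check (an assumption for illustration): uniform on [0, 2], CDF = y / 2,
# written as two order-4 pieces with all higher coefficients equal to zero.
uniform_piece = [0.0, 0.5, 0.0, 0.0, 0.0]
pieces = [(uniform_piece, 0.0, 1.0), (uniform_piece, 1.0, 2.0)]
toy_mean = mean_from_pieces(pieces)
print(toy_mean)
```

The same pattern extends to other moments: replacing `y` by `y**2` in the integrand only shifts the exponent in the antiderivative.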

Errors of estimation in polynomial interpolation

Using logic similar to that behind the remainder term of Taylor formula in Lagrange form, we arrive at the following expression for estimation errors[2]:

In the case of normal (standard) distribution the above boils down to:

Or, in the case when the intervals are separated by σ/2, we obtain that the biggest errors will be in the interval [0.5σ, σ] (as can be seen from the first-order condition), and the errors in this interval are expressed as follows:

A. “Good” case

As a “good” case, we used ten income intervals for the log-normal distribution LN(5,0.25).

The results are presented in the table below. Graphical results are presented in Figure 1.

As can be seen from the graph, the actual distribution cannot be readily distinguished from the simulation. The largest difference is for the mode, which is notoriously difficult to get.

Measure / Actual value / Simulation / Difference
Mean Income / 153.12 / 153.09 / -0.02%
Gini-coefficient / 0.14032 / 0.14023 / -0.06%
Median Income / 148.41 / 148.41 / 0.00%
Mode Income / 139.42 / 139.97 / 0.39%
Variance / 38.887 / 38.923 / 0.09%
Income less than mean / 0.5497 / 0.5494 / -0.06%
Theil index (T1) / 0.03125 / 0.03123 / -0.07%
Theil index (T2) / 0.03125 / 0.03126 / 0.03%

Figure 1. Deviation of simulation from actual distribution: a "good" case
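The "actual" column of the table can be cross-checked against the standard closed-form properties of the log-normal distribution. The sketch below assumes that LN(5, 0.25) means a log-mean of 5 and a log-standard-deviation of 0.25; that reading is an assumption, chosen because it is the one consistent with the tabulated values. Under the same assumption, the figure tabulated as 38.887 in the Variance row matches the standard deviation of income rather than its variance.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lognormal_summary(m, s):
    """Closed-form characteristics of a log-normal with log-mean m and log-sd s."""
    mean = math.exp(m + s * s / 2.0)
    return {
        "mean": mean,
        "median": math.exp(m),
        "mode": math.exp(m - s * s),
        "std_dev": mean * math.sqrt(math.exp(s * s) - 1.0),
        "gini": 2.0 * std_normal_cdf(s / math.sqrt(2.0)) - 1.0,
        "share_below_mean": std_normal_cdf(s / 2.0),
        "theil": s * s / 2.0,  # T1 and T2 coincide for a log-normal
    }


good_case = lognormal_summary(5.0, 0.25)
for name, value in good_case.items():
    print(f"{name:>17}: {value:.5f}")
```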

B. “Bad” case

As a “bad” case, we used five income intervals for the mixture of two normal distributions N(40,10) and N(60,5). The results are presented in the tables below. Graphical results are presented in Figure 2. As can be seen from the graph, the actual distribution is visually readily distinguishable from the simulation. The largest difference is again for the mode.

Inputs into the procedure

Interval boundaries / Quintiles of population
< 37.4696 / Quintile I
37.4696 to 48.10972 / Quintile II
48.10972 to 56.60144 / Quintile III
56.60144 to 61.47081 / Quintile IV
> 61.47081 / Quintile V

Results of the simulation

Measure / Actual value / Simulation / Difference
Mean Income / 50.00 / 49.67 / -0.7%
Median Income / 53.33 / 53.23 / -0.2%
Mode Income / 59.64 / 58.87 / -1.3%
Income less than mean / 43.20% / 42.82% / -0.9%

Figure 2. Deviation of simulation from actual distribution: "bad" case
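The inputs of the "bad" case can be regenerated, approximately, from the mixture itself. The sketch below rests on two assumptions that the text does not state outright: the two components carry equal weights (suggested by the reported mean of 50), and the second parameters of N(40,10) and N(60,5) are standard deviations. The quintile boundaries are found by plain bisection on the mixture CDF; the function names are illustrative.

```python
import math


def normal_cdf(x, mean, sd):
    """CDF of a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def mixture_cdf(y):
    """Equal-weight mixture of N(40, sd=10) and N(60, sd=5); the weights are assumed."""
    return 0.5 * normal_cdf(y, 40.0, 10.0) + 0.5 * normal_cdf(y, 60.0, 5.0)


def mixture_quantile(p, lo=-100.0, hi=200.0, tol=1e-10):
    """Invert the monotone mixture CDF by bisection."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


quintile_bounds = [mixture_quantile(p) for p in (0.2, 0.4, 0.6, 0.8)]
median_income = mixture_quantile(0.5)
share_below_mean = mixture_cdf(50.0)
print(quintile_bounds, median_income, share_below_mean)
```

Under these assumptions the boundaries should land close to the four values in the input table, and the median and the share of population below the mean close to the "actual" column of the results table.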

IV. Decomposition of inequality measures

IV.1. Decomposition of the Gini coefficient

Let’s consider a distribution F defined by its cumulative distribution function $F(y)$. The respective distribution density function is $F'(y)$. The mean of that distribution is defined as $\mu = \int y\, dF(y)$, using Lebesgue-Stieltjes integrals. (Hereinafter a plain integral sign denotes integration from 0 to $+\infty$.) Then the essence of the Gini coefficient can be seen from the graph of the Lorenz curve (see Figure 3).

Figure 3. Lorenz curve

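The Gini coefficient is defined as twice the area between the 45° line and the Lorenz curve. Or, with $L(F)$ denoting the Lorenz curve as a function of the cumulative population share F:

$$G = 1 - 2\int_0^1 L(F)\, dF$$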

Let’s consider two distributions F1 and F2, defined by their respective cumulative distribution functions $F_i(y)$. The respective distribution density functions are $F_i'(y)$. Means are defined as $\mu_i = \int y\, dF_i(y)$. Thus, we can define the Gini coefficients $G_i$ for the respective functions as follows:

(1)

Then, for the combined distribution we can write:

(2)

where:

- income share of the i distribution

pi - population share

$\mu$ - mean income for the combined distribution

Or, after some simple operations we obtain:

(3)

Expression (3) is obtained as follows:

:

It is easy to see how the above expression can be expanded for a multi-component case:

The above expression can be rewritten as follows:

(4)

It is also easy to see that the Gini coefficient can be expressed through the covariance:

and the combined Gini-coefficient can be written as:

(5)

Or,

The first component stands for intra-group covariances, whereas the second stands for inter-group covariance.

As we can see from expression (3), the Gini - coefficient for the combined distribution consists of two parts: intra-group and inter-group variances. Similar to the Theil coefficient T1, the individual Gini - coefficients are added up with income weights.
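A numerical check helps make the two-part structure concrete. The sketch below is not a transcription of expression (3); it uses one identity that is consistent with the description above, starting from the representation $G = \frac{1}{\mu}\int F(1-F)\,dy$ and the fact that the combined CDF is the population-weighted sum of the group CDFs. The identity splits G into an income-weighted sum of group Gini coefficients plus $\frac{1}{\mu}\int \sum_i p_i (F_i - F)^2\,dy$, a term that vanishes only when all group distributions coincide. All function names and the sample data are hypothetical.

```python
from bisect import bisect_right


def ecdf(sorted_sample, grid):
    """Empirical CDF of a sorted sample evaluated at each grid point."""
    n = len(sorted_sample)
    return [bisect_right(sorted_sample, y) / n for y in grid]


def integrate_steps(values, grid):
    """Exact integral of a right-continuous step function over the grid."""
    return sum(values[k] * (grid[k + 1] - grid[k]) for k in range(len(grid) - 1))


def gini_from_cdf(cdf, grid, mean):
    """G = (1/mean) * integral of F(1 - F) dy."""
    return integrate_steps([f * (1.0 - f) for f in cdf], grid) / mean


def decompose_gini(groups):
    """Return (total, within, between) for a list of income lists."""
    pooled = sorted(y for g in groups for y in g)
    grid = sorted(set(pooled))
    mu = sum(pooled) / len(pooled)
    F = ecdf(pooled, grid)
    within = 0.0
    spread = [0.0] * len(grid)
    for g in groups:
        p = len(g) / len(pooled)          # population share
        mu_g = sum(g) / len(g)
        F_g = ecdf(sorted(g), grid)
        within += p * mu_g / mu * gini_from_cdf(F_g, grid, mu_g)  # income weight
        spread = [s + p * (fg - f) ** 2 for s, fg, f in zip(spread, F_g, F)]
    between = integrate_steps(spread, grid) / mu
    return gini_from_cdf(F, grid, mu), within, between


def gini_pairwise(values):
    """Reference value: mean absolute difference divided by twice the mean."""
    n, mu = len(values), sum(values) / len(values)
    return sum(abs(a - b) for a in values for b in values) / (2.0 * n * n * mu)


groups = [[10.0, 20.0, 30.0, 45.0], [25.0, 60.0, 80.0]]
total, within, between = decompose_gini(groups)
print(total, within, between)
```

Because the group and combined CDFs are step functions on a shared grid, the integrals are exact, so the two parts add up to the pairwise Gini of the pooled sample up to floating-point error.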

IV.2. Decomposition of entropy (Theil) indexes

In his book, H. Theil (1967, Economics and Information Theory, North-Holland, Amsterdam), introduced, for income inequality measurement, the entropy measure used in thermodynamics and information theory. He suggested using the entropy index in two forms: as income-weighted and population-weighted entropy indexes. In this paper we will call them T1 and T2 respectively.

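These indexes can be represented as follows:

$$T_1 = \sum_i \frac{Y_i}{Y} \ln\frac{Y_i/Y}{N_i/N}, \qquad T_2 = \sum_i \frac{N_i}{N} \ln\frac{N_i/N}{Y_i/Y}$$

where,

Yi is income of group i;

Ni is number of people in group i;

Y and N are the corresponding totals over all groups.

Or, using Lebesgue-Stieltjes integrals:

$$T_1 = \int \frac{y}{\mu} \ln\frac{y}{\mu}\, dF(y), \qquad T_2 = \int \ln\frac{\mu}{y}\, dF(y)$$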

As can be shown, these indexes are easily decomposable in the multi-group case. For the Theil index T1 we have:

where:

Yij is income of sub-group j of group i;

Nij is number of people in sub-group j of group i;

Yi is income of group i;

Ni is number of people in group i

Or, using Lebesgue-Stieltjes integrals:

The Theil index T1 decomposes into:

T2 decomposes in a similar way with the population weights p.

As has been shown by F. Bourguignon (1979, Decomposable Income Inequality Measures, Econometrica, Vol. 47, No. 4) and A. F. Shorrocks (1980, Inequality Measures, Econometrica, Vol. 48, No. 3), the Theil indexes are the only income-weighted and population-weighted indexes, respectively, that can be decomposed in that way: i.e., into a weighted sum of individual Theil indexes and the Theil index constructed of individual distributions as if they were elements of the combined distribution. In this sense, the decomposition of the Theil indexes is different from that of the Gini.
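The decomposition described in this section can be verified on a small grouped data set. The sketch below uses the standard textbook forms of the two indexes in terms of group income totals and head counts (taken here as an assumption about the notation): T1 weights the log ratio of income share to population share by income shares, and T2 weights the inverse ratio by population shares. The groups "A" and "B" and their figures are invented for illustration.

```python
import math


def theil_indexes(incomes, populations):
    """Return (T1, T2) from total incomes Y_i and head counts N_i of the cells."""
    Y, N = sum(incomes), sum(populations)
    t1 = sum((y / Y) * math.log((y / Y) / (n / N)) for y, n in zip(incomes, populations))
    t2 = sum((n / N) * math.log((n / N) / (y / Y)) for y, n in zip(incomes, populations))
    return t1, t2


# Hypothetical data: each group holds (sub-group incomes Y_ij, sub-group head counts N_ij).
data = {
    "A": ([100.0, 300.0], [10.0, 10.0]),
    "B": ([500.0, 700.0, 400.0], [5.0, 20.0, 10.0]),
}

all_incomes = [y for ys, _ in data.values() for y in ys]
all_pops = [n for _, ns in data.values() for n in ns]
Y, N = sum(all_incomes), sum(all_pops)

# Total inequality over all sub-groups taken together.
t1_total, t2_total = theil_indexes(all_incomes, all_pops)

# Between-group part: every group is treated as one element of the combined distribution.
t1_between, t2_between = theil_indexes(
    [sum(ys) for ys, _ in data.values()], [sum(ns) for _, ns in data.values()]
)

# Within-group part: income weights for T1, population weights for T2.
t1_within = sum(sum(ys) / Y * theil_indexes(ys, ns)[0] for ys, ns in data.values())
t2_within = sum(sum(ns) / N * theil_indexes(ys, ns)[1] for ys, ns in data.values())

print(t1_total, t1_within + t1_between)
print(t2_total, t2_within + t2_between)
```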

IV.3. Decomposition of normalized variance

Normalized variance can be seen as a simple way of describing income inequalities.

Or,

IV.4. Decomposition of decile ratio

The decile ratio is a simple and transparent inequality measure; however, it cannot be meaningfully decomposed.

IV.5. Lorenz curve

The Lorenz function L is the function of income shares on population shares. The Lorenz curve associated with this function is plotted in Figure 3. The Lorenz curve plays an enormous role in income distribution analysis. Some important relationships between the Lorenz curve and the cumulative distribution function, as well as a graphical representation of the Theil index, are shown below.

Figure 4. First derivative of the Lorenz curve

Figure 4 shows the first derivative of the Lorenz curve, $L'(F)$. It can be easily seen that $L'(F)$ is essentially the normalized income $y/\mu$, and, thus, is the (normalized) inverse of the cumulative distribution function. The graph is also related to the Theil (T2) index: the logarithm of this graph is a graphical representation of the index, because the index can be presented as $T_2 = -\int_0^1 \ln L'(F)\, dF$.

Figure 5. Graphical representation of the Theil index (T2)
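For a finite sample the relationship between the slope of the Lorenz curve and normalized income holds exactly, which makes it easy to check. The sketch below builds the Lorenz curve of a small invented sample, takes the slope of each segment as a discrete stand-in for the first derivative, and recovers the T2 index as minus the average log slope. The sample values and function names are assumptions for illustration.

```python
import math


def lorenz_points(incomes):
    """Return sorted incomes and the Lorenz curve points (population share, income share)."""
    ys = sorted(incomes)
    n, total = len(ys), sum(ys)
    points, cumulative = [(0.0, 0.0)], 0.0
    for k, y in enumerate(ys, start=1):
        cumulative += y
        points.append((k / n, cumulative / total))
    return ys, points


incomes = [12.0, 30.0, 18.0, 40.0, 25.0]
ys, points = lorenz_points(incomes)
mu = sum(ys) / len(ys)

# Slope of each Lorenz segment: the discrete counterpart of the first derivative.
slopes = [
    (points[k + 1][1] - points[k][1]) / (points[k + 1][0] - points[k][0])
    for k in range(len(ys))
]

# T2 as minus the average log slope, and directly as the mean of ln(mu / y).
t2_from_slopes = -sum(math.log(s) for s in slopes) / len(slopes)
t2_direct = sum(math.log(mu / y) for y in ys) / len(ys)
print(slopes, t2_from_slopes, t2_direct)
```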

The second derivative of the Lorenz curve is also an important characteristic of a distribution: $L''(F) = y'_F/\mu = 1/(\mu F'(y))$. It is essentially the reciprocal of the distribution density function $F'(y)$, scaled by the mean.

Figure 6. Second derivative of the Lorenz curve

IV.6. Some properties of log-normal distribution

Log-normal distribution plays an important role in inequality measurements. It is thought that real distributions of wealth and income at least partially can be approximated by it. An extensive treatment of the log-normal distribution is contained in J. Aitchison and J.Brown (1957, The Lognormal Distribution, Cambridge University Press). Here we mention just a few relevant properties.

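A convenient feature of the log-normal distribution is the simplicity of the Gini and Theil indexes. With $\sigma^2$ denoting the variance of the logarithm of income, the first Theil index is:

$$T_1 = \frac{\sigma^2}{2}$$

And, in the case of the second Theil index, we can obtain the following expression:

$$T_2 = \frac{\sigma^2}{2}$$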

We can use the test of T1=T2 to examine how close a given distribution approaches a log-normal one.
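A small experiment illustrates how such a test behaves. The sketch below evaluates both indexes on evenly spaced quantiles of two distributions: a log-normal with log-mean 5 and log-standard-deviation 0.25 (the parameters assumed for the "good" case), and an exponential distribution chosen only as a contrasting, clearly non-log-normal example. Using quantiles instead of random draws is a design choice that keeps the result deterministic.

```python
import math
from statistics import NormalDist


def theil_pair(values):
    """Return (T1, T2) for a list of individual incomes."""
    n = len(values)
    mu = sum(values) / n
    t1 = sum((y / mu) * math.log(y / mu) for y in values) / n
    t2 = sum(math.log(mu / y) for y in values) / n
    return t1, t2


n = 20000
probs = [(k + 0.5) / n for k in range(n)]  # midpoints of n equal-probability cells

std_normal = NormalDist()
lognormal_sample = [math.exp(5.0 + 0.25 * std_normal.inv_cdf(u)) for u in probs]
exponential_sample = [-math.log(1.0 - u) for u in probs]

ln_t1, ln_t2 = theil_pair(lognormal_sample)
ex_t1, ex_t2 = theil_pair(exponential_sample)
print("log-normal :", ln_t1, ln_t2)
print("exponential:", ex_t1, ex_t2)
```

For the log-normal quantiles the two indexes should nearly coincide, while for the exponential ones they differ visibly, which is the behaviour the T1=T2 test relies on.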

The relationship of the Theil measures to normalized variance can be expressed as follows:

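In the log-normal case, we can also think of the Theil indexes as the difference between the logarithms of the mean and the median: $T = \ln \mu - \ln y_{med}$, where $y_{med}$ is the median income.

And, finally, as can be easily seen, the Gini coefficient for the log-normal distribution can be written as follows:

$$G = 2\,\Phi\!\left(\frac{\sigma}{\sqrt{2}}\right) - 1$$

where $\Phi$ is the standard normal cumulative distribution function and $\sigma$ is the standard deviation of the logarithm of income.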