Son Preference in Indian Families, S. Gaudin
SUPPLEMENT 2
STABILITY TESTS FOR SMALL SAMPLE PCA
Although principal components analysis (PCA) is widely recognized as a means to create wealth indices from survey data, the analysis is normally performed on large samples. When the comparison base is a community, the number of households interviewed in that community may be small, even if they constitute a random sample for the community, as is the case here for Primary Sampling Units (PSUs). The literature does not provide information about the minimum sample size necessary to correctly rank households using PCA. To evaluate whether the method correctly ranks households in smaller samples, I ran simulation tests using all PSUs with sample size N≥55 (63 PSUs in NFHS-2 and 99 in NFHS-3). The following procedure was followed:
i. PCA is performed for the full-size PSUs as in the main analysis; results are recorded in PCALL for scores and PCQALL for quintiles.
ii. Numbers from 1 to N/5 are randomly assigned to households in each quintile (regardless of how they ranked in score); samples are reduced to n=50, 40, 35, 30, 25, 20, 15, 10, and 5 households by keeping households numbered 1 through n/5 for each value of n.
iii. The same PCA is run for each sample size; scores are recorded in PCn and quintiles in PCQn.
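Steps i to iii can be sketched for a single PSU as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: the asset data are synthetic, the function names (first_pc_scores, quintiles, random_numbering, keep_first) are hypothetical, and power iteration on the correlation matrix stands in for whatever PCA routine was actually used. Starting the iteration from a vector of ones fixes the sign of the component, which is a convention chosen here so that full-sample and reduced-sample scores point in the same direction.

```python
import random
import statistics


def first_pc_scores(rows, iters=200):
    """Scores on the first principal component of the standardized columns.

    Power iteration on the correlation matrix stands in for a full PCA routine.
    """
    n, p = len(rows), len(rows[0])
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]  # constant column: avoid /0
    z = [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in rows]
    corr = [[sum(zi[a] * zi[b] for zi in z) / n for b in range(p)]
            for a in range(p)]
    v = [1.0] * p  # starting vector; also fixes the sign of the component
    for _ in range(iters):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:  # no variation in any column
            break
        v = [x / norm for x in w]
    return [sum(zi[j] * v[j] for j in range(p)) for zi in z]


def quintiles(scores):
    """Quintile (0 = lowest fifth, 4 = highest) from the rank of each score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    q = [0] * len(scores)
    for rank, i in enumerate(order):
        q[i] = rank * 5 // len(scores)
    return q


def random_numbering(full_quintiles, rng):
    """Step ii: a random ordering of the households within each quintile."""
    numbering = []
    for q in range(5):
        members = [i for i, qi in enumerate(full_quintiles) if qi == q]
        rng.shuffle(members)
        numbering.append(members)
    return numbering


def keep_first(numbering, n):
    """Step ii: keep the households numbered 1 through n/5 in each quintile."""
    return [i for members in numbering for i in members[: n // 5]]


# Synthetic PSU (assumption): 60 households, 6 binary assets driven by latent wealth.
rng = random.Random(0)
thresholds = (-1.0, -0.5, 0.0, 0.5, 1.0, 1.5)
rows = []
for _ in range(60):
    wealth = rng.gauss(0, 1)
    rows.append([1.0 if wealth + rng.gauss(0, 1) > t else 0.0 for t in thresholds])

pc_all = first_pc_scores(rows)              # step i: PCALL
pcq_all = quintiles(pc_all)                 # step i: PCQALL
numbering = random_numbering(pcq_all, rng)  # step ii
results = {}
for n in (50, 40, 35, 30, 25, 20, 15, 10, 5):
    kept = keep_first(numbering, n)
    pc_n = first_pc_scores([rows[i] for i in kept])  # step iii: PCn
    results[n] = (kept, pc_n, quintiles(pc_n))       # step iii: PCQn
```

Because every reduced sample keeps the lowest-numbered households, the samples are nested: the five households retained at n=5 appear in every larger sample, which is what allows their scores to be compared across all sample sizes.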
Using this procedure for all PSUs, ten different wealth scores and quintiles are obtained, based on sample sizes from the full N≥55 down to five households (one per original full-sample quintile). Because the selection of households within each quintile affects the resulting PC scores, the procedure is repeated k=50 times, each time with a new randomization of the ordering of households in each quintile (adding more runs did not change the summary statistics of the correlation coefficients). The principal components scores and quintiles thus obtained are recorded in PCnk and PCQnk, k=1,2,…,50. Correlation coefficients between PCALL and PCnk (snk) and between PCQALL and PCQnk (qnk), k=1,2,…,50, are calculated for all n, based on the five households per PSU that obtained the number 1 in the random numbering of the kth run, and are recorded in the random variables sn and qn.
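The correlation step can be sketched as follows, assuming a Pearson correlation computed over the pooled households of all PSUs within a run, and a sample standard deviation across runs (the text does not specify either choice). The function names and the tiny input values are hypothetical and serve only to show the mechanics.

```python
import statistics


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def run_correlation(psus):
    """Correlation for one run k and one sample size n.

    `psus` holds one (full_values, reduced_values) pair per PSU, each listing
    the five households numbered 1 in that run; all PSUs are pooled.
    """
    full = [v for f, _ in psus for v in f]
    reduced = [v for _, r in psus for v in r]
    return pearson(full, reduced)


def summarize(correlations):
    """Min, max, mean, and standard deviation across the k runs."""
    return {
        "min": min(correlations),
        "max": max(correlations),
        "mean": statistics.fmean(correlations),
        "st_dev": statistics.stdev(correlations),  # sample st. dev. (assumption)
    }


# Illustrative quintile values for two hypothetical PSUs in two runs.
run_1 = [([1, 2, 3, 4, 5], [1, 2, 3, 5, 4]), ([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])]
run_2 = [([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]), ([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])]
q_runs = [run_correlation(run_1), run_correlation(run_2)]
summary = summarize(q_runs)
```

In the actual tests each run pools five households from every PSU, and the summary is taken over 50 runs rather than the two shown here.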
Table S7 gives summary statistics for sn and qn by sample size n. Correlations of quintiles are slightly lower than correlations of scores due to threshold effects, but correlations with the full-sample results still average above .90 for all n>20 and above .80 for n>10. Results are very similar across the two samples. The median correlation is virtually identical to the mean in all cases; Figure S1 shows the gradual change in median correlations between full-sample and reduced-sample scores and quintiles as sample sizes are reduced. The tests reveal that the principal components procedure is relatively robust to the number of households in the sample. Correlations decrease with the size of the sample, but there are no obvious breaks down to sample sizes of 10. The rate at which deterioration occurs increases slightly when sample sizes fall below 25 (although the range of y-values chosen for Figure S1 emphasizes the magnitude of the deterioration).
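The threshold effect can be illustrated with a small hypothetical example: two households whose scores lie close to a quintile boundary swap places in the reduced sample, which barely changes the score correlation but moves both households across the boundary. The scores below are invented for illustration, and the helper names are assumptions.

```python
import statistics


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def to_quintiles(scores):
    """Quintile (1 = lowest fifth, 5 = highest) from the rank of each score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    q = [0] * len(scores)
    for rank, i in enumerate(order):
        q[i] = rank * 5 // len(scores) + 1
    return q


# Hypothetical scores: households 3 and 4 sit on either side of a threshold.
full_scores = [0.10, 0.20, 0.29, 0.31, 0.50]
reduced_scores = [0.10, 0.20, 0.31, 0.29, 0.50]

score_r = pearson(full_scores, reduced_scores)
quintile_r = pearson(to_quintiles(full_scores), to_quintiles(reduced_scores))
```

In this example the score correlation stays above .99 while the quintile correlation falls to .90, because a very small change in score is converted into a full one-quintile change for two of the five households.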
Table S7. Summary statistics on correlations between full-sized and reduced-sample principal components scores from 50 independent randomized household selections.
NFHS-2 (based on 63 PSUs): columns 2 to 5 / NFHS-3 (based on 99 PSUs): columns 6 to 9
Sample size (n) / Min / Max / Mean / St. dev. / Min / Max / Mean / St. dev.
50 (10 per quintile) / .96 / 1 / .99 / .01 / .96 / 1 / .99 / .01
40 / .92 / .99 / .98 / .02 / .95 / .99 / .98 / .01
35 / .92 / .99 / .98 / .02 / .92 / .99 / .97 / .01
30 / .87 / .99 / .97 / .02 / .87 / .98 / .96 / .02
25 / .86 / .98 / .96 / .02 / .91 / .98 / .95 / .02
20 / .85 / .97 / .94 / .02 / .86 / .97 / .93 / .02
15 / .84 / .96 / .92 / .03 / .85 / .95 / .91 / .03
10 / .77 / .94 / .90 / .03 / .80 / .93 / .89 / .03
5 / .78 / .91 / .85 / .03 / .76 / .89 / .83 / .03
Note: Correlations for each sample size are calculated using the five observations per PSU that have principal components scores at every sample size, so 315 observations in NFHS-2 and 495 in NFHS-3 are used to calculate each correlation. The number of observations is the same across the 50 runs, but the households differ.
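The observation counts in the note follow directly from the number of PSUs and the five households retained per PSU, as this short check shows (the variable names are illustrative):

```python
HOUSEHOLDS_KEPT_PER_PSU = 5  # one household per full-sample quintile
psus = {"NFHS-2": 63, "NFHS-3": 99}

# Observations entering each correlation: PSUs times households kept per PSU.
observations = {
    survey: count * HOUSEHOLDS_KEPT_PER_PSU for survey, count in psus.items()
}
```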
Fig. S1. Median correlations between reduced- and full-sample principal components results