Example of Two-Stage Cluster Sampling

A garment manufacturer has N = 90 plants located throughout the United States and wants to estimate the average number of hours that the sewing machines were down for repairs in the past months. Because the plants are widely scattered, she decides to use cluster sampling, specifying each plant as a cluster of machines. Each plant contains many machines, and checking the repair record for each machine would be time-consuming. Therefore she uses two-stage cluster sampling. Enough time and money are available to sample n = 10 plants and approximately 20% of the machines in each plant. The resulting data are given in the table below.

Plant / Mi / mi / Downtime (in hours) / ȳi / si²
1 / 50 / 10 / 5, 7, 9, 0, 11, 2, 8, 4, 3, 5 / 5.40 / 11.38
2 / 65 / 13 / 4, 3, 7, 2, 11, 0, 1, 9, 4, 3, 2, 1, 5 / 4.00 / 10.67
3 / 45 / 9 / 5, 6, 4, 11, 12, 0, 1, 8, 4 / 5.67 / 16.75
4 / 48 / 10 / 6, 4, 0, 1, 0, 9, 8, 4, 6, 10 / 4.80 / 13.29
5 / 52 / 10 / 11, 4, 3, 1, 0, 2, 8, 6, 5, 3 / 4.30 / 11.12
6 / 58 / 12 / 12, 11, 3, 4, 2, 0, 0, 1, 4, 3, 2, 4 / 3.83 / 14.88
7 / 42 / 8 / 3, 7, 6, 7, 8, 4, 3, 2 / 5.00 / 5.14
8 / 66 / 13 / 3, 6, 4, 3, 2, 2, 8, 4, 0, 4, 5, 6, 3 / 3.85 / 4.31
9 / 40 / 8 / 6, 4, 7, 3, 9, 1, 4, 5 / 4.88 / 6.13
10 / 56 / 11 / 6, 7, 5, 10, 11, 2, 1, 4, 0, 5, 4 / 5.00 / 11.80
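The last two columns of the table are the per-plant sample mean and sample variance of the listed downtimes. As a quick check, the following sketch recomputes them from the raw values; the variable names (`downtime`, `summary`) are illustrative only, and the results should agree with the table up to rounding in the last digit.

```python
from statistics import mean, variance

# Raw downtimes (hours) copied from the table, keyed by plant number.
downtime = {
    1: [5, 7, 9, 0, 11, 2, 8, 4, 3, 5],
    2: [4, 3, 7, 2, 11, 0, 1, 9, 4, 3, 2, 1, 5],
    3: [5, 6, 4, 11, 12, 0, 1, 8, 4],
    4: [6, 4, 0, 1, 0, 9, 8, 4, 6, 10],
    5: [11, 4, 3, 1, 0, 2, 8, 6, 5, 3],
    6: [12, 11, 3, 4, 2, 0, 0, 1, 4, 3, 2, 4],
    7: [3, 7, 6, 7, 8, 4, 3, 2],
    8: [3, 6, 4, 3, 2, 2, 8, 4, 0, 4, 5, 6, 3],
    9: [6, 4, 7, 3, 9, 1, 4, 5],
    10: [6, 7, 5, 10, 11, 2, 1, 4, 0, 5, 4],
}

# Per-plant sample size m_i, sample mean, and sample variance (n - 1 divisor).
summary = {
    plant: (len(y), mean(y), variance(y))
    for plant, y in downtime.items()
}

for plant, (m_i, ybar_i, s2_i) in summary.items():
    print(f"{plant:>2}  m_i={m_i:>2}  ybar_i={ybar_i:.2f}  s_i^2={s2_i:.2f}")
```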

We want to estimate the average downtime per machine, and we know that the total number of machines in all plants is K = 4500.

The ANOVA table is given below:

Source / Df / Sum of Squares / Mean Square / F / p
Model / 9 / 40.704487 / 4.522721 / 0.43 / 0.9177
Error / 94 / 996.333974 / 10.599298
Total / 103 / 1037.038462 / 10.068335

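The between-plant mean square (Model, 4.52) is smaller than the within-plant mean square (Error, 10.60), giving F = 0.43 with p = 0.9177. This indicates that each cluster is relatively heterogeneous; thus cluster sampling is at least as efficient as simple random sampling.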
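The sums of squares in the ANOVA table can be recovered from the raw downtimes by treating each plant as a group. The sketch below is a plain one-way ANOVA decomposition written without any statistics library; the names (`groups`, `ss_between`, `ss_within`) are illustrative.

```python
# One-way ANOVA with plants as groups, computed from the raw downtimes.
groups = [
    [5, 7, 9, 0, 11, 2, 8, 4, 3, 5],
    [4, 3, 7, 2, 11, 0, 1, 9, 4, 3, 2, 1, 5],
    [5, 6, 4, 11, 12, 0, 1, 8, 4],
    [6, 4, 0, 1, 0, 9, 8, 4, 6, 10],
    [11, 4, 3, 1, 0, 2, 8, 6, 5, 3],
    [12, 11, 3, 4, 2, 0, 0, 1, 4, 3, 2, 4],
    [3, 7, 6, 7, 8, 4, 3, 2],
    [3, 6, 4, 3, 2, 2, 8, 4, 0, 4, 5, 6, 3],
    [6, 4, 7, 3, 9, 1, 4, 5],
    [6, 7, 5, 10, 11, 2, 1, 4, 0, 5, 4],
]

all_obs = [y for g in groups for y in g]
grand_mean = sum(all_obs) / len(all_obs)

# Between-plant (Model) and within-plant (Error) sums of squares.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(sum((y - sum(g) / len(g)) ** 2 for y in g) for g in groups)

df_between = len(groups) - 1
df_within = len(all_obs) - len(groups)
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

print(f"Model: df={df_between}, SS={ss_between:.6f}, MS={ms_between:.6f}")
print(f"Error: df={df_within}, SS={ss_within:.6f}, MS={ms_within:.6f}")
print(f"F = {f_stat:.2f}")
```

An F ratio below 1 means the plant means vary less than would be expected from the machine-to-machine variation alone, which is the favorable case for cluster sampling.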

Unbiased estimation:

An unbiased estimate of the population mean is $\hat\mu = \frac{N}{Kn}\sum_{i=1}^{n} M_i\bar y_i \approx 4.80$ hours.
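The arithmetic behind the unbiased estimate can be sketched as follows, assuming the usual two-stage form $\hat\mu = \frac{N}{Kn}\sum M_i\bar y_i$ and using the rounded plant means from the table. The variable names are illustrative.

```python
N, n, K = 90, 10, 4500  # plants in population, plants sampled, machines overall

# (M_i, ybar_i) for each sampled plant, as tabulated (means rounded to 2 dp).
plants = [(50, 5.40), (65, 4.00), (45, 5.67), (48, 4.80), (52, 4.30),
          (58, 3.83), (42, 5.00), (66, 3.85), (40, 4.88), (56, 5.00)]

# Estimated total downtime in each sampled plant: M_i * ybar_i.
plant_totals = [M_i * ybar_i for M_i, ybar_i in plants]

# Scale the sampled plant totals up to all N plants, then divide by K machines.
mu_hat = N / (K * n) * sum(plant_totals)

print(f"mu_hat = {mu_hat:.4f} hours")
```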

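The estimated variance is given by

$$\hat V(\hat\mu) = \left(\frac{N-n}{N}\right)\frac{1}{n\bar M^2}\,s_b^2 + \frac{1}{nN\bar M^2}\sum_{i=1}^{n} M_i^2\left(\frac{M_i-m_i}{M_i}\right)\frac{s_i^2}{m_i}, \qquad s_b^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(M_i\bar y_i - \bar M\hat\mu\right)^2,$$

where $\bar M$ is the mean cluster size.

The two terms in the variance estimate will be calculated separately. The second (within-plant) term is built from the sum $\sum_{i=1}^{n} M_i^2\left(\frac{M_i-m_i}{M_i}\right)\frac{s_i^2}{m_i}$, which uses the sample variances in the last column of the table. The mean cluster size is estimated to be $\bar M \approx \frac{1}{n}\sum_{i=1}^{n} M_i = 522/10 = 52.2$, so that

$s_b^2 = 892.3468481$.

Then the estimated variance $\hat V(\hat\mu)$ is the sum of the two terms, and the standard error is $\sqrt{\hat V(\hat\mu)}$.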

A 95% C.I. estimate of the average downtime is $\hat\mu \pm z_{0.025}\sqrt{\hat V(\hat\mu)}$, with $z_{0.025} = 1.96$.
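The pieces of the variance calculation can be sketched as below. This is a sketch under stated assumptions, not a reproduction of the original worksheet: it uses the rounded table values, takes the estimated mean cluster size 52.2 in both terms, uses the divisor $nN\bar M^2$ for the within-plant term, and uses the normal multiplier 1.96 for the interval. With the tabulated inputs it reproduces the quoted $s_b^2 = 892.3468481$; the remaining figures are whatever these assumptions imply.

```python
from math import sqrt

N, n, K = 90, 10, 4500  # plants in population, plants sampled, machines overall

# (M_i, m_i, ybar_i, s_i^2) for each sampled plant, as tabulated.
plants = [
    (50, 10, 5.40, 11.38), (65, 13, 4.00, 10.67), (45, 9, 5.67, 16.75),
    (48, 10, 4.80, 13.29), (52, 10, 4.30, 11.12), (58, 12, 3.83, 14.88),
    (42, 8, 5.00, 5.14), (66, 13, 3.85, 4.31), (40, 8, 4.88, 6.13),
    (56, 11, 5.00, 11.80),
]

mu_hat = N / (K * n) * sum(M * yb for M, _, yb, _ in plants)
M_bar = sum(M for M, *_ in plants) / n  # estimated mean cluster size, 52.2

# Between-plant component: spread of M_i * ybar_i about M_bar * mu_hat.
s_b2 = sum((M * yb - M_bar * mu_hat) ** 2 for M, _, yb, _ in plants) / (n - 1)

# Within-plant component: sum of M_i^2 * (M_i - m_i)/M_i * s_i^2 / m_i.
within_sum = sum(M**2 * (M - m) / M * s2 / m for M, m, _, s2 in plants)

between_term = (N - n) / N * s_b2 / (n * M_bar**2)
within_term = within_sum / (n * N * M_bar**2)  # assumed divisor n*N*M_bar^2

var_hat = between_term + within_term
se = sqrt(var_hat)

z = 1.96  # assumption: normal approximation for the 95% interval
ci = (mu_hat - z * se, mu_hat + z * se)

print(f"s_b^2 = {s_b2:.7f}, within sum = {within_sum:.2f}")
print(f"variance = {var_hat:.6f}, SE = {se:.4f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Using the known mean cluster size $K/N = 50$ instead of 52.2 is an equally defensible choice and changes the result slightly.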

Ratio estimation:

The ratio estimator of the population mean is $\hat\mu_r = \frac{\sum_{i=1}^{n} M_i\bar y_i}{\sum_{i=1}^{n} M_i} \approx 4.60$ hours.

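Then the estimated variance of the estimator is

$$\hat V(\hat\mu_r) = \left(\frac{N-n}{N}\right)\frac{1}{n\bar M^2}\,s_r^2 + \frac{1}{nN\bar M^2}\sum_{i=1}^{n} M_i^2\left(\frac{M_i-m_i}{M_i}\right)\frac{s_i^2}{m_i},$$

where

$$s_r^2 = \frac{1}{n-1}\sum_{i=1}^{n} M_i^2\left(\bar y_i - \hat\mu_r\right)^2.$$

Then the estimated variance is 0.1299795637,

and a 95% C.I. estimate of the average downtime is $\hat\mu_r \pm z_{0.025}\sqrt{\hat V(\hat\mu_r)}$.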
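The ratio calculation can be sketched in the same way, again from the rounded table values and with the mean cluster size taken as 52.2. Two versions of the variance are computed, both as assumptions about the intended formula: one with the within-plant term divided by $nN\bar M^2$ (the usual two-stage form), and one with it divided by $N\bar M^2$. With these inputs the second version reproduces the figure 0.1299795637 quoted above, while the first gives a smaller value of about 0.049.

```python
N, n = 90, 10  # plants in population, plants sampled

# (M_i, m_i, ybar_i, s_i^2) for each sampled plant, as tabulated.
plants = [
    (50, 10, 5.40, 11.38), (65, 13, 4.00, 10.67), (45, 9, 5.67, 16.75),
    (48, 10, 4.80, 13.29), (52, 10, 4.30, 11.12), (58, 12, 3.83, 14.88),
    (42, 8, 5.00, 5.14), (66, 13, 3.85, 4.31), (40, 8, 4.88, 6.13),
    (56, 11, 5.00, 11.80),
]

size_sum = sum(M for M, *_ in plants)
M_bar = size_sum / n  # estimated mean cluster size, 52.2

# Ratio estimator: estimated plant totals over the sampled plant sizes.
mu_r = sum(M * yb for M, _, yb, _ in plants) / size_sum

# Between-plant component about the ratio estimate.
s_r2 = sum(M**2 * (yb - mu_r) ** 2 for M, _, yb, _ in plants) / (n - 1)

# Within-plant component, same sum as for the unbiased estimator.
within_sum = sum(M**2 * (M - m) / M * s2 / m for M, m, _, s2 in plants)

between_term = (N - n) / N * s_r2 / (n * M_bar**2)

# Usual form: within-plant term divided by n * N * M_bar^2.
var_ratio = between_term + within_sum / (n * N * M_bar**2)

# Variant that matches the quoted figure: divided by N * M_bar^2.
var_quoted_form = between_term + within_sum / (N * M_bar**2)

print(f"mu_r = {mu_r:.4f}, s_r^2 = {s_r2:.3f}")
print(f"variance (n*N divisor) = {var_ratio:.7f}")
print(f"variance (N divisor)   = {var_quoted_form:.10f}")
```

Under either divisor the ratio variance exceeds the corresponding unbiased variance computed from the same inputs, so the comparison between the two estimators is unaffected.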

In this case, ratio estimation gives less precision than unbiased estimation. Let’s examine the reason. The scatterplot of cluster totals versus cluster sizes is shown below. A ratio estimator works best when the line of best fit passes through the origin; here it clearly does not. In addition, the regression of cluster totals on cluster sizes shows a rather weak correlation, 0.5488 (R Square = 0.30), so there is not a strong relationship between cluster totals and cluster sizes.

SUMMARY OUTPUT
Regression Statistics
Multiple R / 0.548818092
R Square / 0.301201298
Adjusted R Square / 0.213851461
Standard Error / 5.01216653
Observations / 10
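The regression statistics can be checked with a short least-squares sketch. It rests on one reading of the variables: the reported values are reproduced when "cluster totals" means the sum of the sampled downtimes in each plant (the row sums of the downtime column) and "cluster sizes" means the $M_i$. The names below are illustrative.

```python
from math import sqrt

sizes = [50, 65, 45, 48, 52, 58, 42, 66, 40, 56]          # M_i
sample_totals = [54, 52, 51, 48, 43, 46, 40, 50, 39, 55]  # row sums of downtimes

n = len(sizes)
x_bar = sum(sizes) / n
y_bar = sum(sample_totals) / n

# Corrected sums of squares and cross-products.
sxx = sum((x - x_bar) ** 2 for x in sizes)
syy = sum((y - y_bar) ** 2 for y in sample_totals)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, sample_totals))

r = sxy / sqrt(sxx * syy)          # "Multiple R" in a simple regression
slope = sxy / sxx
intercept = y_bar - slope * x_bar  # far from zero: line misses the origin

# Residual standard error with n - 2 degrees of freedom.
resid_se = sqrt((syy - slope * sxy) / (n - 2))

print(f"r = {r:.6f}, R^2 = {r * r:.6f}")
print(f"fit: total = {intercept:.2f} + {slope:.4f} * size")
print(f"residual SE = {resid_se:.5f}")
```

The fitted intercept is roughly 30 hours, which is what "does not pass through the origin" refers to.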