Statistics 475 Notes 12
Reading: Lohr, Chapters 5.2-5.3, 5.6
I. Review
Cluster Sampling: we first divide the units in the population into clusters and take a probability sample of clusters and then take a sample of the units in the chosen clusters.
In the simplest type of cluster sampling, one stage cluster sampling, our sample consists of all units in the chosen clusters.
Cluster sampling is often a good design under the following conditions:
- A good frame listing population units either is not available or is very costly to obtain, but a frame listing clusters is easily obtained.
- The cost of obtaining observations increases as the distance separating the units increases.
II. One stage cluster sampling
Example: A sociologist wants to estimate the per-capita income in a certain small city. There are a total of 2500 residents in the city. However, no list of resident adults is available. Cluster sampling is a logical choice for the survey design because no list of units is available. The city is marked off into rectangular blocks, except for two industrial areas and three parks that contain only a few houses. The sociologist decides that each of the city blocks will be considered one cluster, the two industrial areas will be considered one cluster, and the three parks will be considered one cluster. The clusters are numbered on a city map, with the numbers from 1 to 415. The sociologist has enough time and money to sample 25 clusters and to interview every household within each cluster. Hence, 25 numbers between 1 and 415 are selected at random and the clusters having these numbers are marked on the map. Interviewers are then assigned to each of the sampled clusters.
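The selection step in this example can be sketched in R. This is only an illustration, assuming the 25 clusters are drawn by simple random sampling without replacement; the seed and the names `N.clusters`, `n.clusters` and `sampled.clusters` are assumptions introduced here, not part of the original survey.

```r
# Hypothetical sketch: draw 25 cluster labels from 1..415 by simple random sampling
set.seed(475)      # arbitrary seed, assumed only so the draw can be reproduced
N.clusters <- 415  # clusters numbered on the city map
n.clusters <- 25   # clusters the budget allows
# sample() without replacement gives each cluster the same chance of selection
sampled.clusters <- sort(sample(1:N.clusters, n.clusters, replace = FALSE))
sampled.clusters
```

Every household in each of the selected clusters would then be interviewed.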
Sample data:
Cluster / Number of residents / Total income in cluster ($) / Cluster / Number of residents / Total income in cluster ($)
1 / 8 / 323,976 / 14 / 10 / 569,351
2 / 12 / 585,648 / 15 / 9 / 384,167
3 / 4 / 160,480 / 16 / 3 / 67,993
4 / 5 / 174,957 / 17 / 6 / 213,193
5 / 6 / 389,502 / 18 / 5 / 109,360
6 / 6 / 229,814 / 19 / 5 / 201,934
7 / 7 / 390,556 / 20 / 4 / 138,059
8 / 5 / 239,585 / 21 / 6 / 151,352
9 / 8 / 390,675 / 22 / 8 / 271,575
10 / 3 / 111,801 / 23 / 7 / 351,824
11 / 2 / 86,053 / 24 / 3 / 123,520
12 / 6 / 221,220 / 25 / 8 / 302,499
13 / 5 / 99,590 / Sum / 151 / 6,288,684
Notation:
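\(N\) = number of psu's (clusters) in the population
\(M_i\) = number of units (ultimate sampling units) in cluster i
\(y_{ij}\) = measurement on the jth unit in the ith cluster
Total in cluster i: \(t_i = \sum_{j=1}^{M_i} y_{ij}\)
Population total: \(t = \sum_{i=1}^{N} t_i\)
Population variance of cluster totals: \(S_t^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(t_i - \frac{t}{N}\right)^2\)
\(n\) = number of clusters sampled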
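To make the notation concrete, here is a small sketch on a made-up population of three clusters. The values and the names `M`, `t.i`, `t.pop`, `S.t.sq` and `K` are assumptions chosen for illustration (cluster sizes, cluster totals, population total, population variance of cluster totals, and number of units in the population).

```r
# Hypothetical toy population: 3 clusters with made-up measurements
y <- list(c(2, 4), c(1, 3, 5), c(6))
N <- length(y)           # number of clusters in the population
M <- sapply(y, length)   # cluster sizes: 2, 3, 1
t.i <- sapply(y, sum)    # cluster totals: 6, 9, 6
t.pop <- sum(t.i)        # population total: 21
S.t.sq <- var(t.i)       # var() divides by N - 1, matching the definition
K <- sum(M)              # number of units in the population: 6
c(t.pop = t.pop, S.t.sq = S.t.sq, K = K)
```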
Estimators of population mean and population total:
Option #1: Unbiased Estimates
An unbiased estimate of the population total can be obtained by thinking of the clusters as the sampling units and the totals in the clusters as the measurements.
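Unbiased estimate of population total: \(\hat{t}_{unb} = \frac{N}{n}\sum_{i \in S} t_i\), where \(S\) is the set of sampled clusters and \(t_i\) is the total in cluster i.
We have a simple random sample of clusters, so the standard error of \(\hat{t}_{unb}\) is
\[ SE(\hat{t}_{unb}) = N\sqrt{\left(1-\frac{n}{N}\right)\frac{s_t^2}{n}}, \]
where \(s_t^2 = \frac{1}{n-1}\sum_{i \in S}\left(t_i - \frac{\hat{t}_{unb}}{N}\right)^2\) is the sample variance of the cluster totals.
Let \(K = \sum_{i=1}^{N} M_i\) be the number of ultimate sampling units in the population. An unbiased estimate of the population mean is
\[ \hat{\bar{y}}_{unb} = \frac{\hat{t}_{unb}}{K} \quad\text{and}\quad SE(\hat{\bar{y}}_{unb}) = \frac{SE(\hat{t}_{unb})}{K}. \]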
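The unbiasedness claim can be checked exactly on a tiny population by listing every possible sample of clusters. The sketch below uses made-up cluster totals for five clusters and samples of size two; all numbers and names are assumptions for illustration.

```r
# Hypothetical cluster totals for a population of 5 clusters (made-up numbers)
t.i <- c(10, 14, 7, 22, 12)
N <- length(t.i)
n <- 2
# All choose(5, 2) = 10 possible simple random samples of clusters, one per column
samples <- combn(N, n)
# Estimate (N/n) * (sum of sampled cluster totals) for each possible sample
estimates <- apply(samples, 2, function(s) (N / n) * sum(t.i[s]))
# Each sample is equally likely, so the expected value is the plain average
expected.value <- mean(estimates)
true.total <- sum(t.i)
# Exact sampling variance versus the formula N^2 * (1 - n/N) * S_t^2 / n
exact.var <- mean((estimates - true.total)^2)
formula.var <- N^2 * (1 - n / N) * var(t.i) / n
c(expected.value, true.total, exact.var, formula.var)
```

The average of the estimates over all possible samples equals the population total, and the exact sampling variance agrees with the variance formula computed from the population variance of the cluster totals. In practice that variance is unknown and is replaced by the sample variance of the cluster totals, which gives the standard error used in these notes.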
R code for example:
residents=c(8,12,4,5,6,6,7,5,8,3,2,6,5,10,9,3,6,5,5,4,6,8,7,3,8);
total.incomes=c(323976,585648,160480,174957,389502,229814,390556,239585,390675,111801,86053,221220,99590,569351,384167,67993,213193,109360,201934,138059,151352,271575,351824,123520,302499);
N=415;
n=25;
# Unbiased estimate of population total
total.unbiased=(N/n)*sum(total.incomes);
st.sq=var(total.incomes);
se.total.unbiased=N*sqrt((1-n/N)*st.sq/n);
total.unbiased;
[1] 104392154
se.total.unbiased;
[1] 11485990
# Unbiased estimate of population mean
K=2500;
ybar.unbiased=total.unbiased/K;
se.ybar.unbiased=se.total.unbiased/K;
ybar.unbiased;
[1] 41756.86
se.ybar.unbiased;
[1] 4594.396
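However, to use this estimate of the population mean, we need to know \(K\), the number of units in the population, and we often know the cluster size \(M_i\) only for the sampled clusters. Furthermore, we usually expect the cluster total \(t_i\) to be correlated with the cluster size \(M_i\).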
> cor(residents,total.incomes)
[1] 0.9028895
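The correlation between the cluster sizes and cluster totals argues for using ratio estimation with the cluster size \(M_i\) as the auxiliary variable.
Option #2: Ratio estimates
The population mean can be expressed as the ratio of the mean of the cluster totals to the mean of the cluster sizes:
\[ \bar{y}_U = \frac{t}{K} = \frac{\sum_{i=1}^{N} t_i / N}{\sum_{i=1}^{N} M_i / N}. \]
The ratio estimates of the population mean and population total are:
\[ \hat{\bar{y}}_r = \frac{\sum_{i \in S} t_i}{\sum_{i \in S} M_i}, \qquad \hat{t}_r = K\,\hat{\bar{y}}_r. \]
We can think of our sample as a simple random sample of cluster totals on which we are using a ratio estimator with cluster size as the auxiliary variable.
Using the formula for the standard error of a ratio from Chapter 3, we have
\[ SE(\hat{\bar{y}}_r) = \sqrt{\left(1-\frac{n}{N}\right)\frac{1}{n\bar{M}_U^2}\,\frac{\sum_{i \in S}\left(t_i - \hat{\bar{y}}_r M_i\right)^2}{n-1}} \]
and consequently
\[ SE(\hat{t}_r) = K\,SE(\hat{\bar{y}}_r). \]
If \(\bar{M}_U = K/N\), the average cluster size in the population, is unknown, one may substitute the average of the cluster sizes in the sample, \(\bar{M}_S\), for \(\bar{M}_U\) in the above formulas.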
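The substitution of the sample average cluster size can be sketched as a small helper function. The function name `ratio.mean` and its arguments are assumptions introduced here for illustration; the data are the 25 sampled clusters from the example, and 415 and 2500 are the numbers of clusters and residents given there.

```r
# Sketch of the ratio estimate of the mean and its standard error.
# ratio.mean and its argument names are hypothetical, not from the notes.
ratio.mean <- function(M.i, t.i, N, Mbar.U = NULL) {
  n <- length(t.i)
  ybar.r <- sum(t.i) / sum(M.i)
  # If the population average cluster size is unknown, use the sample average
  if (is.null(Mbar.U)) Mbar.U <- mean(M.i)
  resid.ms <- sum((t.i - ybar.r * M.i)^2) / (n - 1)
  se <- sqrt((1 - n / N) * resid.ms / (n * Mbar.U^2))
  c(estimate = ybar.r, se = se)
}

residents <- c(8, 12, 4, 5, 6, 6, 7, 5, 8, 3, 2, 6, 5,
               10, 9, 3, 6, 5, 5, 4, 6, 8, 7, 3, 8)
total.incomes <- c(323976, 585648, 160480, 174957, 389502, 229814, 390556,
                   239585, 390675, 111801, 86053, 221220, 99590, 569351,
                   384167, 67993, 213193, 109360, 201934, 138059, 151352,
                   271575, 351824, 123520, 302499)

# Average cluster size known from the population (2500 residents, 415 clusters)
known <- ratio.mean(residents, total.incomes, N = 415, Mbar.U = 2500 / 415)
# Average cluster size unknown: falls back on the sample average, 151/25 = 6.04
unknown <- ratio.mean(residents, total.incomes, N = 415)
rbind(known, unknown)
```

The point estimate is the same in both calls; only the standard error changes, and only slightly here, because the sample average cluster size (6.04) is close to the population average (about 6.02).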
Application to example:
> # Ratio estimate of population mean and total
> ybar.ratio=sum(total.incomes)/sum(residents);
> Mbar.U=K/N;
> se.ybar.ratio=sqrt((1-n/N)*(1/(n*Mbar.U^2))*sum(residents^2*(total.incomes/residents-ybar.ratio)^2)/(n-1));
> t.ratio=K*ybar.ratio;
> # t.ratio is K*ybar.ratio, so its standard error is K*se.ybar.ratio
> se.t.ratio=K*se.ybar.ratio;
> ybar.ratio;
[1] 41646.91
> se.ybar.ratio;
[1] 2200.221
> t.ratio;
[1] 104117285
> se.t.ratio;
[1] 5500552
The standard error of the ratio estimate of the population mean is about half that of the unbiased estimate of the population mean ($2200 compared to $4594).
III. Comparison between cluster sampling and simple random sampling
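We will compare the variance of cluster sampling to that of simple random sampling in the case when all the cluster sizes equal a common value, \(M_i = M\) for all i, and \(K = NM\).
We can partition the total variability in the data into between-cluster and within-cluster variance, using an ANOVA table:
Source / df / SS / MS
Between clusters / \(N-1\) / \(SSB = \sum_{i=1}^{N}\sum_{j=1}^{M}(\bar{y}_{iU}-\bar{y}_U)^2\) / \(MSB = \frac{SSB}{N-1}\)
Within clusters / \(N(M-1)\) / \(SSW = \sum_{i=1}^{N}\sum_{j=1}^{M}(y_{ij}-\bar{y}_{iU})^2\) / \(MSW = \frac{SSW}{N(M-1)}\)
Total / \(NM-1\) / \(SSTO = \sum_{i=1}^{N}\sum_{j=1}^{M}(y_{ij}-\bar{y}_U)^2\) / \(S^2 = \frac{SSTO}{NM-1}\)
Here \(\bar{y}_{iU} = t_i/M\) is the mean of cluster i and \(\bar{y}_U = t/(NM)\) is the population mean.
In one stage cluster sampling, the variability of the sample mean depends entirely on the between-cluster variability. When every cluster has size \(M\), the population variance of the cluster totals is \(S_t^2 = M \cdot MSB\), so the cluster-sample estimate of the mean, \(\hat{\bar{y}}_{cluster}\), has variance
\[ V(\hat{\bar{y}}_{cluster}) = \frac{1}{(NM)^2}\,N^2\left(1-\frac{n}{N}\right)\frac{S_t^2}{n} = \left(1-\frac{n}{N}\right)\frac{MSB}{nM}. \]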
Cluster sampling is more efficient the more similar the cluster means are.
For a simple random sample of the same size as the cluster sample (\(nM\) units):
\[ V(\bar{y}_{SRS}) = \left(1-\frac{nM}{NM}\right)\frac{S^2}{nM} = \left(1-\frac{n}{N}\right)\frac{S^2}{nM}, \]
where \(S^2\) is the population variance of the individual units.
So we can see that if \(MSB > S^2\), then cluster sampling is less efficient (higher variance) than a simple random sample of the same size. This will happen when there is considerable variability between clusters. There is often considerable variability between clusters in cluster sampling. For estimating average reading scores, if we took a cluster sample of classes and sampled all students within the selected classes, we would likely find that average reading scores varied from class to class. An excellent reading teacher might raise the reading scores for the entire class; a class of students from an area with much poverty might tend to be undernourished and not score as high at reading. Unmeasured factors, such as teaching skill or poverty, can affect the overall mean for a cluster, and thus cause MSB to be large.
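The ANOVA decomposition and the efficiency comparison can be illustrated on a made-up population with equal cluster sizes. The values below are assumptions chosen so that the cluster means differ a lot; `V.cluster` and `V.srs` are hypothetical names for the variance of the mean under one stage cluster sampling and under a simple random sample of the same number of units.

```r
# Hypothetical population: 4 clusters (rows) of equal size 3 (made-up values)
y <- matrix(c(1, 2, 3,
              4, 5, 6,
              7, 8, 9,
              10, 11, 12), nrow = 4, byrow = TRUE)
N <- nrow(y)   # number of clusters
M <- ncol(y)   # common cluster size
ybar.U <- mean(y)       # population mean
ybar.i <- rowMeans(y)   # cluster means
# Sums of squares; y - ybar.i subtracts each row's own cluster mean
SSB <- M * sum((ybar.i - ybar.U)^2)
SSW <- sum((y - ybar.i)^2)
SSTO <- sum((y - ybar.U)^2)
MSB <- SSB / (N - 1)
S.sq <- SSTO / (N * M - 1)
# Variance of the estimated mean when n = 2 clusters (6 units) are sampled
n <- 2
V.cluster <- (1 - n / N) * MSB / (n * M)
V.srs <- (1 - n / N) * S.sq / (n * M)
c(SSB = SSB, SSW = SSW, SSTO = SSTO, MSB = MSB, S.sq = S.sq,
  V.cluster = V.cluster, V.srs = V.srs)
```

In this toy population nearly all of the variability is between clusters, so MSB is much larger than the overall variance and the cluster sample has the larger variance.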