Stat 475 Notes 4
Reading: Lohr, Chapter 3.1
Corrections from earlier notes:
- Notes 2: Bottom of page 10, top of page 11, the true standard deviation of the sample mean should be
truesd.samplemean=((sd(dioxin)/sqrt(50))*sqrt(1-50/646))
# true SD of sample mean
> truesd.samplemean
[1] .3589683
- Notes 2: Equation (1.1) on page 15 should be:
(1.1)
Chart of minimum sample size needed for margins of error for a proportion, assuming the worst case that the true proportion is 0.5
Margin of error / Sample size
0.01 / 9604
0.02 / 2401
0.03 / 1068
0.04 / 601
0.05 / 385
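The chart can be recomputed with a short R sketch. Equation (1.1) is not reproduced in these notes, so the formula used here is an assumption: the usual worst-case sample size $n = z^2 (0.5)(0.5)/e^2$ with $z$ the 0.975 normal quantile and no finite population correction. The names moe and min.n are illustrative.

```r
# Minimum sample size for a given margin of error, worst case p = 0.5.
# Assumed formula (no finite population correction): n = z^2 * p * (1 - p) / e^2
moe=c(0.01,0.02,0.03,0.04,0.05);
z=qnorm(0.975);                        # approximately 1.96
min.n=ceiling(z^2*0.5*(1-0.5)/moe^2);  # round up to the next whole unit
data.frame(moe,min.n);
```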
I. Ratio Estimation: Motivating examples
Suppose that for each member of the population, two variables are measured: $x_i$ and $y_i$.
Ratio Estimation: Estimation of $B = \bar{y}_U / \bar{x}_U = t_y / t_x$.
The natural estimator of $B$ from a simple random sample is
$\hat{B} = \bar{y} / \bar{x}$.
Motivation 1: We are directly interested in estimating $B$, i.e.,
the ratio of the population mean of $y$ to that of $x$ or equivalently the ratio of the population total of $y$ to that of $x$.
Examples:
(1) In a survey of households, $y_i$ is total monthly food budget, $x_i$ is total monthly expenditures and $B$ is the proportion of expenditures that are spent on food.
(2) In a survey of farms, $y_i$ is acres of wheat planted, $x_i$ is total acreage and $B$ is the proportion of acres planted that are wheat.
Motivation 2: We are interested in the population total of $y$, $t_y$, but the population size $N$ is unknown. Thus, we cannot estimate $t_y$ by $N\bar{y}$. However, suppose that the population total of $x$, $t_x$, is known. Note that $t_y = B t_x$. Then we can estimate $t_y$ by
$\hat{t}_{yr} = \hat{B} t_x$.
Example:
The wholesale price paid for oranges in large shipments is based on the sugar content of the load. The exact sugar content cannot be determined prior to the purchase and extraction of the juice from the entire load; however, it can be estimated. One approach is to estimate the mean sugar content per orange from a sample of oranges and then multiply by the number of oranges in the load. Unfortunately, this method is not feasible because it is too time consuming and costly to determine $N$ (that is, to count the total number of oranges in the load).
We can instead use the ratio estimation method with $y_i$ being the sugar content and $x_i$ being the weight of orange $i$. It is easy to find $t_x$ by weighing the shipment of oranges. Then, we can estimate $t_y$ by taking a sample of oranges, finding the ratio of the sample mean of the sugar content of the oranges to the sample mean of the weight of the oranges and then multiplying this ratio by $t_x$.
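The orange calculation can be sketched in R as follows. All numbers here (the ten sampled oranges and the total load weight of 1800 pounds) are hypothetical values made up for illustration, not data from the notes.

```r
# Hypothetical sample of 10 oranges (illustrative values, not real data)
sugar=c(0.021,0.030,0.025,0.022,0.033,0.027,0.019,0.021,0.023,0.025);  # y: sugar content (lb)
weight=c(0.40,0.48,0.43,0.42,0.50,0.46,0.39,0.41,0.42,0.44);           # x: weight (lb)
tx=1800;                          # hypothetical total weight of the load (lb)
Bhat=mean(sugar)/mean(weight);    # estimated pounds of sugar per pound of orange
tyhat=Bhat*tx;                    # ratio estimate of the total sugar in the load
c(Bhat=Bhat,tyhat=tyhat);
```

Note that the number of oranges in the load is never needed: only the sample ratio and the total weight enter the estimate.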
Motivation 3: We are interested in the population mean of $y$, $\bar{y}_U$, but know $\bar{x}_U$ and would like to use $x$ to improve our estimate by estimating $\bar{y}_U$ by $\hat{\bar{y}}_r = \hat{B}\bar{x}_U$ instead of $\bar{y}$.
Example: We will consider a population of short stay hospitals from Herkson (1976). The data is in hospitals.txt. Let $y_i$ denote the number of patients discharged from the $i$th hospital in January 1968. We are interested in $\bar{y}_U$. Without doing any sampling, for each hospital $i$, we have available the number of beds $x_i$ in the hospital and know $\bar{x}_U$.
The idea behind using the ratio estimator $\hat{\bar{y}}_r$ of $\bar{y}_U$ instead of using $\bar{y}$ is the following: We expect $x_i$ and $y_i$ to be closely related in the population, since a hospital with a large number of beds should tend to have a large number of discharges. A scatterplot of $y$ versus $x$ in the population can be drawn as follows:
R code:
hospitaldata=read.table("hospitals.txt",header=TRUE);
discharges=hospitaldata$discharges;
beds=hospitaldata$beds;
plot(beds,discharges);
Because of the close relationship between $x$ and $y$, if the sample underestimates the mean number of beds, i.e., $\bar{x} < \bar{x}_U$, then the sample probably also underestimates the mean number of discharges. The ratio estimate multiplies $\bar{y}$ by $\bar{x}_U / \bar{x}$, which is a better estimate than $\bar{y}$ as long as $y$ is closely related to $x$.
The following is a simulation study comparison of $\bar{y}$ to $\hat{\bar{y}}_r$ for the hospital data for samples of size 64.
# Simulation study comparison of usual sample mean to ratio estimator for
# mean number of discharges in hospital data
nosims=10000;
samplemeanvec=rep(0,nosims);
ratioestvec=rep(0,nosims);
N=393;
n=64;
beds.population.mean=mean(beds);
discharges.population.mean=mean(discharges);
discharges.population.mean;
> discharges.population.mean
[1] 814.603
for(i in 1:nosims){
tempsample=sample(N,n,replace=FALSE);
discharges.sample=discharges[tempsample];
beds.sample=beds[tempsample];
samplemeanvec[i]=mean(discharges.sample);
ratioestvec[i]=(mean(discharges.sample)/mean(beds.sample))*beds.population.mean;
}
# Bias of two estimators
bias.samplemean=mean(samplemeanvec)-discharges.population.mean;
> bias.samplemean
[1] -0.4465956
bias.ratioest=mean(ratioestvec)-discharges.population.mean;
> bias.ratioest
[1] 0.7369439
# Mean squared error of two estimators
mse.samplemean=mean((samplemeanvec-discharges.population.mean)^2);
> mse.samplemean
[1] 4482.136
mse.ratioest=mean((ratioestvec-discharges.population.mean)^2);
> mse.ratioest
[1] 904.0052
# Histograms of the two estimators
par(mfrow=c(2,1));
hist(samplemeanvec,xlim=c(500,1100));
hist(ratioestvec,xlim=c(500,1100));
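The ratio estimator is dramatically better than the sample mean for this population: both estimators have negligible bias in the simulation, but the mean squared error of the ratio estimator (904.0) is about one fifth that of the sample mean (4482.1).
We will study the general conditions under which $\hat{\bar{y}}_r$ is better than $\bar{y}$ below.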
II. Standard Errors of Ratio Estimators
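Unlike the sample mean, the ratio estimator is biased. Using a Taylor expansion,
$$\mathrm{Bias}(\hat{\bar{y}}_r) = E[\hat{\bar{y}}_r] - \bar{y}_U \approx \frac{1}{\bar{x}_U}\left[ B\,V(\bar{x}) - \mathrm{Cov}(\bar{x},\bar{y}) \right] = \left(1-\frac{n}{N}\right)\frac{1}{n\bar{x}_U}\left( B S_x^2 - R S_x S_y \right),$$
where $S_x$, $S_y$ are the population standard deviations of $x$ and $y$ and $R$ is the population correlation between $x$ and $y$.
Note that for a large sample size $n$ and a small sampling fraction $n/N$, we have $V(\bar{x}) \approx S_x^2/n$ and $|\mathrm{Cov}(\bar{x},\bar{y})| \le \sqrt{V(\bar{x})V(\bar{y})} \approx S_x S_y / n$ (where the inequality follows from the Cauchy-Schwarz inequality) so that the bias is small (of order $1/n$).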
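Using a Taylor expansion, we have the following formulas for estimates of the standard deviations of various ratio estimators (these are not unbiased estimators but for a large sample size $n$ and a small sampling fraction $n/N$, they are accurate). Here $s_e^2 = \frac{1}{n-1}\sum_{i \in S}(y_i - \hat{B}x_i)^2$ is the sample variance of the residuals $e_i = y_i - \hat{B}x_i$.
For estimating a population ratio $B$ (Lohr, pg. 61, 68)
$$SE(\hat{B}) = \sqrt{\left(1-\frac{n}{N}\right)\frac{s_e^2}{n\bar{x}^2}}$$
For estimating a population total $t_y$ (Lohr, pg. 61, 68)
$$SE(\hat{t}_{yr}) = t_x\,SE(\hat{B}) = \sqrt{\left(1-\frac{n}{N}\right)\left(\frac{t_x}{\bar{x}}\right)^2\frac{s_e^2}{n}}$$
For estimating a population mean $\bar{y}_U$ (Lohr, pg. 61, 68)
$$SE(\hat{\bar{y}}_r) = \bar{x}_U\,SE(\hat{B}) = \sqrt{\left(1-\frac{n}{N}\right)\left(\frac{\bar{x}_U}{\bar{x}}\right)^2\frac{s_e^2}{n}}$$
Using a central limit theorem and the Delta method, an approximate 95% confidence interval for a population quantity of interest (e.g., $\bar{y}_U$) is $\hat{\bar{y}}_r \pm 1.96\,SE(\hat{\bar{y}}_r)$.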
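The standard errors and confidence intervals can be collected in one small R function. This is a sketch under stated assumptions: it uses the sample mean of $x$ in the denominator of $SE(\hat{B})$ (one common variant), the function name ratio.summary is illustrative, and the sample of five units is hypothetical.

```r
# Ratio estimates, standard errors and approximate 95% CIs from a simple random sample.
# y, x: sample values; N: population size; xbarU: known population mean of x.
ratio.summary=function(y,x,N,xbarU){
  n=length(y);
  Bhat=mean(y)/mean(x);
  sehatsq=sum((y-Bhat*x)^2)/(n-1);            # residual variance s_e^2
  se.B=sqrt((1-n/N)*sehatsq/(n*mean(x)^2));   # assumed variant: xbar in the denominator
  est=c(ratio=Bhat,mean=Bhat*xbarU,total=Bhat*N*xbarU);
  se=c(ratio=se.B,mean=xbarU*se.B,total=N*xbarU*se.B);
  data.frame(estimate=est,se=se,lower=est-1.96*se,upper=est+1.96*se);
}
# Hypothetical sample of n = 5 from a population of N = 100 with xbarU = 10
y=c(12,25,31,18,40);
x=c(5,11,14,8,19);
out=ratio.summary(y,x,N=100,xbarU=10);
out;
```

The three rows differ only by the known constants $\bar{x}_U$ and $t_x = N\bar{x}_U$, which is why a single $SE(\hat{B})$ is enough to produce all three standard errors.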
III. Comparison of the sample mean to the ratio estimate of the mean
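We now compare the sample mean estimate $\bar{y}$ of the population mean $\bar{y}_U$ to the ratio estimator, $\hat{\bar{y}}_r$. Let MSE denote the mean squared error, i.e., $MSE(\hat{\theta}) = E[(\hat{\theta}-\theta)^2]$. The mean squared error is a good measure for comparing the accuracy and desirability of estimators.
Define the population correlation coefficient to be
$$R = \frac{\sum_{i=1}^{N}(x_i-\bar{x}_U)(y_i-\bar{y}_U)}{(N-1)S_xS_y},$$
where $S_x, S_y$ are the population standard deviations of $x$ and $y$.
Using a Taylor expansion approximation, we have that
$$MSE(\hat{\bar{y}}_r) \approx \left(1-\frac{n}{N}\right)\frac{1}{n}\left(S_y^2 - 2BRS_xS_y + B^2S_x^2\right), \qquad MSE(\bar{y}) = \left(1-\frac{n}{N}\right)\frac{S_y^2}{n}.$$
Hence $MSE(\hat{\bar{y}}_r) \le MSE(\bar{y})$ approximately when $B^2S_x^2 \le 2BRS_xS_y$, i.e., (for $B > 0$) when
$$R \ge \frac{BS_x}{2S_y} = \frac{CV(x)}{2\,CV(y)},$$
where $CV(x) = S_x/\bar{x}_U$ and $CV(y) = S_y/\bar{y}_U$ are the coefficients of variation.
If the coefficients of variation are approximately equal, then it pays to use ratio estimation when the correlation between $x$ and $y$ is larger than 0.5.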
The same criterion applies for estimating the population total.
For the hospital data,
> cor(beds,discharges)
[1] 0.9109203
> ratio=discharges.population.mean/beds.population.mean
> ratio
[1] 2.964085
> ratio*sd(beds)/(2*sd(discharges))
[1] 0.5359784
Thus, $R = 0.911 > BS_x/(2S_y) = 0.536$ and it pays to use ratio estimation.
In practice, we wouldn't know $R$, $B$, $S_x$ and $S_y$, but we can estimate them from the sample to decide whether to use ratio estimation.
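As a consistency check on the hospital example, the Taylor approximation can be compared with the simulation study. The sketch below assumes the approximation $MSE(\hat{\bar{y}}_r)/MSE(\bar{y}) \approx 1 - 2Rk + k^2$ with $k = BS_x/S_y$, and plugs in only numbers already reported in these notes; the variable names are illustrative.

```r
# Values reported above for the hospital population
R=0.9109203;       # cor(beds,discharges)
crit=0.5359784;    # B*S_x/(2*S_y)
# Assumed Taylor approximation: MSE(ratio)/MSE(sample mean) = 1 - 2*R*k + k^2, k = B*S_x/S_y
k=2*crit;
predicted=1-2*R*k+k^2;
# Ratio of the simulated mean squared errors (n = 64, 10000 samples)
simulated=904.0052/4482.136;
c(predicted=predicted,simulated=simulated);
```

Both ratios come out at about 0.20, so the approximation and the simulation agree that the ratio estimator has roughly one fifth the mean squared error of the sample mean here.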
Example: One of the main uses of ratio estimation is in the updating of information across time. A simple example of this can be seen in the way agricultural crop forecasters can use a sample of current data to update complete crop reports from earlier years. The crop used in this example is sugarcane, an important economic crop for the four states of Florida, Hawaii, Louisiana and Texas and grown in a total of about 32 counties from across those states. Our goal is to estimate the mean number of sugarcane acres harvested in these 32 counties. Suppose we are near the end of 1999 and do not have complete data on the sugarcane crop from that year from all counties. We do, however, have complete data for all counties for the year 1997. In addition, we have the resources to collect preliminary information from six sample counties. The table below shows the acres harvested for sugarcane in the six sampled counties.
State / County / 1999 Acreage / 1997 Acreage
FL / Hendry / 57,000 / 54,400
HI / Kauai / 13,900 / 12,300
LA / Saint Landry / 15,500 / 9,100
LA / Calcasieu / 3,900 / 1,700
LA / Iberia / 59,900 / 57,200
TX / Cameron / 10,400 / 12,900
By checking the complete records for 1997, we can find that the 1997 average acres harvested per county, across all 32 counties, was 27,752 acres. Use these data to estimate the mean acreage for sugarcane across all 32 counties for 1999 and calculate an approximate 95% confidence interval.
Solution: Let $x_i$ be the acreage harvested in 1997 and $y_i$ be the acreage harvested in 1999. The plot of the sample data shows a strong, positive trend in the relationship between the acreage values for the 2 years. This bodes well for ratio estimation.
# Acreage of sugarcane harvested data
acreage.1997=c(54400,12300,9100,1700,57200,12900);
acreage.1999=c(57000,13900,15500,3900,59900,10400);
plot(acreage.1997,acreage.1999,xlab="Acreage, 1997",ylab="Acreage, 1999",main="Sugarcane acreage in 1997 versus 1999");
# Use criterion R>=(B*S_x)/(2*S_y) to decide whether to use ratio estimation
ratioest=mean(acreage.1999)/mean(acreage.1997);
[1] 1.088076
sxest=sd(acreage.1997);
syest=sd(acreage.1999);
ratioest*sxest/(2*syest);
[1] 0.5359384
Rest=cor(acreage.1999,acreage.1997);
[1] 0.9934721
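The ratio estimator appears much better than the sample mean since $\hat{R} = 0.993$ is well above $\hat{B}s_x/(2s_y) = 0.536$.
The ratio estimate of the mean of acreage harvested in 1999 is
$\hat{\bar{y}}_r = \hat{B}\bar{x}_U = 1.088076 \times 27752 \approx 30196$ acres.
For calculating the standard error of the ratio estimate, we need to calculate
$s_e^2 = \frac{1}{n-1}\sum_{i \in S}(y_i - \hat{B}x_i)^2$.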
> sehatsq=sum((acreage.1999-ratioest*acreage.1997)^2)/5;
> sehatsq
[1] 11860709
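Then we have
$SE(\hat{\bar{y}}_r) = \sqrt{\left(1-\frac{n}{N}\right)\left(\frac{\bar{x}_U}{\bar{x}}\right)^2\frac{s_e^2}{n}} = \sqrt{\left(1-\frac{6}{32}\right)\left(\frac{27752}{24600}\right)^2\frac{11860709}{6}} \approx 1429.7$.
Thus, an approximate 95% confidence interval for $\bar{y}_U$ is
$30196 \pm 1.96(1429.7) = 30196 \pm 2802$, i.e., roughly (27394, 32998).
Note that because the sample size is small, it would be a better approximation to use the 0.975 quantile of the $t$ distribution with $n-1 = 5$ degrees of freedom rather than 1.96 in the confidence interval.
The 0.975 quantile of the t-distribution with 5 degrees of freedom is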
> qt(.975,5)
[1] 2.570582
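so the approximate 95% confidence interval is
$\hat{\bar{y}}_r \pm 2.5706\,SE(\hat{\bar{y}}_r) = 30196 \pm 2.5706(1429.7) = 30196 \pm 3675$, i.e., roughly (26521, 33871).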
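The whole sugarcane calculation can be run end to end in R. This is a sketch assuming the variance formula $(1-n/N)(\bar{x}_U/\bar{x})^2 s_e^2/n$ for the ratio estimate of the mean; the variable names beyond those already used in the notes (ybar.ratio, se.ratio, ci.normal, ci.t, se.samplemean) are illustrative.

```r
# Sugarcane data: sample of n = 6 counties out of N = 32
acreage.1997=c(54400,12300,9100,1700,57200,12900);
acreage.1999=c(57000,13900,15500,3900,59900,10400);
N=32; n=length(acreage.1999);
xbarU=27752;                                   # known 1997 mean over all 32 counties
ratioest=mean(acreage.1999)/mean(acreage.1997);
ybar.ratio=ratioest*xbarU;                     # ratio estimate of the 1999 mean
sehatsq=sum((acreage.1999-ratioest*acreage.1997)^2)/(n-1);
# Assumed variance formula: (1-n/N)*(xbarU/xbar)^2*s_e^2/n
se.ratio=sqrt((1-n/N)*(xbarU/mean(acreage.1997))^2*sehatsq/n);
ci.normal=ybar.ratio+c(-1,1)*1.96*se.ratio;         # normal quantile
ci.t=ybar.ratio+c(-1,1)*qt(.975,n-1)*se.ratio;      # t quantile, 5 degrees of freedom
# For comparison: standard error of the plain sample mean of 1999 acreage
se.samplemean=sqrt((1-n/N)*var(acreage.1999)/n);
c(ybar.ratio=ybar.ratio,se.ratio=se.ratio,se.samplemean=se.samplemean);
```

Comparing se.ratio with se.samplemean shows how much the 1997 information tightens the estimate in this sample.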