Corrections from Earlier Notes

Stat 475 Notes 4

Reading: Lohr, Chapter 3.1

Corrections from earlier notes:

Notes 2: Bottom of page 10, top of page 11, the true standard deviation of the sample mean should be

truesd.samplemean=((sd(dioxin)/sqrt(50))*sqrt(1-50/646))

# true SD of sample mean

> truesd.samplemean

[1] 0.3589683

Notes 2: Equation (1.1) on page 15 should be:

(1.1)

Chart of minimum sample size needed for margins of error for a proportion, assuming the worst case that the true proportion is 0.5

Margin of error / Sample size
0.01 / 9604
0.02 / 2401
0.03 / 1068
0.04 / 601
0.05 / 385
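
The chart can be recomputed with a short R sketch. It assumes the usual large-sample formula $n = z^2 p(1-p)/e^2$ with $p = 0.5$, no finite population correction, and rounding up to the next whole unit; whether this is exactly the form of equation (1.1) is an assumption, and `min.sample.size` is a hypothetical helper name.

```r
# Minimum sample size for estimating a proportion to within a margin of
# error moe, assuming n = z^2 * p * (1 - p) / moe^2 (no finite population
# correction). p = 0.5 is the worst case.
min.sample.size = function(moe, p = 0.5, conf.level = 0.95) {
  z = qnorm(1 - (1 - conf.level) / 2)   # about 1.96 for 95% confidence
  ceiling(z^2 * p * (1 - p) / moe^2)    # round up to a whole unit
}

moe = c(0.01, 0.02, 0.03, 0.04, 0.05)
data.frame(margin.of.error = moe, sample.size = min.sample.size(moe))
```

Because the required sample size is proportional to $1/e^2$, halving the margin of error roughly quadruples it.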

I. Ratio Estimation: Motivating examples

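Suppose that for each member of the population, two variables are measured: $x_i$ and $y_i$.

Ratio Estimation: Estimation of $B = \bar{y}_U / \bar{x}_U = t_y / t_x$, where $\bar{y}_U, \bar{x}_U$ are the population means and $t_y, t_x$ are the population totals of $y$ and $x$.

The natural estimator of $B$ from a simple random sample is $\hat{B} = \bar{y} / \bar{x}$, the ratio of the sample means.

Motivation 1: We are directly interested in estimating $B$, i.e., the ratio of the population mean of $y$ to that of $x$, or equivalently the ratio of the population total of $y$ to that of $x$.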

Examples:

(1) In a survey of households, $y_i$ is total monthly food budget, $x_i$ is total monthly expenditures and $B$ is the proportion of expenditures that are spent on food.

(2) In a survey of farms, $y_i$ is acres of wheat planted, $x_i$ is total acreage and $B$ is the proportion of acres planted that are wheat.

Motivation 2: We are interested in the population total of $y$, $t_y$, but the population size $N$ is unknown. Thus, we cannot estimate $t_y$ by $N\bar{y}$. However, suppose that the population total of $x$, $t_x$, is known. Note that $t_y = B t_x$. Then we can estimate $t_y$ by $\hat{t}_{yr} = \hat{B} t_x$.

Example:

The wholesale price paid for oranges in large shipments is based on the sugar content of the load. The exact sugar content cannot be determined prior to the purchase and extraction of the juice from the entire load; however, it can be estimated. One approach is to estimate the mean sugar content per orange from a sample of oranges and then multiply by the number of oranges in the load. Unfortunately, this method is not feasible because it is too time consuming and costly to determine $N$ (that is, to count the total number of oranges in the load).

We can instead use the ratio estimation method with $y_i$ being the sugar content and $x_i$ being the weight of an orange. It is easy to find $t_x$ by weighing the shipment of oranges. Then we can estimate $t_y$ by taking a sample of oranges, finding the ratio of the sample mean of the sugar content of the oranges to the sample mean of the weight of the oranges, and then multiplying this ratio by $t_x$.
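
A minimal R sketch of this calculation follows. The five sampled oranges and the shipment weight are hypothetical values made up for illustration, not data from the notes.

```r
# Hypothetical sample of 5 oranges (illustrative values only)
sugar  = c(0.020, 0.030, 0.025, 0.035, 0.040)  # y_i: sugar content, pounds
weight = c(0.40, 0.50, 0.45, 0.55, 0.60)       # x_i: weight, pounds
t.x = 2000                                     # assumed total shipment weight, pounds

B.hat = mean(sugar) / mean(weight)  # estimated pounds of sugar per pound of orange
t.y.hat = B.hat * t.x               # ratio estimate of total sugar in the load
t.y.hat
```

The number of oranges in the load never enters the calculation; only the total weight of the shipment is needed.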

Motivation 3: We are interested in the population mean of $y$, $\bar{y}_U$, but know $\bar{x}_U$ and would like to use $x$ to improve our estimate by estimating $\bar{y}_U$ by $\hat{\bar{y}}_r = \hat{B}\bar{x}_U$ instead of $\bar{y}$.

Example: We will consider a population of short stay hospitals from Herson (1976). The data are in hospitals.txt. Let $y_i$ denote the number of patients discharged from the $i$th hospital in January 1968. We are interested in $\bar{y}_U$. Without doing any sampling, for each hospital $i$, we have available the number of beds $x_i$ in the hospital and know $\bar{x}_U$.

The idea behind using the ratio estimator $\hat{\bar{y}}_r$ of $\bar{y}_U$ instead of using $\bar{y}$ is the following: We expect $x$ and $y$ to be closely related in the population, since a hospital with a large number of beds should tend to have a large number of discharges. A scatterplot of $y$ versus $x$ in the population lets us check this:

R code:

hospitaldata=read.table("hospitals.txt",header=TRUE);

discharges=hospitaldata$discharges;

beds=hospitaldata$beds;

plot(beds,discharges);

Because of the close relationship between $x$ and $y$, if the sample underestimates the mean number of beds, i.e., $\bar{x} < \bar{x}_U$, then the sample probably also underestimates the mean number of discharges. The ratio estimate multiplies $\bar{y}$ by $\bar{x}_U / \bar{x}$, which gives a better estimate than $\bar{y}$ as long as $y$ is closely related to $x$.

The following is a simulation study comparing $\bar{y}$ to $\hat{\bar{y}}_r$ for the hospital data for samples of size 64.

# Simulation study comparison of usual sample mean to ratio estimator for

# mean number of discharges in hospital data

nosims=10000;

samplemeanvec=rep(0,nosims);

ratioestvec=rep(0,nosims);

N=393;

n=64;

beds.population.mean=mean(beds);

discharges.population.mean=mean(discharges);

discharges.population.mean;

> discharges.population.mean

[1] 814.603

for(i in 1:nosims){

tempsample=sample(N,n,replace=FALSE);

discharges.sample=discharges[tempsample];

beds.sample=beds[tempsample];

samplemeanvec[i]=mean(discharges.sample);

ratioestvec[i]=(mean(discharges.sample)/mean(beds.sample))*beds.population.mean;

}

# Bias of two estimators

bias.samplemean=mean(samplemeanvec)-discharges.population.mean;

> bias.samplemean

[1] -0.4465956

bias.ratioest=mean(ratioestvec)-discharges.population.mean;

> bias.ratioest

[1] 0.7369439

# Mean squared error of two estimators

mse.samplemean=mean((samplemeanvec-discharges.population.mean)^2);

> mse.samplemean

[1] 4482.136

mse.ratioest=mean((ratioestvec-discharges.population.mean)^2);

> mse.ratioest

[1] 904.0052

# Histograms of the two estimators

par(mfrow=c(2,1));

hist(samplemeanvec,xlim=c(500,1100));

hist(ratioestvec,xlim=c(500,1100));

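The ratio estimator is dramatically better than the sample mean for this population: its simulated mean squared error (about 904) is roughly one fifth that of the sample mean (about 4482), while the simulated bias of both estimators is less than one discharge in absolute value.

We will study below the general conditions under which $\hat{\bar{y}}_r$ is better than $\bar{y}$.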

II. Standard Errors of Ratio Estimators

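Unlike the sample mean, the ratio estimator is biased. Using a Taylor expansion,

$$E[\hat{\bar{y}}_r] - \bar{y}_U \approx \frac{1}{\bar{x}_U}\left[B\,V(\bar{x}) - \mathrm{Cov}(\bar{x}, \bar{y})\right].$$

Note that for a large sample size and a small sampling fraction $n/N$, we have $V(\bar{x}) \approx S_x^2/n$ and $|\mathrm{Cov}(\bar{x}, \bar{y})| \le \sqrt{V(\bar{x})\,V(\bar{y})} \approx S_x S_y/n$ (where the inequality follows from the Cauchy-Schwarz inequality), so that the bias is of order $1/n$ and therefore small.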
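
The Taylor expansion step can be sketched as follows for $\hat{B} = \bar{y}/\bar{x}$ under simple random sampling without replacement. The notation is assumed to follow Lohr, with $S_x, S_y$ the population standard deviations and $R$ the population correlation of $x$ and $y$. Multiplying the result by $\bar{x}_U$ gives the approximate bias of $\hat{\bar{y}}_r = \hat{B}\bar{x}_U$.

```latex
\begin{align*}
\hat{B} - B
  &= \frac{\bar{y} - B\bar{x}}{\bar{x}}
   = \frac{\bar{y} - B\bar{x}}{\bar{x}_U}\left(1 - \frac{\bar{x} - \bar{x}_U}{\bar{x}}\right)
   \approx \frac{\bar{y} - B\bar{x}}{\bar{x}_U}
     - \frac{(\bar{y} - B\bar{x})(\bar{x} - \bar{x}_U)}{\bar{x}_U^{2}}, \\
E[\hat{B} - B]
  &\approx \frac{1}{\bar{x}_U^{2}}\left[B\,V(\bar{x}) - \mathrm{Cov}(\bar{x}, \bar{y})\right]
   = \left(1 - \frac{n}{N}\right)\frac{1}{n\,\bar{x}_U^{2}}\left(B S_x^{2} - R\,S_x S_y\right).
\end{align*}
```

The first-order term has expectation zero because $E[\bar{y} - B\bar{x}] = \bar{y}_U - B\bar{x}_U = 0$, so the bias comes entirely from the second-order term.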

Using a Taylor expansion, we have the following formulas for estimates of the standard deviations of various ratio estimators (these are not unbiased estimators, but for a large sample size and a small sampling fraction $n/N$ they are accurate):

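For estimating a population ratio $B$ (Lohr, pg. 61, 68):

$$SE(\hat{B}) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n\,\bar{x}_U^2}}, \qquad s_e^2 = \frac{1}{n-1}\sum_{i \in S}(y_i - \hat{B}x_i)^2,$$

where the sum is over the units in the sample $S$.

For estimating a population total $t_y$ (Lohr, pg. 61, 68):

$$SE(\hat{t}_{yr}) = t_x\,SE(\hat{B})$$

For estimating a population mean $\bar{y}_U$ (Lohr, pg. 61, 68):

$$SE(\hat{\bar{y}}_r) = \bar{x}_U\,SE(\hat{B}) = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s_e^2}{n}}$$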

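Using a central limit theorem and the Delta method, an approximate 95% confidence interval for a population quantity of interest is the estimate plus or minus 1.96 standard errors (e.g., for $\bar{y}_U$, the interval is $\hat{\bar{y}}_r \pm 1.96\,SE(\hat{\bar{y}}_r)$).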
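
These standard errors and the confidence interval can be collected in a small R helper. This is a sketch under the assumption that the standard error of $\hat{B}$ is $\sqrt{(1 - n/N)\,s_e^2/(n\,\bar{x}_U^2)}$ with $s_e^2 = \sum_{i \in S}(y_i - \hat{B}x_i)^2/(n-1)$; the function name `ratio.estimates` and the three-point sample are hypothetical.

```r
# Ratio estimates and standard errors from a simple random sample.
# x, y: sampled values; N: population size; xbar.U: known population mean of x.
ratio.estimates = function(x, y, N, xbar.U, z = 1.96) {
  n = length(y)
  B.hat = mean(y) / mean(x)                           # estimated ratio
  s.e.sq = sum((y - B.hat * x)^2) / (n - 1)           # residual variance s_e^2
  se.B = sqrt((1 - n / N) * s.e.sq / (n * xbar.U^2))  # SE of B.hat
  ybar.r = B.hat * xbar.U                             # ratio estimate of the mean
  se.ybar.r = xbar.U * se.B
  list(B.hat = B.hat, se.B = se.B,
       ybar.r = ybar.r, se.ybar.r = se.ybar.r,
       t.yr = N * ybar.r, se.t.yr = N * se.ybar.r,    # uses t_x = N * xbar.U
       ci.ybar.r = ybar.r + c(-1, 1) * z * se.ybar.r) # approximate 95% interval
}

# Hypothetical sample of size 3 from a population of N = 30 with xbar.U = 2
out = ratio.estimates(x = c(1, 2, 3), y = c(2, 3, 7), N = 30, xbar.U = 2)
out$ybar.r
out$se.ybar.r
```

The total is computed through $t_x = N\bar{x}_U$ here; when $N$ is unknown but $t_x$ is known (Motivation 2), $\hat{B}\,t_x$ and $t_x\,SE(\hat{B})$ can be used directly.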

III. Comparison of the sample mean to the ratio estimate of the mean

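We now compare the sample mean estimate $\bar{y}$ of the population mean $\bar{y}_U$ to the ratio estimator, $\hat{\bar{y}}_r$. Let $MSE$ denote the mean squared error, i.e., $MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$. The mean squared error is a good measure for comparing the accuracy and desirability of estimators.

Define the population correlation coefficient to be

$$R = \frac{\sum_{i=1}^{N}(x_i - \bar{x}_U)(y_i - \bar{y}_U)}{(N-1)\,S_x S_y}$$

where $S_x, S_y$ are the population standard deviations of $x$ and $y$.

Using a Taylor expansion approximation, we have that

$$MSE(\hat{\bar{y}}_r) \approx \left(1 - \frac{n}{N}\right)\frac{1}{n}\left(S_y^2 - 2BRS_xS_y + B^2S_x^2\right),$$

while $MSE(\bar{y}) = V(\bar{y}) = (1 - n/N)\,S_y^2/n$. Hence, for $B > 0$, $MSE(\hat{\bar{y}}_r) \le MSE(\bar{y})$ if and only if

$$R \ge \frac{BS_x}{2S_y} = \frac{CV(x)}{2\,CV(y)},$$

where $CV(x) = S_x/\bar{x}_U$ and $CV(y) = S_y/\bar{y}_U$ are the coefficients of variation.

If the coefficients of variation are approximately equal, then it pays to use ratio estimation when the correlation between $x$ and $y$ is larger than 0.5.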

The same criterion applies for estimating the population total.
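
The algebra behind the criterion can be sketched as follows, assuming the Taylor approximation $MSE(\hat{\bar{y}}_r) \approx (1 - n/N)(S_y^2 - 2BRS_xS_y + B^2S_x^2)/n$ and $V(\bar{y}) = (1 - n/N)S_y^2/n$, with positive population means so that $B > 0$.

```latex
\begin{align*}
MSE(\hat{\bar{y}}_r) - V(\bar{y})
  &\approx \left(1 - \frac{n}{N}\right)\frac{1}{n}\left(B^{2}S_x^{2} - 2BR\,S_xS_y\right)
   = \left(1 - \frac{n}{N}\right)\frac{B S_x}{n}\left(B S_x - 2R\,S_y\right), \\
MSE(\hat{\bar{y}}_r) \le V(\bar{y})
  &\iff R \ge \frac{B S_x}{2 S_y}
   = \frac{S_x/\bar{x}_U}{2\,S_y/\bar{y}_U}
   = \frac{CV(x)}{2\,CV(y)}.
\end{align*}
```

For the total, both mean squared errors are multiplied by $N^2$, which leaves the comparison unchanged.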

For the hospital data,

> cor(beds,discharges)

[1] 0.9109203

> ratio=discharges.population.mean/beds.population.mean

> ratio

[1] 2.964085

> ratio*sd(beds)/(2*sd(discharges))

[1] 0.5359784

Thus, $R = 0.911 > 0.536 = BS_x/(2S_y)$ and it pays to use ratio estimation.

In practice, we would not know $R$ or $BS_x/(2S_y)$, but we can estimate both from the sample to decide whether to use ratio estimation.

Example: One of the main uses of ratio estimation is in the updating of information across time. A simple example of this can be seen in the way agricultural crop forecasters can use a sample of current data to update complete crop reports from earlier years. The crop used in this example is sugarcane, an important economic crop for the four states of Florida, Hawaii, Louisiana and Texas and grown in a total of about 32 counties from across those states. Our goal is to estimate the mean number of sugarcane acres harvested in these 32 counties. Suppose we are near the end of 1999 and do not have complete data on the sugarcane crop from that year from all counties. We do, however, have complete data for all counties for the year 1997. In addition, we have the resources to collect preliminary information from six sample counties. The table below shows the acres harvested for sugarcane in the six sampled counties.

State / County / 1999 Acreage / 1997 Acreage
FL / Hendry / 57,000 / 54,400
HI / Kauai / 13,900 / 12,300
LA / Saint Landry / 15,500 / 9,100
LA / Calcasieu / 3,900 / 1,700
LA / Iberia / 59,900 / 57,200
TX / Cameron / 10,400 / 12,900

By checking the complete records for 1997, we can find that the 1997 average acres harvested per county, across all 32 counties, was 27,752 acres. Use these data to estimate the mean acreage for sugarcane across all 32 counties for 1999 and calculate an approximate 95% confidence interval.

Solution: Let $x_i$ be the acreage harvested in 1997 and $y_i$ be the acreage harvested in 1999. The plot of the sample data shows a strong, positive trend in the relationship between the acreage values for the 2 years. This bodes well for ratio estimation.

# Acreage of sugarcane harvested data

acreage.1997=c(54400,12300,9100,1700,57200,12900);

acreage.1999=c(57000,13900,15500,3900,59900,10400);

plot(acreage.1997,acreage.1999,xlab="Acreage, 1997",ylab="Acreage, 1999",main="Sugarcane acreage in 1997 versus 1999");

# Use criterion R>=(B*S_x)/(2*S_y) to decide whether to use ratio estimation

ratioest=mean(acreage.1999)/mean(acreage.1997);

[1] 1.088076

sxest=sd(acreage.1997);

syest=sd(acreage.1999);

ratioest*sxest/(2*syest);
[1] 0.5359384

Rest=cor(acreage.1999,acreage.1997);

[1] 0.9934721

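The ratio estimator appears much better than the sample mean since $\hat{R} = 0.993 > 0.536 = \hat{B}s_x/(2s_y)$.

The ratio estimate of the mean of acreage harvested in 1999 is $\hat{\bar{y}}_r = \hat{B}\bar{x}_U = 1.088076 \times 27{,}752 \approx 30{,}196$ acres.

For calculating the standard error of the ratio estimate, we need to calculate $s_e^2 = \frac{1}{n-1}\sum_{i \in S}(y_i - \hat{B}x_i)^2$: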

> sehatsq=sum((acreage.1999-ratioest*acreage.1997)^2)/5;

> sehatsq

[1] 11860709

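Then we have $SE(\hat{\bar{y}}_r) = \sqrt{(1 - 6/32)\,(11860709/6)} \approx 1267.3$.

Thus, an approximate 95% confidence interval for $\bar{y}_U$ is $30{,}196 \pm 1.96 \times 1267.3$, i.e., about $(27{,}712,\ 32{,}680)$ acres.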

Note that because the sample size is small, it would be a better approximation to use the 0.975 quantile of the $t$ distribution with $n - 1 = 5$ degrees of freedom rather than 1.96 in the confidence interval.

The 0.975 quantile of the t-distribution with 5 degrees of freedom is

> qt(.975,5)

[1] 2.570582

so the approximate 95% confidence interval is $30{,}196 \pm 2.5706 \times 1267.3$, i.e., about $(26{,}938,\ 33{,}454)$ acres.
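
The whole sugarcane calculation can be run end to end in R. This sketch assumes the standard error formula with the finite population correction, $\sqrt{(1 - n/N)\,s_e^2/n}$, and uses the acreage vectors from the R code above.

```r
# Sugarcane example, end to end
acreage.1997 = c(54400, 12300, 9100, 1700, 57200, 12900)   # x_i
acreage.1999 = c(57000, 13900, 15500, 3900, 59900, 10400)  # y_i
N = 32
n = length(acreage.1999)
xbar.U = 27752   # known 1997 mean acreage over all 32 counties

B.hat = mean(acreage.1999) / mean(acreage.1997)   # estimated ratio
ybar.r = B.hat * xbar.U                           # ratio estimate of the 1999 mean
s.e.sq = sum((acreage.1999 - B.hat * acreage.1997)^2) / (n - 1)
se.ybar.r = sqrt((1 - n / N) * s.e.sq / n)        # includes (1 - n/N)

# Intervals based on the normal quantile and on the t quantile with n - 1 df
ci.normal = ybar.r + c(-1, 1) * 1.96 * se.ybar.r
ci.t = ybar.r + c(-1, 1) * qt(0.975, n - 1) * se.ybar.r

round(c(estimate = ybar.r, se = se.ybar.r), 1)
round(rbind(normal = ci.normal, t = ci.t))
```

The $t$-based interval is wider than the normal-based one, reflecting the extra uncertainty from estimating $s_e^2$ with only six counties.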