You Can't Say Anything About Bias from One Sample

Stat 406 – Spring 2018

Exam 1 -- Answers

Short answer questions: 4 pts each part, except for all parts of question 3, which were 2 pts each.

1) Sampling milkweeds

a) What can you say (about bias) from these data? Briefly explain your choice. If you can't say anything about bias, explain why not.

You can't say anything about bias from one sample.

Notes: 1) Bias is a property of the average (theoretical or numerical) of estimates over many repeated samples. If a single estimate is far from the truth, it could be because of bias or because of large variability.

2) Most folks missed this point.
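To make note 1 concrete, here is a small simulation sketch. Everything in it is hypothetical (the population mean of 50, the sd of 10, the sample size, and the deliberately biased "shrunken" estimator); none of it comes from the milkweed data. It computes bias as the average estimate minus the truth over many repeated samples, and shows that one estimate by itself mixes bias with sampling variability.

```python
import random
import statistics

TRUE_MEAN = 50.0  # hypothetical population mean (an assumption, not exam data)


def repeated_estimates(estimator, n_samples=2000, n_obs=10, sd=10.0, seed=1):
    """Draw many independent samples and return the estimate from each one."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_samples):
        sample = [rng.gauss(TRUE_MEAN, sd) for _ in range(n_obs)]
        estimates.append(estimator(sample))
    return estimates


def sample_mean(sample):
    return statistics.fmean(sample)


def shrunken_mean(sample):
    # Deliberately biased estimator: pulls every estimate 10% toward zero.
    return 0.9 * statistics.fmean(sample)


unbiased = repeated_estimates(sample_mean)
biased = repeated_estimates(shrunken_mean)

# Bias is judged from the average over many repeated samples.
bias_unbiased = statistics.fmean(unbiased) - TRUE_MEAN
bias_biased = statistics.fmean(biased) - TRUE_MEAN

# A single estimate can miss the truth even when the estimator is unbiased.
single_error = unbiased[0] - TRUE_MEAN

print("bias of sample mean:   ", round(bias_unbiased, 3))
print("bias of shrunken mean: ", round(bias_biased, 3))
print("error of one estimate: ", round(single_error, 3))
```

The error of the first estimate tells you nothing about which estimator is biased; only the averages over the 2000 repeated samples do.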

b) Which combination (A, B, C, D, or E) is the least biased? Explain your choice. If you can’t tell from these data, explain why not.

C is the least biased. It has a mean that is closest to the truth.

Notes: The standard definition of bias (and the one I used to define bias in 406) is the difference between the true value (the parameter) and the average estimate (i.e. the mean). Sometimes folks talk about median unbiasedness, which is no difference between the parameter and the median. The standard deviation and the difference between the mean and median have nothing to do with bias. The sd tells you about precision; the difference between mean and median is one measure of skewness.

c) Which combination (A, B, C, D, or E) is the most precise? Explain your choice. If you can’t tell from these data, explain why not.

E is the most precise. It has the smallest standard deviation.

Notes: A lot of folks used the se to answer 1c (and so answered B). Names get confusing here. The variability in the estimates is measured by their standard deviation. When you have multiple estimates, e.g., from a simulation, you can calculate that sd directly. If you only have one sample and can make some assumptions, you can estimate that sd by the standard error, which incorporates the number of original observations. The number of simulations (used in the se’s in the table) is totally arbitrary. I can make a se smaller by increasing the number of simulations; that doesn’t make an estimator more precise. (And the sd will not change, on average, as the number of simulations increases.)
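The sd versus se distinction can be checked with a short simulation. This is only a sketch under assumed settings (standard normal data, 25 observations per sample, and 100 versus 10,000 simulations are all hypothetical choices). The sd of the estimates stays near its true value of 0.2 however many simulations are run, while the se of the simulation average shrinks.

```python
import math
import random
import statistics


def simulate_means(n_sims, n_obs=25, seed=0):
    """Return n_sims sample means, each from n_obs standard normal draws."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_sims):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        means.append(statistics.fmean(sample))
    return means


def summarize(n_sims):
    estimates = simulate_means(n_sims)
    sd = statistics.stdev(estimates)   # precision of the estimator
    se = sd / math.sqrt(n_sims)        # precision of the simulation average
    return sd, se


sd_small, se_small = summarize(100)
sd_large, se_large = summarize(10_000)

print("100 sims:    sd =", round(sd_small, 3), " se =", round(se_small, 4))
print("10,000 sims: sd =", round(sd_large, 3), " se =", round(se_large, 4))
```

Running more simulations changes how well you know the estimator's properties, not how precise the estimator is.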

2) Sampling stream width

a) Is it appropriate to estimate μ by the average of the measured stream widths? Briefly explain why or why not.

Yes, the sample average estimates the population average, even for a systematic sample, so long as there is a random start.

Note: Because there is spatial correlation, this is an 'ok but could be improved' situation.

b) Is it appropriate to calculate the precision of your estimate by the "usual" standard error of the mean: s/√n, where s is the sample standard deviation? Briefly explain why or why not.

No, the usual standard error is wrong for systematic samples with spatial correlation.

Note: You can't tell whether it is too small or too large; that depends on the sample spacing and the range of the correlation.
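Both parts of question 2 can be explored with a simulated stream. This is a sketch only: the AR(1) width process, its correlation of 0.95, the 1000 positions, and the spacing of 50 are all assumptions chosen for illustration, not the exam's data. For a fixed stream there are exactly 50 possible systematic samples (one per start), so the code enumerates them all.

```python
import math
import random
import statistics


def make_stream(n_points=1000, rho=0.95, seed=3):
    """Hypothetical stream widths with AR(1) spatial correlation."""
    rng = random.Random(seed)
    widths = []
    deviation = 0.0
    for _ in range(n_points):
        deviation = rho * deviation + rng.gauss(0.0, 1.0)
        widths.append(10.0 + deviation)
    return widths


def systematic_sample(values, spacing, start):
    """Every spacing-th value, beginning at position start."""
    return values[start::spacing]


SPACING = 50
stream = make_stream()

# Enumerate every possible random start.
samples = [systematic_sample(stream, SPACING, s) for s in range(SPACING)]
sample_means = [statistics.fmean(s) for s in samples]
usual_ses = [statistics.stdev(s) / math.sqrt(len(s)) for s in samples]

population_mean = statistics.fmean(stream)
mean_of_estimates = statistics.fmean(sample_means)  # average over starts
actual_sd = statistics.pstdev(sample_means)         # true sd of the estimator
average_usual_se = statistics.fmean(usual_ses)      # what s/sqrt(n) reports

print("population mean:       ", round(population_mean, 4))
print("average of estimates:  ", round(mean_of_estimates, 4))
print("actual sd of estimates:", round(actual_sd, 4))
print("average usual se:      ", round(average_usual_se, 4))
```

The average of the estimates over all starts equals the population mean, which is the sense in which the random start makes the sample average appropriate. The last two numbers generally disagree, and changing `rho` or `SPACING` changes which one is larger.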

3) Distances in the Midwest and South.

a) Is it appropriate to measure the distance between points A and C using UTM coordinates in zone 15? Briefly explain why or why not.

Yes, both locations are in UTM zone 15.

b) Is it appropriate to measure the distance between points A and D using UTM coordinates in zone 15? Briefly explain why or why not.

No, D is in a different zone.

Note: I accepted a Yes with an explanation along the lines of 'but the distortion is not too great'

c) Is it appropriate to measure the distance between points A and B using great circle distance when the latitude and longitude for point A are from a GPS unit and that for point B are from an old USGS topo map using the NAD27 datum? Briefly explain why or why not.

No, GPS uses WGS84, which is a different datum from NAD27.

d) Is it appropriate to measure the distance between points A and D using great circle distance when the latitude and longitude for point A are from a GPS unit and that for point D are from a newer USGS topo map using the NAD83 datum? Briefly explain why or why not.

Yes, NAD83 was reconciled with WGS84, so you can compare the two.
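For reference, a great circle distance can be computed with the haversine formula. The sketch below assumes a spherical Earth with a mean radius of 6371 km, and the function name and test coordinates are hypothetical. The formula takes the latitudes and longitudes at face value, which is why both points need to be on the same (or a compatible) datum before you use it.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in decimal degrees.

    Both points are assumed to be referenced to the same datum.
    """
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Sanity check: a quarter of the equator is one quarter of the circumference.
quarter_equator = great_circle_km(0.0, 0.0, 0.0, 90.0)
print(round(quarter_equator, 1))
```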

4) Randomization test for new statistic.

a) Based on the 20 simulations reported above, what is the one-sided p-value? If you can’t compute this without further information, what information do you need?

p = 2/21 = 0.095.

Notes: One-sided, so only count simulated values larger than 5.82. This is a sample of the possible permutations, so the p-value has +1 added to both numerator and denominator.
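The arithmetic can be written as a short function. The 20 simulated values below are made up for illustration (the exam's actual values are not reproduced here); they are chosen so that exactly one of them exceeds the observed 5.82, which matches the count implied by the answer.

```python
def one_sided_p_value(observed, simulated):
    """Randomization p-value from a sample of permutations.

    Counts simulated statistics at least as large as the observed one,
    then adds 1 to numerator and denominator for the observed data set.
    """
    n_as_large = sum(1 for value in simulated if value >= observed)
    return (n_as_large + 1) / (len(simulated) + 1)


observed_stat = 5.82

# Hypothetical simulated statistics: 20 values, one larger than 5.82.
simulated_stats = [
    3.1, 3.4, 3.6, 3.7, 3.9, 4.0, 4.1, 4.2, 4.3, 4.4,
    4.5, 4.6, 4.8, 4.9, 5.0, 5.2, 5.3, 5.5, 5.7, 6.4,
]

p_value = one_sided_p_value(observed_stat, simulated_stats)
print(round(p_value, 3))  # 2/21, about 0.095
```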

b) For this data set and test statistic, is it reasonable to use a normal approximation to compute the p-value? Briefly explain why, why not, or why you can’t tell.

No. The distribution of randomized test statistics is not symmetrical around 4.

Note: Quite a few folks looked at the number of simulations (20) and said yes, N ≥ 20, so a normal approximation is appropriate. That guideline is based on the number of regions (raw data), not the number of simulations, and is only an approximate guide. When N ≥ 20, the distribution of Moran’s I (or Geary’s c) is approximately normal. You don’t know whether that holds for this statistic in general. You do have the randomization distribution of the statistic for this one data set, and that is clearly not normal.

5) Spatial patterns in central Iowa.

a) When you do a Moran’s test, the estimated I = -0.066 and the two-sided p-value for the test of no spatial correlation is 0.68. Based on this information, what can you conclude about the spatial correlation of your response?

No evidence of spatial correlation.

Note: remember accepting the null does not mean there is no correlation. You just don’t have any evidence of it.
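As a reminder of what the global statistic measures, here is a minimal sketch of Moran's I for a small set of regions. The four-region chain, its binary neighbor weights, and the two sets of values are hypothetical. Under no spatial correlation the expected value of I is -1/(n-1), which is slightly negative, so a small negative estimate is not by itself a sign of negative correlation.

```python
def morans_i(values, weights):
    """Global Moran's I for values with a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    sum_sq = sum(d * d for d in deviations)
    return (n / s0) * (cross / sum_sq)


# Four regions in a line; each region neighbors the adjacent ones.
chain_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

i_gradient = morans_i([1.0, 2.0, 3.0, 4.0], chain_weights)     # similar neighbors
i_alternating = morans_i([1.0, 4.0, 1.0, 4.0], chain_weights)  # dissimilar neighbors
print(i_gradient, i_alternating)
```

The smooth gradient gives a positive I and the alternating pattern gives a negative I; a test (normal approximation or Monte Carlo) is still needed to judge either against the null.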

b) You decide to do a local Moran’s I analysis.

Based on the local Moran’s I analysis, what can you say about the spatial correlation of your response?

There is strong positive correlation in two regions and strong negative correlation in the middle of the study area.

c) You decide to smooth the data, using the Fay-Herriot method discussed in class. Two observations are numbered in red (1 and 2). These have very similar raw values but quite different smoothed values. Why are their smoothed values so different?

They have different observation variances.

Notes: Point 1 is smoothed more, so it had the higher variance. Quite a few folks answered in terms of the neighboring values, which is correct for other smoothing methods, but is not the major effect for the “simple” Fay-Herriot that we talked about. That got partial credit.
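The effect of the observation variance can be seen in the shrinkage weight. The sketch below assumes the simple area-level form, in which the smoothed value is a weighted average of the raw value and the model mean with weight = (between-area variance) / (between-area variance + observation variance). All numbers are hypothetical, and the variances are treated as known rather than estimated.

```python
def fay_herriot_smooth(raw_value, model_mean, obs_variance, area_variance):
    """Shrink a raw value toward the model mean.

    The weight on the raw value falls as its observation variance rises,
    so noisier observations are smoothed more.
    """
    weight = area_variance / (area_variance + obs_variance)
    return weight * raw_value + (1.0 - weight) * model_mean


MODEL_MEAN = 5.0     # hypothetical fitted mean
AREA_VARIANCE = 1.0  # hypothetical between-area variance

# Two points with the same raw value but different observation variances.
smoothed_1 = fay_herriot_smooth(10.0, MODEL_MEAN, obs_variance=4.0, area_variance=AREA_VARIANCE)
smoothed_2 = fay_herriot_smooth(10.0, MODEL_MEAN, obs_variance=0.25, area_variance=AREA_VARIANCE)

print(smoothed_1, smoothed_2)
```

The point with the larger observation variance ends up much closer to the model mean, even though both raw values are identical and no neighboring values enter the calculation.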

6) Predicted values from a SAR model

a) Describe the difference between two types of predicted value.

a is the prediction based only on the trend and other covariates. b includes spatial neighbors.

b) Evaluating the distribution of ν.

You would use residuals from b. Those residuals estimate ν.
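The two kinds of predicted value, and the residuals that estimate ν, can be sketched as follows. This assumes the error form of the SAR model, y = trend + u with u = λWu + ν; the exam does not say which form was fitted, so treat that as an assumption. The three-region example, its row-standardized weights, and λ = 0.5 are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sar_predictions(y, trend, weights, lam):
    """Return (trend-only, trend + neighbors, residuals from the latter).

    Assumes the SAR error model: y = trend + u, u = lam * W u + nu.
    """
    trend_resid = [yi - ti for yi, ti in zip(y, trend)]
    neighbor_part = [lam * wu for wu in mat_vec(weights, trend_resid)]
    trend_plus_neighbors = [t + nb for t, nb in zip(trend, neighbor_part)]
    # y minus the full prediction equals (I - lam W)(y - trend), i.e. nu.
    nu_hat = [yi - p for yi, p in zip(y, trend_plus_neighbors)]
    return list(trend), trend_plus_neighbors, nu_hat


# Three regions in a line with row-standardized weights (hypothetical).
W = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
y_obs = [2.0, 2.0, 5.0]
trend_fit = [1.0, 2.0, 3.0]

pred_a, pred_b, nu_hat = sar_predictions(y_obs, trend_fit, W, lam=0.5)
print(pred_a, pred_b, nu_hat)
```

Residuals from the trend-only prediction are the spatially correlated u; residuals from the trend + neighbors prediction are the ones that should look independent.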

7) Two SAR models

a) Which model gave me the estimate of 0.7?

Model A with a constant mean.

Note: quite a few folks turned this around. The data are the same for both models – the issue is what variability is considered to be spatially correlated (0.7) and what is modeled by the trend. When the trend is simple (a constant mean), everything goes into the spatial correlation. When the trend is sufficiently complicated to explain all the variability (as is the case here), there is no or almost no spatial correlation.

b) Briefly explain why the two estimates are so different.

Model A has no trend, so all the spatial variation is put into the random spatial component. It appears that all the spatial variation can be described by the quadratic trend because there is almost no random spatial dependence in model B.

Data analysis problem (36 pts):

8) Lip cancer in Scotland.

There are various approaches you could have taken for many of these questions. My R code is in scotland.r. These answers are what I would do, and what I believe is the most common way to answer the question. If I wrote OK on your answer, you gave me something else that was acceptable.

The investigators want to know the following about lip cancer patterns in Scotland:

When you ignore any trend, are log ratio values spatially correlated?

Yes, Moran's I test on logratio values (either normality or Monte-Carlo), p < 0.0001

If so, how strong is that correlation?

I = 0.49, moderate positive correlation

You consider four models. All have a N-S trend and one of four spatial dependence structures: independence, SAR with row-standardized weights, SAR with binary weights, or CAR with binary weights.

No answer needed

Which spatial dependence model is most appropriate for these data?

SAR with row-standardized weights. It has the smallest AIC, and no other model is within 2 units.

Is there a N-S trend in log ratio values?

Yes, p value for northkm = 0.0079

How much does the log ratio increase per km NS? What is the se of that slope?

0.0059 per km (se = 0.0022)

Does a N-S trend explain all the spatial dependence?

No, there is residual spatial correlation after accounting for trend.

Either of two possible approaches accepted:

look at the estimated lambda from the SAR model: 0.61 (p < 0.0001)

run Moran's I on residuals (based only on NS trend): I = 0.34, p < 0.0001

Note: if you try Moran's I on residuals that include neighbor effects, you get I = 0.036, p = 0.28. Those residuals have minimal spatial correlation because they are based on predicted values that include neighbors, so they account for the spatial dependence. Those residuals should be independent (see question 6b).

Plot a smoothed map of the log ratio using predictions from a model with N-S trend and spatial correlation.

This plot is based on the trend + spatial predictions from the SAR/row-std model. I accepted for full credit the Fay-Herriot predictions, so long as the model included northkm.

Notes: Maps using the independence model predictions or just the trend part of the SAR model ignore spatial correlation. Maps using Fay-Herriot with a constant mean (~1) ignore the trend.