Homework 4 Revised to Drop Q 4 Parts G,H,I

Stat 406 – Spring 2018

Homework 4 – revised to drop Q 4 parts g,h,i

Due 5 pm, Friday, 30 March, to my office or mailbox (1121 Snedecor)

1) This problem explores block kriging. It builds on the soil erosion data analyzed in HW 3. The motivation is to estimate the total soil erosion summed over the study region. We will focus on estimating the average across the study area (in units of T/ac/yr, which can easily be turned into a total (MT/yr) by multiplying by the study area and converting T (metric tons) to MT (mega tons). The x and y coordinates are in meters. An appropriate semivariogram is Spherical with a partial sill of 2.5, range of 150000 and nugget of 0.2. The data are in erode.txt. A dense (relatively) grid of locations is in erodegrid.txt. A very coarse grid of 4 locations is in erodegrid2.txt, and the centroid of the study area is the first location in erodegrid3.txt. (I wanted to provide just that one location, but krige() demands more than one prediction location).

a) Use ordinary kriging with the provided semivariogram to predict soil erosion at each location in the dense grid (erodegrid. txt). Your answer is the spatial plot of these erosion values.

R note: you can provide a specific variogram (instead of an estimated variogram) to krige() by using a vgm() specification as the variogram model, i.e. the fourth argument to krige(). So, krige( erode~1, erode.sp, erodegrid, vgm(2.5, 'Sph', 150000, 0.2) ) will make predictions using the specified fixed effects model , the data, the grid, and variogram model.

b) Estimate the average soil erosion across the study area by averaging the point predictions on the dense grid. What is that estimated average soil erosion?

c) Use block kriging to estimate average soil erosion in squares with size (31800, 31800) centered at each location on the dense grid. Averaging the block predictions on the dense grid to estimate the average soil erosion across the study area. What is that estimated average soil erosion?

R note: adding block=c(31800, 31800) will specify blocks of size 31800 x 31800 centered at each grid location. (2nd note: the documentation is unclear about this and mostly concerns a more flexible by more complicated way of specifying blocks. As you will see, this is close if not perfect.)

d) Plot the point predictions (from b) against the block predictions (from c) for each location on the dense grid. Are they similar? If not, why not?

e) Plot the variance of the point predictions (from b) against the variance of the block predictions (from c) for each location on the dense grid. Are they similar? If not, why not?

f,g) Now consider the very coarse grid (erodegrid2.txt locations).

f) There are a total of 4 locations here. Predict the soil erosion at each of the four locations, and average to estimate the average over the study area.

g) Use block kriging with blocks centered on the very coarse grid. Each block has size (127200, 270300). Average the block predictions to estimate the average over the study area.

h,i) Now consider the single location at the center of the study area (erodegrid3.txt locations)

h) Predict the soil erosion at the center of study area. What is that estimate?

R note: because krige() insists on having more than one prediction location (as far as I can tell), erodegrid3 has 5 locations: the first row is the center; the other four are the four very coarse grid locations. If you have predictions and their variances stored in pred, then pred[1,] gives you the information about the first location (the center of the study area).

i) Use block kriging with blocks centered on the very coarse grid. Each block has size (127200, 270300). What is the block prediction of the average over the study area.

j) You should find that the three block-prediction-based estimates are very similar but the three point-prediction-based estimates are somewhat different. Explain why this should be expected.

k) You should find that the estimate by averaging point predictions on a dense grid is the closest to every block-prediction-based estimate. Explain why this should be expected.

l) Block kriging gives you a something (in addition to a prediction) that is not available (at least in any simple way when you average point predictions, especially a dense grid of point predictions). What is that quantity?

2) This problem explores indicator kriging using the Jura Co data. Use the data from 259 locations in juraCoFull.txt. Imagine that a soil Co level of 10 ppm has been defined as a regulatory concern. Create an indicator variable that has the value of 1 when Co > 10 and 0 when Co ≤ 10. I have already estimated a semivariogram based on the indicator values. A reasonable semivariogram is Matern k = 1, with a partial sill of 0.2, a range of 0.30, and a nugget of 0.067. A grid of locations throughout the Jura is in juragrid (just like last week).

a) Use the data and the specified indicator semivariogram to predict the probability that Co > 10 at each location on juragrid. You answer is the spatial plot of that probability.

b) Use the data and the semivariogram from HW 3 (Sph with partial sill = 14.4, range of 1.22, and nugget of 0.33) to predict the Co concentration at each location on juragrid. Visually, is the spatial plot of the predicted concentration similar to the spatial plot of the prediction of P[Co > 10] from part a? What is similar (if anything) and what is different (if anything)?

c) It is tempting to think that there will be a one-to-one correspondence between the two plots (probabilities in a and values in b). If there were such a correspondence, then a predicted value of 10 (for example) would always be associated with one specific probability (e.g., 0.5). Use the predictions in part a and part b to see if there is such a one-to-one correspondence. Your answer is yes or no and the evidence that supports that answer.

Hint: You could answer this by considering what gets averaged to make the predictions in both parts. You could alternatively make a non-spatial plot of one prediction against the other.

3) This problem explores fitting linear models to spatial data. The data are simulated based on a spatial experiment I worked on early in my career. This was a study of competition between larval salamanders. Salamanders lay eggs in water, often temporary ponds. The larvae are aquatic. When they reach a certain size or age, they metamorphose into terrestrial adults. The lab where I used to work did extensive studies on amphibians. The Rainbow Bay study now has the record for the longest running daily study of amphibian dynamics: monitored every day for now almost 40 years https://discover.uga.edu/article/sus15-srel.

This question is modeled on an experimental study of competition between different species of Ambystoma. There are three common species A tigrinum, A. opacum,and A. talpoideum. This study evaluated the impact on A. talpoideum of the other two species. Salamanders are nice because you can raise them in mesocosms (medium-size experimental habitats). In this case, the mesocosms were kiddie wading pools 1.81 m (6 ft) in diameter. There were four experimental treatments: all combinations of species of competitor (A. tigrinum or A. opacum) and number of competitors (5 or 25). Each mesocosm has 10 A. tigrinum (the target species) and the appropriate number of the competitor species. The response is the average log-transformed size of A. tigrinum after 3 months growth. The presumption is that competitors will reduce the growth rate of A. tigrinum and hence the average size at 3 months. The question is whether the two competitor species are equally competitive (so the reduction in size is similar) or whether one is a stronger competitor than the other (so the reduction in size between 5 and 25 individuals is not the same). This can be quantified by fitting the linear model:

where Si indicates the species (0 for tigrinum, 1 for opacum) and Nj is the number of individuals (5 or 25). The two species are equally competitive when because then the slope relating size to number of individuals is the same for both species.

In pictures, this is whether A or B is more appropriate (plots on next page). A shows a plot where and B shows a plot where .

The mesocosm facility had 36 mesocosms (the kiddie pools). These were arranged in a 6 x 6 square. The study was conducted using a randomized complete block design. Nine blocks had 4 pools in 2 x 2 squares (so 3 rows each with 3 blocks going across the facility). The four treatments were randomly assigned to the four pools in a block. There is one of each treatment in a block (i.e. a complete block design). Blocks were used to control spatial variability. They were not used for any other purpose (so all blocks set up the same day and harvested the same day). The data are in amphib.csv. x and y give the location of the pool in the facility, block is the block number, spp is the species of competitor, n is the number of competitors, and size is the average size of A. tigrinum at 3 months.

To check the basic setup of your model for these data, if you fit the OLS model with blocks (as a factor variable called block.f) and a 2 way factorial treatments, i.e. lm(size ~ block.f + spp + n + spp:n, data=amphib), you should get an estimated interaction coefficient (for spp:n) of 0.1148 with a se of 0.0483, and a p-value of 0.0259.

For all models with spatially correlated errors, use an isotropic Exponential model without a nugget.

a) You suspect that the variability between plots various continuously over space, instead of being step changes at each block. So, you fit a model with spatially correlated errors. Omit blocks from the model (so just size ~ spp + n + spp:n for the model formula). What are the estimated interaction coefficient, its se, and p-value?

R Note: the quantities you need to report can be extracted directly from the fitted object. You do not need to use emmeans functions.

b) Now fit a model with blocks included as fixed effects (size ~ block.f + spp + n + spp:n) and spatially correlated errors. What are the estimated interaction coefficient, its se, and p-value?

c) The AIC for the "basic setup" model is 155.208. The default in R gls() is to use REML to estimate parameters . What are the AIC statistics for the models in parts a and b? Is it appropriate to choose a model by comparing these three AIC values to each other? Briefly explain why or why not.

4) The locations in tornado.csv are the touchdown locations for all recorded strong tornadoes in Iowa between 1950 and 2016. There are 161 tornadoes classified F3, F4 or F5 (the three strongest levels in the Fujita scale). The locations are locations of the “touchdown” point in UTM zone 15 coordinates. These data were extracted from TornadoHistoryProject.com. The points in ia.csv define the border of IA, again in UTM zone 15. There are 13 locations that are outside the state boundary. These will be ignored when you create the ppp object. That is fine. These are tornados that touched down outside the state and continued into the state. These are in the data base (because they were in the data base as “IA” tornados), but should be ignored. Our interest is in the spatial pattern of touchdown locations in the state of IA. Problem 5 will explore the intensity of touch-down locations. See the R notes at the end of this question for how to define the study window using an irregular polygon.

a) plot the locations. Your plot is your answer.

b) You want to know whether touch-down locations are clustered, randomly distributed, or spatially segregated. Use K(x) or L(x) to understand the spatial pattern. Describe in a sentence your interpretation of the spatial pattern (no hypothesis test needed or expected). Include a plot supporting your interpretation.

c) Use point-wise α=0.05 tests to determine whether the pattern shows clustering or spatial segregation. What do you think? Support your conclusion with an appropriate plot or test statistic(s).

d) Use a summary test over the interval from 1km to 50km to evaluate whether the data show some non-random spatial pattern. Report which test you performed, the p-value, and your interpretation of the results of this test.

e) You want to estimate the spatial scale of the pattern, i.e. if a tornado touches down at a location, how likely is that another one touched down 5km away, or 10km away, …? Which summary function, K(x), L(x), or g(x) is most appropriate to answer this? Briefly explain your choice.

f) Analyze the data using the method chosen in e). What do you conclude about the spatial scale of the pattern? Include an appropriate plot to support your conclusion.

5) We now turn to the intensity function (vis-à-vis point patterns, the average number of tornadoes in a small area, not tornado intensity).

a) What is the optimal bandwidth when using the Diggle-Berman MSE criterion? What about when using the point process likelihood criterion?

b) Plot the MSE criterion and the point process criterion for a range of bandwidths (I suggest from 5km to 50km). What do these plots tell you about the preciseness of the optimal bandwidth? In other words, for either method is there a clearly defined best bandwidth (precise) or is there a range of reasonable bandwidths (not precise). Include your plots with your answer.

c) Pick what you believe is a reasonable bandwidth and estimate the intensity of tornadoes over the state of Iowa. Your answer is your choice of bandwidth, a brief explanation of why you chose it, and a map of predicted intensity.

R notes:

Using an irregular polygon as the study window: A planar point pattern (ppp object) is created by the ppp() function. The arguments are the x coordinates of the events, the y coordinates of the events and something that defines the study window. In the example code, the study window is a rectangle, specified by owin( coordinates). Iowa is not a rectangle. A polygonal boundary, whether simple or complex, can be defined by coordinates along the boundary, moving in a counter-clockwise direction. Spatstat contributors are primarily Australian, Kiwi, Canadian, and British so the documentation says anti-clockwise. The coordinates I give you in ia.csv are in the correct order. You replace owin= in the ppp() call with poly=list(x = XXX, y= YYY), where XXX is the name of the vector of X coordinates along the boundary and YYY is the name of the vector of Y coordinates. The format allows very complicated geometries, e.g. disconnected areas that include interior holes.