Using GIS to generate spatially-balanced random survey designs for natural resource applications

David M. Theobald1, Don L. Stevens, Jr.2, Denis White3, N. Scott Urquhart4, Anthony R. Olsen3 and John R. Norman1

Running title: Spatially-balanced survey design using GIS

1Natural Resource Ecology Lab, Colorado State University, Fort Collins, CO 80523-1499

2Department of Statistics, Oregon State University, Corvallis, OR 97331-4501

3U.S. Environmental Protection Agency, Western Ecology Division, Corvallis, OR 97333

4Department of Statistics, Colorado State University, Fort Collins, CO 80523

Communicating author: D. Theobald,

Submitted to: Environmental Management

Abstract

Sampling of a population is frequently required to understand trends and patterns in natural resource management because financial and time constraints preclude a complete census. A rigorous probability-based survey design specifies where to sample so that inferences from the sample apply to the entire population. Probability survey designs should be used in natural resource and environmental management situations because they provide the mathematical foundation for statistical inference. Development of long-term monitoring designs in particular demands survey designs that achieve statistical rigor and are efficient, but remain flexible to inevitable logistical or practical constraints during field data collection. Here we describe a recently developed approach to probability-based survey design termed spatially-balanced sampling. We develop an algorithm, called the Reversed Randomized Quadrant-Recursive Raster, to implement this approach in a geographic information system. This approach provides environmental managers a practical tool to generate flexible and efficient survey designs for natural resource applications. Moreover, factors commonly used to modify sampling intensity, such as strata, gradients, or accessibility, can be readily incorporated into the spatially-balanced sample design. We provide examples of survey designs for point, line, and areal features (e.g., lakes, streams, and vegetation).

Keywords

Monitoring, spatial sampling, probability-based survey, GIS, accessibility

Introduction

Understanding the status and trends of natural resources is an important goal in natural resource management. Because financial and time constraints usually preclude a complete census of an entire population of interest, sampling is frequently required. For example, the US National Park Service’s Inventory & Monitoring Program is currently undergoing an extensive effort to determine park “vital signs”. Developing useful monitoring designs is a key component of this effort (Oakley and others 2003). Monitoring design requires careful consideration of the resource to be monitored (target population), what will be measured (indicator), how it will be measured (response design), where it will be monitored (survey design), how frequently it will be monitored (time selection), and how measurements will be summarized (population estimation).

We have three goals in this paper. First, we briefly review the need for a rigorous survey design so that characteristics about the whole study area or population may be estimated from a sample or part of the population (Thompson 2002). In natural resource applications, features of interest (resources) to be surveyed are typically identified by their location. This contrasts with the perspective under which classical sampling developed, namely that units are selected from a list. Consequently, natural resource sampling needs an alternative perspective: sampling in space. Second, we describe the advantages of a recently developed approach called spatially-balanced sampling (SBS) and briefly compare it to common designs for probability-based sampling. We argue that SBS is a practical alternative to simple random sampling and has a number of advantages for natural resource applications. Finally, we describe a novel implementation of a spatially-balanced design within a geographic information system (GIS) framework. We believe that by providing an implementation of an SBS algorithm as a GIS tool, the SBS method will be more accessible and useful to natural resource managers.

Sampling populations

Two scientifically defensible approaches exist for extrapolating from a sample to an entire population: model-based and design-based inference (Smith 1976; Särndal 1978; Hansen et al. 1983). Model-based and design-based issues are discussed in a spatial context by de Gruijter and Ter Braak (1990) and Brus and de Gruijter (1993). Model-based inference relies on an explicit specification of the relationship between the sample and the population, and it enables very general and precise inference from limited data. However, if the model is not a faithful description of reality, estimates of the population based on the model may have little resemblance to the true population values. Practically speaking, most real-world ecological systems are simply too complex to use model-based inference methods with much confidence.

Design-based inference begins with the use of a probability survey design to select the sample and then uses the randomization in that design as the basis for inference from the sample to the entire population. Probability-based sampling should be used in natural resource and environmental management situations because it provides the mathematical foundation for statistical inference (Stehman 1999). For example, sample locations are selected from space with a known probability that may reflect other factors such as physical characteristics related to the response variable (e.g., topography, vegetation type, precipitation). Provided the design is properly applied, the resulting inference can be made with known confidence, and confidence intervals can be narrowed by increasing sample size. This paper describes probability survey designs from the perspective of selecting resources at locations within a study area.

The key characteristic of a probability-based sample is the specification of an inclusion probability (π) that is known and non-zero. That is, all locations within a study area (i.e., all members of the population) have some known chance of being selected. This contrasts sharply with non-probability-based sampling, where samples are selected in an ad hoc manner and the likelihood of selection is not known. Violation of known and non-zero probability often occurs in natural resource applications when representative locations are chosen based on judgment alone, for example when locations are selected because of convenient access, or when patches of homogeneous vegetation meeting minimum size constraints are selected as “training sites” for classification of a remotely sensed image (Stehman 2001). A related issue is that exclusion zones are often used to reduce the area that needs to be sampled, such as restricting sampling to patch interiors, to locations near roads, or to certain types of patches. Exclusion zones narrow the search area, but because the inclusion probability for exclusion zones is zero, no inference from the samples can be extended to exclusion zones (Stehman 2001).

Although exclusion zones seemingly make more efficient use of scarce sampling resources, they are often problematic. For example, preliminary information on distribution and life history requirements of a species or process of interest may be used to exclude certain land cover types from a survey design (e.g., non-riparian cover types). Three potential problems arise with use of these exclusion zones. First, knowledge of where species “should” occur is occasionally limited, and scientists later are “surprised” to find species extending beyond where they were originally thought to reside (e.g., Thompson 2004). Second, spatial data on surrogate variables used to generate the exclusion zones commonly have some level of uncertainty associated with them (e.g., a polygon is misclassified), are not fine-grained enough to resolve important features on the ground (e.g., small, narrow strips of riparian vegetation), and/or are inadequate temporally (e.g., wrong time of year or not current enough). Third, the variability of natural processes (e.g., El Niño/La Niña cycles, fire/disease outbreaks) and human-induced changes (e.g., climate change and land use change) often causes shifts in the distribution of a species or process of interest. Without some, perhaps modest, allocation of resources to “exclusion zones”, trends for the entire resource cannot be documented in a direct way (Peterson et al. 1999).

Because we believe there are important advantages of design-based inference using probability sampling, we will not consider non-probability-based survey designs further in this paper.

Probability-based survey designs

Stehman and Overton (1994, 1996) give an overview of the variety of traditional, basic probability-based spatial survey designs that have different strengths and weaknesses for natural resource applications, including simple random, systematic, stratified, and cluster sampling. Below we briefly describe these more common methods, and then we introduce spatially-balanced sampling and highlight its advantages for natural resource management applications.

Following Stehman (1999), we briefly summarize the relative trade-offs of these different survey designs (Table 1). Simple random sampling (SRS) is used to sample from a population S by generating a series of random x and y values that are paired to form a set of s random sample locations. SRS is simple and flexible, and additional samples can be easily added to an existing set of samples. Although SRS designs may generate a well-distributed sample, it is commonly recognized that any single realization of a set of s random points generated from SRS often has clusters of samples or areas devoid of samples (Stevens and Olsen 2004). Systematic sampling (SyS) locates sampling points regularly, usually equally spaced on a regular grid or along a linear feature covering the entire population (Gilbert 1987; Flores 2003). The design is simple to implement and ensures that all portions of a study area are represented in the sample, a property also called spatially well-distributed or spatially well-balanced. Stratified random sampling (StRS) provides some spatial structure to the overall population through strata, and each stratum is then sampled independently (commonly using SRS). Unequal probability sampling (Overton 1993) is an alternative to StRS. Cluster sampling (CS) is often used when there is some natural spatial grouping of population units. For example, a sample of basins might be drawn using SRS and then one might observe all stream segments within the selected basins, or utilize two-stage sampling to select a sample of streams within the sample basins. Cluster sampling is often used for administrative or operational convenience, because it is often much easier to implement than SRS. The initial location of each cluster can be selected using SRS or a spatially constrained design, e.g., SyS or StRS. Adaptive cluster sampling is often used for rare or elusive species when individuals occur in groups (Thompson 2004). Sites are selected, usually by SRS. Each site is visited, and if a target feature is found at that site (e.g., a rare species observed), then sampling sites are added in adjacent areas.
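
The three simplest designs described above can be sketched in a few lines of Python. This is a minimal illustration for a rectangular study area, not code from any GIS package; the function names, the rectangular strata, and the fixed random seed are assumptions made only for this example.

```python
import random

def simple_random_sample(n, width, height, rng):
    """SRS: n independent, uniformly distributed (x, y) locations."""
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]

def systematic_sample(spacing, width, height, rng):
    """SyS: a square grid of points with one random starting offset."""
    x0 = rng.uniform(0, spacing)
    y0 = rng.uniform(0, spacing)
    points = []
    j = 0
    while y0 + j * spacing < height:
        i = 0
        while x0 + i * spacing < width:
            points.append((x0 + i * spacing, y0 + j * spacing))
            i += 1
        j += 1
    return points

def stratified_random_sample(strata, n_per_stratum, rng):
    """StRS: an independent SRS inside each stratum.

    strata maps a name to a rectangle (xmin, ymin, xmax, ymax); real
    strata are usually irregular polygons, so rectangles are a simplification.
    """
    sample = {}
    for name, (xmin, ymin, xmax, ymax) in strata.items():
        sample[name] = [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
                        for _ in range(n_per_stratum)]
    return sample

rng = random.Random(42)  # fixed seed so the example is repeatable
srs = simple_random_sample(20, 100.0, 100.0, rng)
sys_pts = systematic_sample(25.0, 100.0, 100.0, rng)
strs = stratified_random_sample(
    {"west": (0, 0, 50, 100), "east": (50, 0, 100, 100)}, 10, rng)
```

Plotting the three point sets side by side typically shows the contrast discussed in the text: the SRS realization tends to contain clumps and gaps, whereas the systematic grid covers the area evenly.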

Developing a statistically rigorous, efficient, robust, and flexible sample in natural resource applications remains challenging because of tradeoffs between survey design issues and the practical challenges of field work.

Spatially-balanced survey design

Spatially-balanced sampling (SBS) is a relatively new approach to developing survey designs that are useful for natural resource researchers and practitioners (Stevens 1997; Di Zio et al. 2004). The SBS approach generates a survey design that is probability-based, spatially well-balanced, and flexible, and that typically results in low to moderate variance.

Natural resource data are commonly spatially autocorrelated, which means that “Everything is related to everything else, but near things are more related than distant things” (Tobler 1970, p. 236), or that sample locations that are close together tend to be more similar than more distant ones (Stevens and Olsen 2004). Spatial balance can be visualized by computing Thiessen or Voronoi polygons around sample locations. A sample is well balanced when the polygon areas around sample locations are similar (low variance in polygon area). As a result, an SBS design improves the efficiency of estimated values by maximizing spatial independence among sample locations. Griffith (2005) investigated the role of sample intensity on the effective sample size in the presence of spatial autocorrelation.

Spatial balance leads to more efficient sampling, defined as providing more information per sample, because it attempts to maximize the spatial independence among sample locations. A useful way to measure the statistical efficiency of a survey design is to compute a spatial efficiency ratio (ER): the variance of the areas of the Voronoi polygons formed by an SBS design divided by the corresponding variance for an SRS design (Stevens and Olsen 2004). If ER < 1.0, then the SBS design is more spatially efficient than SRS.
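
As a rough illustration of the efficiency ratio, the sketch below approximates Voronoi polygon areas on a unit square by assigning each cell of a fine grid to its nearest sample point, then takes the ratio of the area variances. The function names and the grid approximation are assumptions made for this example. The "balanced" design is a regular 4 x 4 lattice used only as a stand-in for a well-balanced sample; it is not output from GRTS or RRQRR.

```python
import random
from statistics import pvariance

def voronoi_areas(points, size=1.0, resolution=40):
    """Approximate each point's Voronoi polygon area on a size x size square.

    Every cell of a resolution x resolution grid is assigned to its
    nearest point; a point's area is its cell count times the cell area.
    """
    cell = size / resolution
    counts = [0] * len(points)
    for row in range(resolution):
        cy = (row + 0.5) * cell
        for col in range(resolution):
            cx = (col + 0.5) * cell
            nearest = min(
                range(len(points)),
                key=lambda i: (points[i][0] - cx) ** 2 + (points[i][1] - cy) ** 2)
            counts[nearest] += 1
    return [count * cell * cell for count in counts]

def efficiency_ratio(design_points, srs_points, size=1.0, resolution=40):
    """ER = var(Voronoi areas of the design) / var(Voronoi areas of SRS)."""
    return (pvariance(voronoi_areas(design_points, size, resolution))
            / pvariance(voronoi_areas(srs_points, size, resolution)))

rng = random.Random(7)  # fixed seed so the example is repeatable
srs_points = [(rng.random(), rng.random()) for _ in range(16)]
# Stand-in for a well-balanced design: centres of a regular 4 x 4 lattice.
lattice_points = [((i + 0.5) / 4, (j + 0.5) / 4)
                  for i in range(4) for j in range(4)]
er = efficiency_ratio(lattice_points, srs_points)
```

Because every lattice point receives the same number of grid cells, its area variance is zero and ER takes its limiting value of zero. A realistic spatially-balanced sample would be expected to fall between this limit and 1.0.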

Natural resource managers commonly face a variety of uncertain and often unforeseen forces that influence survey design, so a flexible design is paramount. A primary uncertainty of most projects is that the actual number of samples that can be collected in the field is not known reliably during the construction of the survey design. For example, the cost to acquire each sample varies, depending on factors such as travel costs, ease of access, and possible economies of effort for nearby locations. Often the amount of funding to support a sampling effort decreases during the course of a project. Moreover, the extent of the population being sampled can change over time, so adjustments to the sampling frame (or study area) need to be made as refinements to the population are made. Perhaps more common is that a sample location cannot be visited because of denial of access (e.g., a private land owner or military base) (Lesser 2001), because the feature thought to be located there does not exist (e.g., a dried-up stream or incorrect vegetation type), or because a location simply requires too much effort or is too dangerous to access (e.g., a steep rock cliff or swift stream). Occasionally, logistical, technical, or safety issues in the field preclude collection of a sample at a location, such as the inability to use GPS to locate a sample point in a deep canyon or dense forest. All of these situations result in what is known as the non-response problem (Cochran 1977). Note that the non-response problem also plagues studies that employ remote sensing data (perhaps less so), such as when aerial photo data are not available for a particular portion of a study area or a cloud covers a portion of an image. An SBS design is more robust to these uncertainties because it allows over-sampling, that is, the generation of additional sample locations such that the enlarged sample remains well balanced.

Our method to implement an SBS design is similar to the Generalized Random Tessellation Stratified (GRTS) algorithm (Stevens 1997; Stevens and Olsen 2000; Stevens and Olsen 2004). GRTS is a type of stratified survey design that is probability-based, has the advantage of providing a spatially-balanced design, and was developed for monitoring natural resource trends (e.g., Hall et al. 2000; Herlihy et al. 2000).

There are several advantages of generating an SBS design within a geographic information system (GIS) framework. First, GIS is typically used to establish the sample or reference frame. That is, spatial data are typically needed to represent the population: the spatial extent and location of the study area to be sampled or the set of geographic features to be sampled from (e.g., targeted vegetation types, sections of a stream that provide habitat for a target species, a set of lakes). Note that a full range of geographic features can be sampled, including points (e.g., centers of lakes or pre-defined stream reaches), lines (streams, roads, etc.), and areas (vegetation patches, lakes, estuaries, etc.). If certain geographic regions or particular features need to be sampled at different intensities (e.g., rare vegetation types or higher-order streams), a spatial dataset provides a convenient way to specify the unequal probabilities associated with different geographic features. Occasionally inclusion probabilities may need to vary continuously across a surface to reflect an environmental gradient such as elevation or precipitation. This can be accomplished using a raster-based representation of spatial data. Second, GIS allows the visualization of a completed survey design so that the samples can be viewed within the context of other geographic layers, such as vegetation types, roads and trails for access, and land ownership. Examining the context of the samples is useful to evaluate a survey design and to develop a logistical plan for collecting the field data. Third, we believe that a broader base of potential users will be reached. Many natural resource professionals use GIS regularly, but may not be as familiar with statistical packages (such as S-Plus, SAS, or R) and the methods needed to generate survey designs.
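
To make the idea of a raster of inclusion probabilities concrete, the following sketch scales a small grid of relative sampling weights so that the cell probabilities sum to a target expected sample size, and then selects cells independently with those probabilities (Poisson sampling). This is a generic textbook selection scheme chosen for brevity, not the method developed in this paper, and it is not spatially balanced. The function names, the tiny example raster, and the use of None for cells outside the target population are assumptions for illustration.

```python
import random

def inclusion_probabilities(weights, expected_n):
    """Scale positive cell weights so the probabilities sum to expected_n.

    weights is a list of rows; None marks cells outside the target
    population. Probabilities are capped at 1.0.
    """
    total = sum(w for row in weights for w in row if w is not None)
    return [[None if w is None else min(1.0, expected_n * w / total)
             for w in row]
            for row in weights]

def poisson_sample(probabilities, rng):
    """Select each cell independently with its own inclusion probability.

    The realized sample size is random; returns (row, col, pi) tuples so
    the probability travels with each selected cell for later estimation.
    """
    selected = []
    for r, row in enumerate(probabilities):
        for c, p in enumerate(row):
            if p is not None and rng.random() < p:
                selected.append((r, c, p))
    return selected

# Hypothetical weights: the right-hand column (e.g., a rare vegetation
# type) is sampled at twice the intensity of the other cells.
weights = [[1, 1, 2],
           [1, None, 2],
           [1, 1, 2]]
pi = inclusion_probabilities(weights, expected_n=3)
sample = poisson_sample(pi, random.Random(1))
```

Keeping the inclusion probability with each selected cell matters because design-based estimates weight each observation by the inverse of its inclusion probability.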

Most standard GIS provide a limited set of tools to construct rigorous spatial survey designs, and although SRS designs can be generated in most GIS, they typically require custom programs or scripts. For example, Theobald (2003) described random sampling with unequal probabilities using continuous (raster) data in ArcGIS (Environmental Systems Research Institute, Redlands, CA), Huber (2000) developed methods for systematic sampling, Jenness (2001) created a random point generator in ArcView v3 (Environmental Systems Research Institute, Redlands, CA), the IDRISI system (Clark Labs, Worcester, MA) has embedded the GStat package into their GIS (Pebesma and Wesseling 1998; Pebesma 2003), and sampling routines have been written for GRASS in the r.le (landscape ecology; Baker and Cai 1992) and r.samp routines (Mitchell and others 2002). These existing extensions to GIS provide tools to generate simple, systematic, and stratified unequal probability designs, but not a spatially-balanced design.

Below we describe a new approach to generate spatially-balanced survey designs, called the Reversed Randomized Quadrant-Recursive Raster (RRQRR). We explain how this approach differs from GRTS and describe its implementation in ArcGIS v9 (ESRI). Finally, we provide case studies to illustrate the sampling methodology for geographic features represented by points, lines, and areas.

Methods

Generating a spatially-balanced sample is accomplished by using a function that converts 2D space into 1D space, i.e., locations (or cells) defined by x, y coordinates (or by row and column) are converted to a single ordered list. Three common sequential ordering systems are (1) row-major order (left to right, row by row), (2) row-prime order, also known as boustrophedon ordering, meaning “like an ox plowing a field” (Goodchild and Grandfield 1983), and (3) Peano or Morton scan ordering. The average distance between the cells holding two sequential numbers can vary widely among these methods (Goodchild and Grandfield 1983), and long jumps can occur between values of adjacent cells in adjacent rows.
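
The three orderings can be compared on a small square raster with the sketch below, which lists the cells in each order and measures the straight-line distance between consecutive cells. The function names are hypothetical, and the Morton key here interleaves column bits into even positions and row bits into odd positions, which is one common convention. The raster side is assumed to be a power of two so that the Morton order fills complete quadrants.

```python
from math import hypot

def row_major(n):
    """Left to right within each row, rows from top to bottom."""
    return [(r, c) for r in range(n) for c in range(n)]

def row_prime(n):
    """Boustrophedon order: the column direction alternates on each row."""
    order = []
    for r in range(n):
        cols = range(n) if r % 2 == 0 else range(n - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order

def morton_key(r, c, bits):
    """Interleave the bits of column (even positions) and row (odd positions)."""
    key = 0
    for b in range(bits):
        key |= ((c >> b) & 1) << (2 * b)
        key |= ((r >> b) & 1) << (2 * b + 1)
    return key

def morton(n):
    """Cells sorted by Morton key, tracing a recursive Z-shaped path."""
    bits = (n - 1).bit_length()
    return sorted(row_major(n), key=lambda rc: morton_key(rc[0], rc[1], bits))

def step_lengths(order):
    """Euclidean distance between each pair of consecutive cells."""
    return [hypot(r2 - r1, c2 - c1)
            for (r1, c1), (r2, c2) in zip(order, order[1:])]

n = 8
summary = {}
for name, order in (("row-major", row_major(n)),
                    ("row-prime", row_prime(n)),
                    ("Morton", morton(n))):
    steps = step_lengths(order)
    summary[name] = (sum(steps) / len(steps), max(steps))
    print(name, "mean step:", round(summary[name][0], 3),
          "max step:", round(summary[name][1], 3))
```

In row-prime order every step has length one, whereas row-major order jumps the full width of the raster at the end of each row and Morton order makes occasional long jumps between quadrants.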