Exercise 2: Obtaining and Formatting Environmental Variables

Adam B. Smith & Danielle S. Christianson

This tutorial is available from

July 5, 2016

In this exercise you will learn:

  1. Where to obtain environmental data.
  2. How to prepare environmental data for modeling.

There are many kind of environmental data that can be used for modeling, many different places from which to obtain it, and many different kinds of data formats. Also, the files tend to be very large. As a result, we won't be able to cover every aspect of obtaining environmental data that may be relevant to your research--showing you how to obtain and process one particular data set won't show you how to do the same for another data set. Hence, this part of the workshop is somewhat "pre-packaged" in the sense that we've already downloaded some of the data and pre-processed it for you.

Take home: Obtaining and preparing predictor variables is often a highly manual process that can't be easily generalized. There are many places to obtain this data and many formats in which it can be obtained. In the end, though, you'll want to get it all into the same format with the same resolution and coordinate system.

Predictors for distribution/niche models

Many factors can influence a species' range and shape its niche:

•Climate

•Soils and geology

•Disturbances like fire and floods

•Other species and diseases/pathogens

•Anthropogenic land cover that can be remotely sensed like agriculture, development, forestry, and infrastructure

•Anthropogenic land use activities that are difficult to sense remotely like recreation, hunting/harvest, and selective logging

We've listed these in rough order of the availability of this kind of data (climate data is easy to find, others less so).

If you're dealing with aquatic systems, some other factors may also be relevant:

•Dissolved minerals and gases (e.g., oxygen), pH, flow rate, depth, sunlight penetration, habitat type (lotic, lentic, estuary, marine)

•In lotic systems flood frequency and intensity, plus dams and impoundments

•In lentic systems spring/fall turnover and depth

•And in marine systems salinity, storms, tides, pressure, currents, and upwelling

Finally, there are many factors "intrinsic" to a species that interact with the factors listed above to determine range and niche properties:

•Dispersal ability

•Natural selection

•Demographic processes

In general, distribution/niche modeling ignores these latter aspects and therefore assumes that species are in equilibrium with their environment. Newer techniques enable incorporating these processes into the modeling framework, but as of yet there is no overarching set methodology for doing this, and so models tend to be species-specific.

For the sake of expediency this workshop we will focus only on climate-based predictors. However, it is very worth pursuing other kinds of predictors if you feel they are relevant to modeling your species.

Climate variables

The BIOCLIM system

We often speak of "temperature" and "precipitation" as if there were only one way to measure each, when in fact there are probably hundreds of ways to measure each one of these aspects in a manner relevant to distribution/niche modeling. There are several semi-standardized systems for calculating versions of these variables. One of the most popular is the "BIOCLIM" system of variables, which was (confusingly) named after a model algorithm of the same name.

These variables are:

BIO1 = Annual Mean Temperature
BIO2 = Mean Diurnal Range (Mean of monthly (max temp - min temp))
BIO3 = Isothermality (BIO2/BIO7) (* 100)
BIO4 = Temperature Seasonality (standard deviation *100)
BIO5 = Max Temperature of Warmest Month
BIO6 = Min Temperature of Coldest Month
BIO7 = Temperature Annual Range (BIO5-BIO6)
BIO8 = Mean Temperature of Wettest Quarter
BIO9 = Mean Temperature of Driest Quarter
BIO10 = Mean Temperature of Warmest Quarter
BIO11 = Mean Temperature of Coldest Quarter

BIO12 = Annual Precipitation BIO13 = Precipitation of Wettest Month
BIO14 = Precipitation of Driest Month
BIO15 = Precipitation Seasonality (Coefficient of Variation)
BIO16 = Precipitation of Wettest Quarter
BIO17 = Precipitation of Driest Quarter
BIO18 = Precipitation of Warmest Quarter
BIO19 = Precipitation of Coldest Quarter

This list was taken from the WORLDCLIM website and is easy to find if you just search for "BIOCLIM" and "WORLDCLIM"--it's nearly always the first link shown in the results. The first 11 variables pertain to temperature and the next 8 to precipitation. In the BIOCLIM system "quarters" can span any three contiguous months of the year (e.g., December-January-February).

The BIOCLIM variables attempt to encapsulate the major climatic forces that might affect a species' range. For example, many temperate plants are only able to withstand cold temperatures up to a certain point, so BIO06 (minimum temperature of the coldest month) is a appropriate variable to use for modeling their distributions. Likewise, during the winter many temperate animals depend on food stored in the summer (either as fat or external reserves), so summer primary productivity is likely relevant to their ability to accumulate food stores and body fat. In this case summer rainfall (BIO18 precipitation of the warmest quarter) is probably relevant, though BIO10 may be as well.

Obtaining climate data

As mentioned, there are many sources of climate data, though a few are used commonly in niche/distribution modeling. Please see for an incomplete listing. Below we'll describe two commonly used data sets plus a newer "up-and-coming" data set.

WORLDCLIM

Link:
What: Gridded climate reconstructions from 1-km to ~20-km resolution
Where: All of the world's land area (sans Antarctica) plus any year up to 2100 (projected)
When: Climate averaged across 1950-2000 (historic "measured") plus paleo (reconstructed) plus two periods in the 2100s (projected)
Description: WORLDCLIM was generated by interpolating data from tens of thousands of weather stations. The interpolation algorithm considers geographic position and elevation. The results are climate rasters (=gridded data) available with a cell size from 1-km to ~18-km, depending on which which version you download. WORLDCLIM has been used in thousands of studies.
Citation: Hijmans, R.J., S.E. Cameron, J.L. Parra, P.G. Jones and A. Jarvis, 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25:1965-1978.

In this workshop we will be using the WORLDCLIM data set.

PRISM (Parameter Regression on Independent Slopes Model)

Link:
What: Weekly climate at 4-km (free) and monthly/daily at 30-arcsec (~800 m; pay for US) resolution
Where: Coterminous United States and British Columbia
When: 1895-current (historic) for monthly/weekly data and 1981-current for daily data
Description: Like WORLDCLIM, PRISM was generated by interpolating weather station data. Also like WORLDCLIM, PRISM accounts for elevation and position in its interpolation algorithm, but also attempts to capture many other kinds of factors that can affect weather/climate (e.g., distance to coast, temperature inversions, windward/leeward side of mountains, etc.). The 1-km resolution version of PRISM is the industry "standard" climate data set for work in the coterminous United States and costs about $2500 + $25 per year of data requested. The 4-km version is free.
Citation: Daly, C., Gibson, W.P., Taylor, G.H., Johnson, G.L., and Pasteris, P. 2002. A knowledge-based approach to the statistical mapping of climate. Climate Research 22:99-113.

ClimateXX

Link: but see also
Where: North America, South America, Europe
When: 1901-2009 (historic) plus any year up to 2100 (projected)
Description: "ClimateXX" stands for ClimateNA (North America), ClimateWNA (Western North America), ClimateSA (South America), and ClimateEU (Europe). The ClimateXX system uses PRISM 4-km data as a starting point then interpolates using high-order polynomials. The dates covered are from 1901 to 2009, plus future years up to 2100. Some standardized periods are available for download (e.g., 1981-2010) at 1-km resolution, but savvy users can generate their own custom climate coverages using the free ClimateNA software (for North America).
Citation: Wang, T., Hamann, A., Spittlehouse, D.L., and Murdock, T.Q. 2012. ClimateWNA - High resolution spatial climate data for Western North America. Journal of Applied Meteorology and Climatology 51:16-29.

Obtaining WORLDCLIM data

To avoid making you wade through hours of downloading, unzipping, and preparing the WORLDCLIM rasters, we've already downloaded them. However, we want you to see where they are from.

  1. Go to the WORLDCLIM web site and click "Download" then "Current".
  2. Most often you would want the 30-arcsec (~1-km) resolution data set. However, the files are huge and so processing them in R can literally take days or weeks. Hence we'll be using the 10-arcminute version which has cells ~18 km on a side at the equator.

Note that there are two versions of the "Current" data set, a "Generic Grids" version and an "ESRI grids" data set. The former are GeoTIFF rasters which can be read by almost any GIS program while the latter will work in ESRI's ArcMap and related software, plus some other platforms.

  1. We've downloaded all of the data under the line "10-arc-minutes". You can find it in the folder WORLDCLIM/1950-2000. We'll look at it later.
  2. Now please "back" out of the page you're on until you see the option to download "Future" data. Click "Future" then "10 minutes". These data represent climate predicted under the latest IPCC's emissions scenarios (the Coupled Model Intercomparison Project 5 or "CMIP5" scenarios). Please read the explanation on this page (you should be here).

Each column represents an emissions pathway or "Representative Concentration Pathway" (RCP). Each RCP is labeled with a number that corresponds the to the additional Watts of radiative forcing (= heating) that anthropogenic activities will add to the surface of the Earth by 2100. For example, under RCP8.5 CO2 emissions and other factors will add 8.5W of radiation to each m2 of the Earth (on average). RCP8.5 represents "business as usual"--warming continues unabated. RCP2.6 represents a future in which emissions are aggressively abated and even reduced come the end of the century. Realistically, given current sociopolitical trends, RCP4.5 (some abatement) and RCP8.5 (no abatement) are the most likely "bracketing" scenarios for the future of our world.

Each row in the table represents output from one atmosphere-ocean global circulation model (AOGCM). AOGCMs produce different predictions because they have different ways to represent the same process and because they represent different processes. As a result, some AOGCMs tend to predict more extreme climate in comparison to other models. Likewise, the "best" AOGCM often depends on the location--some are very good at recapitulating current climate for certain regions but not others, for example. Overall though, climatologists have found that the ensemble (=mean) model best recreates global trends (but not necessarily regional trends).

For our future climate layers we'll be using a global ensemble prediction that we calculated from 8 of the AOGCMs available on the WORLDCLIM website. We chose these 8 models because they had the full complement of data available (e.g., ACCESS1-0 doesn't have predictions for RCP2.6) and by eliminating all but one model in each "family" (e.g., there are three HadGEM models). When deciding between models in the same family, we chose the one that represents the full Earth-atmosphere-ocean process (some ignore components like some chemical reactions in the atmosphere).

To create the ensemble we downloaded each individual AOGCM's monthly predictions for min/max temperature and precipitation, for each variable averaged across AOGCMs for each month, then from the three sets of monthly values for min/max temperature and precipitation calculated the 19 BIOCLIM variables we examined above. The BIOCLIM variables were calculated using the bioclim() function in the raster package.

Note that these projections are really averages for periods centered on 2050 and 2070 even though it's convenient to call them just "2050" and "2070". That's important because predictions for any one year in the future are sure to be very different from reality, while predictions averaged across a time period are likely more accurate.

You can find the future layers in the folder WORLDCLIM/World/2061 to 2080 - RCP8pt5 - ENSEMBLE AOGCM. For this workshop we'll only use the projections for 2070.

Exploring WORLDCLIM data

Let's look at the current climate layers. First, let's look at mean annual temperature (MAT) for the current time period.

mat <-raster('./WORLDCLIM/1950-2000/World - BIOCLIM Variables/WC01.tif')
mat

## class : RasterLayer
## dimensions : 900, 2160, 1944000 (nrow, ncol, ncell)
## resolution : 0.1666667, 0.1666667 (x, y)
## extent : -180, 180, -60, 90.00001 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +ellps=WGS84 +no_defs
## data source : C:\ecology\Drive\Workshops\SDM from Start to Finish (WSU, 2016-04)\Series 1 GMGS\WORLDCLIM\1950-2000\World - BIOCLIM Variables\WC01.tif
## names : WC01
## values : -32768, 32767 (min, max)

This is the contents of the mat raster--you can see that it is 900 cells high by 2160 cells wide. It also has a resolution of 0.1667 or 1/6th of a degree (= 10 arcmin). The extent covers the entire world, from -180 longitude to 180 longitude (the International Date Line) and from -60 degrees south (southern the tip of South America) to 90 north (the North Pole). The raster uses the WGS84 coordinate system (a model of the Earth for designating locations--more on this later). The name of the raster in R is WC01 (not BIO01). Finally, R shows you the presumed min and max values for the raster, but in actuality these are just the min and max values that the data format in which this raster is stored can represent.

First, let's fix the min/max values so they correctly represent the highest and lowest values in the raster.

mat <-setMinMax(mat)
mat

## class : RasterLayer
## dimensions : 900, 2160, 1944000 (nrow, ncol, ncell)
## resolution : 0.1666667, 0.1666667 (x, y)
## extent : -180, 180, -60, 90.00001 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +ellps=WGS84 +no_defs
## data source : C:\ecology\Drive\Workshops\SDM from Start to Finish (WSU, 2016-04)\Series 1 GMGS\WORLDCLIM\1950-2000\World - BIOCLIM Variables\WC01.tif
## names : WC01
## values : -269, 314 (min, max)

You can see that the minimum value is -269 and the maximum 314. The units are 10ths of a degree, so the lowest MAT across the Earth represented by this raster is -26.9°C and the highest 31.4°C.

Let's plot the raster.

plot(mat, main='MAT')

Now, let's look at mean annual precipitation (MAP).

map <-raster('./WORLDCLIM/1950-2000/World - BIOCLIM Variables/WC12.tif')
plot(map, main='MAP')

The units for most precipitation variables in WORLDCLIM is millimeters, so the map of MAP represents precipitation ranging from no precipitation to nearly 8 m per year!

Now let's look at all of the BIOCLIM layers together (or most of them). We'll create a "raster stack" object that holds multiple rasters together that can be treated as if it were a single variable.

current <-stack(list.files('./WORLDCLIM/1950-2000/World - BIOCLIM Variables',
full.names=TRUE, pattern='.tif'))
current <-setMinMax(current) # get correct min/max values across all rasters
current

## class : RasterStack
## dimensions : 900, 2160, 1944000, 19 (nrow, ncol, ncell, nlayers)
## resolution : 0.1666667, 0.1666667 (x, y)
## extent : -180, 180, -60, 90.00001 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +ellps=WGS84 +no_defs
## names : WC01, WC02, WC03, WC04, WC05, WC06, WC07, WC08, WC09, WC10, WC11, WC12, WC13, WC14, WC15, ...
## min values : -269, 9, 8, 72, -59, -547, 53, -251, -450, -97, -488, 0, 0, 0, 0, ...
## max values : 314, 211, 95, 22673, 489, 258, 725, 375, 364, 380, 289, 9916, 2088, 652, 261, ...

You can see that the stack object contains 19 layers, one for each BIOCLIM variable.

plot(current)

The world maps are nice, but our target species does not occur everywhere! Before we model, we'll want to clip the rasters to the appropriate spatial extent, choose which amongst them we want to use, and do some further preparation.

Handling rasters

A note on raster formats

You can see the raster formats the raster package can handle by using ?writeFormats. By default the raster package in R uses the GRD format, but we've had mixed luck opening these up in other programs.

We prefer the GeoTIFF format because it is has a small size on the disk and because it can be opened by many GIS programs. Small file size reduces processing time.

Often, raster data comes in the ASCII format, which is also compatible across GIS programs. However, ASCII rasters can be very large and as a result take a lot of time for R to process.

A note on raster processing

R is good at many things, but it's not good at handling large rasters. It can literally take weeks to manipulate a set of rasters for niche/distribution modeling. In many cases we skip my usual R-based workflow and use ArcMap or other software which can usually do things much faster. The main aspects of rasters that affect processing time are:

•Size (number of cells wide by long)

•Resolution (smaller cells = more cells)

•How many rasters are to be processed

We'll be using the 10-arcmin (~18-km) resolution WORLDCLIM rasters specifically because R can handle them quickly. Normally we suggest using rasters with smaller cell sizes to capture finer-scale local variation in climate, provided the precision of your species record coordinates are comparable to the given resolution.

Reflection

  1. For any given species in which you're interested, what climate variables (if any) would be appropriate to use for modeling? Does any empirical work inform your selection?
  2. Under what conditions would you choose to work with high-resolution (small cell size) versus low-resolution (large cell size) rasters?
  3. What non-climatic variables might be important to modeling any particular species in which you're interested? Are these kinds of data available?