Alaska Biome Shifts
Strategy and Planning Meeting
October 20th 2010
USFWS office, Anchorage AK
In attendance:
Karen Murphy, USFWS
Evie Whitten, TNC
Nancy Fresco, SNAP, UAF
Michael Lindgren, EWHALE, UAF
Joel Reynolds, USFWS
Joel Clement, Wilburforce
Via teleconference:
Falk Huettmann, EWHALE, UAF
Synopsis
The group went over the results of modeling efforts thus far, including difficulties imposed by the project's requirements and by the limitations of the software and hardware. We discussed potential fixes and compromises for these difficulties.
Overview of modeling efforts to date
Michael presented the latest clustering efforts, which now include all of Alaska, ranging from 4 to 10 clusters. He also created projections in RandomForest based on these clusters. From a subjective perspective, these results look good. The clusters appear logical, based on our "expert knowledge" of the landscape. At the lower cluster numbers, the landscape is separated into arctic, boreal, western coastal, southeastern, and Aleutian regions; higher cluster numbers create more diverse categories that appear to account for mountain ranges and other known features. There appears to be reasonable congruence with some of our land cover maps, and with the biome classifications used in Phase I of the project, although this has yet to be analyzed mathematically.

However, Canada has not yet been included in clustering efforts, and there are concerns about resolution. Although projections were made at 2 km, the clustering was done at a resolution of 10 km (essentially squares of twenty-five 2 km pixels each). Michael did 25 repeats to try to overcome this problem. Even this was a vast improvement over previous efforts, which resulted in continual computer crashes. Michael is now getting help from Dustin Rice (SNAP IT person) and is using a combination of SNAP and EWHALE computing power in order to maximize RAM, pagefile capacity, processing speed and capacity, and file storage space.
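For concreteness, here is a minimal sketch of the projection step as described above, written in Python with scikit-learn rather than the group's actual tooling: fit a random forest to predict cluster membership from present-day climate variables, then apply it to a future climate grid. All array names, sizes, and the synthetic "future" shift are illustrative assumptions, not project data or the project's actual workflow.

```python
# Sketch only: random-forest projection of cluster membership, assuming
# toy arrays in place of the real climate grids and PAM cluster labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_cells, n_vars = 20_000, 24
current_climate = rng.normal(size=(n_cells, n_vars))   # stand-in for 2 km climate grid
clusters = rng.integers(8, size=n_cells)               # stand-in for PAM labels (8 clusters)

# Train on the present-day climate/cluster relationship ...
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(current_climate, clusters)

# ... then ask which cluster each cell's projected future climate resembles.
future_climate = current_climate + rng.normal(0.5, 0.2, size=(n_cells, n_vars))
projected = rf.predict(future_climate)
print("fraction of cells changing cluster:", np.mean(projected != clusters))
```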
Issues of scale and resolution
A lengthy discussion ensued regarding resolution and modeling limitations. Why can't we create clusters at 2 km?
Karen expressed concern regarding the utility of the eventual outputs of the project. She pointed out that our goal is to create projections that are useful to land managers, particularly from the perspective of managing protected areas and refuges. These managers may need to decide whether existing protected areas are adequate and whether they provide landscape connectivity. Even 5 km grids are rather broad for this kind of assessment, and create "smeared" edge effects. At this scale, we are talking about big landscapes that don't fit neatly into conservation units, and results may be poor indicators of what is or is not inside a refuge. A great deal of detail is lost.
Michael, Joel, and Falk together helped explain the reasons for the limitations, and the group discussed work-around options. The problem is not in the projections using RandomForest, but in the creation of clusters, because of the way clusters are defined and computed (see below).
Basics of clustering (see also ppt)
• Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.
• In hierarchical methods, the choice of which clusters to merge or split is determined by a linkage criterion, which is a function of the pairwise distances between observations.
• We are clustering using the PAM (partitioning around medoids) algorithm, which operates on the dissimilarity matrix of the given data set. The dissimilarity matrix (also called a distance matrix) describes the pairwise dissimilarity between objects.
• The PAM algorithm first computes representative objects, called medoids (the number of medoids depends on the number of clusters). A medoid can be defined as the object of a cluster whose average dissimilarity to all the other objects in the cluster is minimal. After finding the set of medoids, each object of the data set is assigned to the nearest medoid.
• PAM is a robust algorithm because it minimizes a sum of dissimilarities and is thus not strongly influenced by outliers. The medoid for each cluster behaves more like a median than a mean (thus the name).
• Thus, in order to cluster using PAM, the algorithm must compare every data point (grid cell) with every other one. As the number of grid cells increases with higher resolution maps, the cost grows quadratically, not merely linearly: n cells require on the order of n²/2 pairwise dissimilarities. At 2 km this is a VAST undertaking (see the sketch below). Even with all of Michael, Falk, and Dustin's efforts, it may simply be impossible.
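To make the scaling problem concrete, here is a minimal PAM-style sketch in Python. It uses a simplified alternating medoid update rather than the full PAM swap search, and toy data in place of the real climate grids; it is not the project's implementation. The point is the dissimilarity matrix D, whose size grows with the square of the number of cells.

```python
# Sketch only: simplified k-medoids with an explicit dissimilarity matrix.
import numpy as np

def pam_sketch(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # The full pairwise dissimilarity matrix is the bottleneck: n cells
    # imply n*(n-1)/2 distinct distances, i.e. quadratic growth.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)      # assign to nearest medoid
        for j in range(k):                             # recompute each medoid as the
            members = np.where(labels == j)[0]         # member minimizing total
            if members.size:                           # within-cluster dissimilarity
                sub = D[np.ix_(members, members)]
                medoids[j] = members[np.argmin(sub.sum(axis=1))]
    return medoids, np.argmin(D[:, medoids], axis=1)

# Toy run: 1,000 "cells" x 24 climate variables.
X = np.random.default_rng(1).normal(size=(1000, 24))
medoids, labels = pam_sketch(X, k=8)

# Why 2 km clustering is so hard: at roughly 400,000 land cells (an
# illustrative figure, not a measured count), a dense float64 matrix
# alone needs about 400,000^2 * 8 bytes, i.e. on the order of a terabyte.
print(f"{(400_000 ** 2 * 8) / 1e12:.1f} TB")
```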
Scale of Canadian data
Nancy reminded everyone that PRISM data are unavailable for much of Canada (all except the Yukon and BC), meaning that the best resolution for SNAP data there is 10 minutes lat/long. There was some confusion about what this translates to in km. Nancy checked later and found that it is approximately an 18.4 km by 18.4 km grid. The only other option for historical data for Canada would be to use the dataset we used last time, based on mean values for each ecoregion, artificially imposing a grid on those data. However, the SNAP data downscaled with CRU are the only dataset available for future projections. The historical CRU data are the obvious choice for cluster modeling, since they are on the same grid as these future projections, and will provide better regularity, wider choices for baseline years, and in many cases better resolution than the ecoregion data.
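As a back-of-envelope check of that conversion (a sketch, using the mean Earth radius): one arcminute of latitude is about 1.85 km, so a 10-minute grid is roughly 18.5 km north-south, consistent with the ~18.4 km figure; east-west spacing additionally shrinks with the cosine of the latitude.

```python
# Sketch only: converting a 10-arcminute grid spacing to km.
import math

earth_radius_km = 6371.0                         # mean Earth radius
ns_km = earth_radius_km * math.radians(10 / 60)  # 10 arcmin as an arc length
print(f"10 arcmin ≈ {ns_km:.1f} km north-south")
# East-west spacing at latitude L is ns_km * cos(L), so it is smaller
# at the high latitudes of the study area.
```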
Testing multiple methods to accommodate scale issues
Perhaps clustering at some resolution coarser than 2 km will be perfectly adequate. If so, it will save a lot of time and trouble, and lessen concerns over trans-boundary scale incongruity. It will also provide feedback for the Canadian project, which will use a much coarser resolution (18 km) by default.
The group agreed that we should test various methods (a sketch of the subsampling schemes follows the list):
1) Run all the 2 km pixels on a smaller area of the state.
2) Run a random subset, choosing one 2 km grid cell from each 10 km by 10 km square.
3) Run a regular subset, e.g. the southeast pixel from each 10 km by 10 km square.
4) Average all the 2 km pixels in each 10 km by 10 km square to create a coarser grid.
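A minimal sketch of schemes 2–4 in Python/numpy, assuming a toy 2-D raster of one climate variable in place of the real grids (all names and sizes are illustrative):

```python
# Sketch only: subsampling/averaging a 2 km grid into 10 km blocks (5 x 5 pixels).
import numpy as np

rng = np.random.default_rng(42)
grid = rng.normal(size=(500, 500))        # toy 2 km raster (1,000 km x 1,000 km)
b = 5                                     # 5 x 5 pixels per 10 km block
nr, nc = grid.shape[0] // b, grid.shape[1] // b
blocks = grid.reshape(nr, b, nc, b).swapaxes(1, 2)   # (nr, nc, 5, 5) tiles

# Method 2: one random 2 km pixel per 10 km block (repeat for multiple runs).
i = rng.integers(b, size=(nr, nc))
j = rng.integers(b, size=(nr, nc))
random_subset = blocks[np.arange(nr)[:, None], np.arange(nc)[None, :], i, j]

# Method 3: a regular subset, e.g. the southeast (bottom-right) pixel of each block.
regular_subset = blocks[:, :, -1, -1]

# Method 4: average all 25 pixels in each block.
block_mean = blocks.mean(axis=(2, 3))

print(random_subset.shape, regular_subset.shape, block_mean.shape)  # all (100, 100)
```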
We agreed that Michael would do a test run using all four approaches within a relatively small area in the southeastern portion of the "main" part of the state (Tok/Valdez/Cordova/McCarthy area). Each of these test runs would be for only an 8-cluster model, for the sake of simplicity, and every run would use the same area. For methods #2 and #3, multiple runs would be done – ideally 25 runs each, such that the total area sampled would add up to the complete area in question. It was also suggested that Michael run a test at 50 km resolution; however, given the resolution of the Canadian data, it might make more sense to run a test at 18 km (or 10 minutes).
Analyzing clusters
There was some discussion of how to compare or combine these results, since each clustering attempt is individual, and results cannot really be averaged. However, it was agreed that results from multiple runs and multiple methods should be similar enough that it should be easy to see which clusters are analogous, meaning that they can then be matched up mathematically according to % same vs. % different (one possible matching approach is sketched below). Results can also be analyzed subjectively, simply by looking at the resulting maps to see whether they appear similar or different.
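One possible way to do that matching, sketched in Python: labels_a and labels_b are hypothetical per-pixel cluster assignments from two runs, and the pairing step uses the Hungarian algorithm via scipy, which may or may not be what the group ends up using.

```python
# Sketch only: match cluster labels across two runs and report % agreement.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_clusters(labels_a, labels_b, k=8):
    overlap = np.zeros((k, k), dtype=int)          # shared cells per label pair
    np.add.at(overlap, (labels_a, labels_b), 1)
    row, col = linear_sum_assignment(-overlap)     # pairing that maximizes overlap
    agreement = overlap[row, col].sum() / len(labels_a)
    return dict(zip(row.tolist(), col.tolist())), agreement

rng = np.random.default_rng(0)
a = rng.integers(8, size=10_000)                   # toy run 1
b = (a + (rng.random(10_000) < 0.1)) % 8           # toy run 2: ~90% agreement
mapping, pct = match_clusters(a, b)
print(f"best-match agreement: {pct:.1%}")
```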
We discussed various ways to look at clusters. In the last project iteration, we created scatter plots to compare several biomes in the context of two variables. We also created box plots for each variable and each biome. With 24 variables instead of 4, it will be much harder to "see" the comparisons between clusters. However, we agreed that Michael should create box plots for every variable and every cluster (numbering clusters rather than naming them) for his 8-cluster pilot runs. He should make sure that the clusters are ordered the same for every box plot, so that a quick comparison will show which of the 24 variables are causing certain clusters to be considered "different" from others (a plotting sketch follows).
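A sketch of those box plots, assuming hypothetical arrays X (cells x 24 variables) and labels (an 8-cluster assignment) in place of Michael's pilot outputs:

```python
# Sketch only: one plot per variable, one box per cluster, fixed cluster order.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 24))            # toy data: 5,000 cells, 24 variables
labels = rng.integers(8, size=5000)        # toy 8-cluster assignment

fig, axes = plt.subplots(4, 6, figsize=(18, 10), sharex=True)
for v, ax in enumerate(axes.flat):
    groups = [X[labels == c, v] for c in range(8)]   # clusters always in order 0-7
    ax.boxplot(groups)                               # boxes at positions 1-8
    ax.set_title(f"variable {v + 1}", fontsize=8)
fig.tight_layout()
fig.savefig("cluster_boxplots.png", dpi=150)
```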
Baseline years
We agreed that because 1971–2000 are the most recent three complete decades available, and because working with complete decades aids in simplifying data and matching with other studies, these years could be used as a baseline for climate data for the purposes of clustering. There was some discussion of whether these years were unusually warm compared to years prior to 1970, but we decided that since climate is a moving target, it would be hard to find any period that would be unassailable.
Scope: How much of Canada to include?
There was a discussion of whether we should try to do the AK modeling and Canadian modeling all as one linked effort. It was agreed that this makes the most sense; however, there is also concern that we don't want to sacrifice resolution for AK if programming constraints would limit us, and we also don't want to confuse the model with ecosystems that are very unlikely to move up into Alaska, such as Atlantic or Hudson Bay ecosystems. Another concern was the Arctic Shield, where granitic bedrock drives the vegetation as much as – or more than – climate.
Either way, we will need to find "break points" that are defensible, since even the Canadian model doesn't need far eastern ecosystems, and SNAP only has projection data for part of this region.
After the meeting, Nancy and Karen looked at Ecozone and Ecoregion maps for Canada (see below) and decided that it would be logical to exclude the Hudson Plains (driven by proximity of the Bay), as well as the eastern halves of the Taiga Shield and Southern Arctic; the Northern Arctic; the Arctic Cordillera; the Atlantic Maritime; and the Mixedwood Plains. We thought we should include part of the Boreal Shield, with the break point determined along ecoregion lines, including regions 87, 88, 89, and possibly 90, 91, and 95, but excluding the remainder, due to lake effects or simply being too far east.
We might also come up with a logical break point based on literature review.
Next steps and homework
We agreed that we need to hold a December meeting during the week of the 13th. This meeting would be a full day for the core team, with others joining for as long as possible – perhaps only a half day. The morning of the 16th is not good for Karen. This meeting will include the Canada team via teleconference (or in person if possible). At this meeting we will review all the clustering methods and decide on the steps forward, including ensemble approaches, different models, and different emissions levels. Nancy will schedule this meeting, with feedback from all potential participants.
Suggested participants: Michael, Falk, Karen, Joel, Evie, Dave Verbyla, Tom Paragi, Dave Douglas, Wendy Loya, Philip Martin (or another Arctic LCC representative, perhaps Jennifer Jenkins?), Jennifer Barnes (née Allen)?, Dustin Rice (SNAP IT), Canadian reps (everyone on Evie's email from last week), plus Troy Hegel (Whitehorse – a former student of Falk's).
Nancy will work on a ppt for next Tuesday morning's phone meeting with the Canadian group, and will get a draft of these notes to Evie by COB Friday. She will call Evie Monday at 10:00.
We will have a group teleconference Wednesday the 27th at 10:00, and Nancy will send a reminder to the group, including the call-in number.
Nancy will talk to Dustin at SNAP about getting Michael a space to work every Friday at the SNAP office – this has already been done. Nancy will try to find out more about Rich Boone's work on the Saskatoon system ending up in AK (based on soil science?). Nancy will email everyone the news links from press coverage of Phase I of the project.