Monday, June 16, 14
Canadian Crop Yield Forecaster
(Version V3.0)
Software Documentation
By
David S. Zamar &
Yinsuo Zhang
Table of Contents
Table of Figures
Preface
Requirements
System Requirements
Input Datasets
Architecture and Design
The CCYF Method
Design and Implementation
Technical
External Methods
Function: LoadData
Function: ccyf.model.selection
Function: ccyf.mcmc
Function: ccyf.lm.model.plot
Function: computeLMPerformance
Internal Methods
Function: fitSelectedCarModel
Function: fitLinearModelPerCar
Function: findBestNeighbours
Function: generateNearestNeigbourBootStrapSamples
Function: dBetaPosterior
Function: rBetaKernel
Function: dBetaKernel
Function: dY
Function: theta.MCMC
Function: sim.x
Function: ryNext
Function: sim.y
Function: carPrediction
Function: rep.row
Function: compute.BPR2
Function: computeEI
Function: computeCRM
Function: computeRMSE
Function: computeRRMSE
Function: computeMRE
Function: appendAR1Term
Function: plotCARs
Function: cv1.rob.adj.noscale
Function: rcvModelSelection
Function: robR2w
User Guide
Installation of Package Dependencies and Setup
Setting up the Forecast Data
Model Selection
Forecasting the Crop Yield
Summarizing the Crop Yield Forecast Results
Appendix
Key Variables
Frequently Asked Questions and Troubleshoot
References
Table of Figures
Figure 1: Flowchart of the CCYF Model
Preface
The following document is a reference manual for a simplified version of the Canadian Crop Yield Forecaster (CCYFR2.0S) model. This version excludes the input data generating modules and can therefore be applied only when all the required near-real-time data have been collected. The document is split into four sections. Section 1 discusses the basic requirements of the CCYF model. Section 2 provides an overview of the software and its design. Section 3 contains the documentation of the code, algorithms and interfaces. Section 4 is a manual designed for end users and provides step-by-step instructions on how to generate results for a crop yield outlook report. The Appendix contains answers to frequently asked questions and a listing of key input variables.
Requirements
This chapter describes the software/hardware requirements of the CCYF model as well as the format and content of required input datasets.
System Requirements
The CCYF model is implemented in R and can be run on any platform that supports R, such as Windows and Mac OS X. The software was built and tested using 64-bit R version 3.0.3 and uses several R packages, which are installed by the InstallPackages_CCYF.R script listed in Table (1). These packages must be installed prior to running the module. In addition, the CCYF module is made up of several R files, listed in Table (1), which must be imported into R prior to use. Please refer to Section 4 (User Guide) for specific instructions on setting up and running the CCYF module in R.
The following system requirements are intended to serve only as a guideline. For large datasets and long simulations more memory and a faster CPU may be necessary.
· Windows XP SP3, Windows 7 or 8, OS X Lion
· Intel Core i7 Processor @ 2.40 GHz
· 2GB RAM
Table 1: Required R Files
No. / R File / Description
1 / CCYF_3_0S.R / Main program that controls parameter settings, inputs, outputs and all the modeling processes.
2 / InstallPackages_CCYF.R / Installs all the required packages. Only required the first time the model is run on a computer.
3 / LoadPackages_CCYF.R / Loads the packages required for a model run. Required each time a new R session is started.
4 / externalMethodsCCYF.R / External functions/modules that are called directly by “CCYF_3_0S.R”.
5 / internalMethodsCCYF.R / Internal functions/modules that are used by the external functions/modules.
Input Datasets
The CCYF_V3.0S requires one input dataset, which contains both the model training (historical) data and the near-real-time forecast data. The historical data include crop yield, harvested area, monthly aggregated agroclimate indices, three-week average remote sensing NDVI data and any other variables of interest. Observations (one per year and CAR) correspond to rows and variables to columns. CAR refers to the Census Agricultural Region, the spatial unit for which survey yields are released by Statistics Canada. The first row must contain the column labels. The first four columns of the dataset must be "YEAR", "CARUID", "Yield" and “Area”. The near-real-time forecast data (typically for the current year) have the same format as the historical data. All missing data are entered as “-999”. The first few rows of a sample input dataset are shown in Table (2). The number or numbers trailing each agroclimate variable name correspond to the month or months that the data represent, e.g., SumP_5 and SumP_58 represent the total precipitation of May and of May to August, respectively. Each NDVI-related variable is prefixed with “NDVI”, and the numbers or letters trailing the prefix give the Julian week numbers or the identity of the NDVI value, e.g., the suffix “28_30” denotes the average NDVI value of Julian weeks 28, 29 and 30, while “NDVI_Max” represents the maximum NDVI value of the growing season.
Table 2: Input Dataset
Year / CARUID / Yield / Area / Seeding_JDay / SumP_5 / SumGDD_5 / AvgSI_5 / …
1987 / 1100 / 52.2 / 9000 / 158.2 / 57.97 / 136.72 / 0.57 / …
1988 / 1100 / 46.1 / 11500 / 160 / 78.42 / 176.58 / 0.56 / …
1989 / 1100 / 53.7 / 9500 / 153.9 / 86.57 / 226.82 / 0.57 / …
… / … / … / … / … / … / … / … / …
2012 / 5908 / 48.3 / 71400 / 130 / 52.79 / 144.51 / 0.65 / …
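The conventions described above (the -999 missing-value code and the month suffix on agroclimate variable names) can be illustrated with a minimal R sketch. The helper name decode_months and the inline sample rows are assumptions made for illustration and are not part of the CCYF code; one value has been replaced by -999 purely to show the recoding, and the helper assumes single-digit months.

```r
# Illustrative rows in the Table 2 layout. The -999 entry is inserted here
# only to demonstrate the missing-value code; it is not real data.
csv_text <- "Year,CARUID,Yield,Area,Seeding_JDay,SumP_5,SumGDD_5,AvgSI_5
1987,1100,52.2,9000,158.2,57.97,136.72,0.57
1988,1100,46.1,11500,160,-999,176.58,0.56
1989,1100,53.7,9500,153.9,86.57,226.82,0.57"

# na.strings recodes the documented missing-value code as NA while reading.
dat <- read.csv(text = csv_text, na.strings = "-999")

# Hypothetical helper: decode the trailing month digits of an agroclimate
# variable name, e.g. "SumP_5" -> 5 and "SumP_58" -> 5:8 (May to August).
# Assumes single-digit months; not intended for names such as Seeding_JDay.
decode_months <- function(var_name) {
  suffix <- sub("^.*_", "", var_name)
  digits <- as.integer(strsplit(suffix, "")[[1]])
  if (length(digits) == 1) digits else seq(digits[1], digits[2])
}
```

Reading the file with na.strings keeps the recoding in one place, so later steps can rely on the usual R handling of NA values.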
The number of observations required by the CCYF method depends strongly on the number of explanatory variables included in the analysis. Table (4) may be used to estimate the number of observations required per CAR when fitting a chosen model with p independent parameters and a regression coefficient value R. The values in Table (4) were computed assuming a significance level of 0.05 and a power of 0.8.
Table 4: Minimum Sample Size for Multiple Regression
R / p = 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10
0.60 / 16 / 21 / 24 / 27 / 29 / 32 / 33 / 36 / 37 / 38
0.65 / 12 / 17 / 20 / 23 / 24 / 27 / 28 / 31 / 32 / 33
0.70 / 10 / 13 / 16 / 19 / 20 / 23 / 24 / 25 / 28 / 29
0.80 / 8 / 11 / 14 / 15 / 16 / 19 / 20 / 21 / 24 / 25
0.85 / 6 / 9 / 10 / 13 / 14 / 15 / 16 / 17 / 20 / 21
0.90 / 4 / 7 / 8 / 9 / 10 / 11 / 12 / 13 / 14 / 15
0.95 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / 11 / 12 / 15
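As a convenience, the table can be encoded as a small lookup in R. This is only a sketch: the function name required_obs is hypothetical, the values are transcribed from Table 4 (with the 0.60 and 0.65 rows read as the two value rows that follow those labels), and combinations not tabulated return NA rather than an interpolated value.

```r
# Values transcribed from Table 4 (significance level 0.05, power 0.8).
# Rows are indexed by R, columns by the number of parameters p.
min_n <- rbind(
  "0.60" = c(16, 21, 24, 27, 29, 32, 33, 36, 37, 38),
  "0.65" = c(12, 17, 20, 23, 24, 27, 28, 31, 32, 33),
  "0.70" = c(10, 13, 16, 19, 20, 23, 24, 25, 28, 29),
  "0.80" = c( 8, 11, 14, 15, 16, 19, 20, 21, 24, 25),
  "0.85" = c( 6,  9, 10, 13, 14, 15, 16, 17, 20, 21),
  "0.90" = c( 4,  7,  8,  9, 10, 11, 12, 13, 14, 15),
  "0.95" = c( 4,  5,  6,  7,  8,  9, 10, 11, 12, 15)
)
colnames(min_n) <- 1:10

# Hypothetical helper: look up the minimum number of observations per CAR.
# Returns NA when the (R, p) combination is not tabulated.
required_obs <- function(R, p) {
  row <- formatC(R, format = "f", digits = 2)
  if (!(row %in% rownames(min_n)) || !(p %in% 1:10)) return(NA_real_)
  unname(min_n[row, as.character(p)])
}
```

For example, required_obs(0.70, 3) looks up the entry for R = 0.70 and p = 3.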
Architecture and Design
This Section provides a statistical overview of the CCYF method and gives a detailed description of how the software was designed and implemented.
The CCYF Method
The CCYF algorithm is made up of a sequence of steps. Within each step, a specific set of statistical techniques is used to accomplish a given task. The following is a breakdown of the steps that make up the CCYF algorithm and the statistical techniques employed in each.
1. Robust Least Angle Regression
Robust least angle regression (R-LARS) is used to produce an initial ranking of the explanatory variables for each CAR [4]. Currently, the top 5 R-LARS ranked variables are passed on to the next step, which performs robust cross validation to obtain a final regression model for each CAR.
2. Robust Cross Validation
Robust cross validation is iteratively performed on all subsets of the top R-LARS ranked variables selected in the previous step [2]. The combination of variables that minimizes the median absolute error (MAE) is chosen as the best-fit model.
3. Identification of Neighboring CARs
An important property of the CCYF method is that it makes use of data (information) from neighboring CARs when conducting inference for a given CAR. The program identifies the neighbors automatically. A CAR's neighbors do not need to be physically close, but are instead required to share similarities in the correlation structure of their data. A given CAR's neighbors are identified by applying its chosen regression model to all potential neighbors. The neighbors are then ranked according to the predictive performance of the chosen model when applied to their data. Currently, the top 3 neighbors are selected for each CAR.
4. Markov Chain Monte Carlo
An empirical joint prior distribution of the regression model parameters for a given CAR is obtained by residual sampling from its neighboring CARs [6]. A Markov Chain Monte Carlo (MCMC) algorithm is used to sample from the posterior distribution of the regression model parameters for each CAR.
5. Random Forest Algorithm
The random forest algorithm is used to simulate the values of any unobserved variables at the time the forecast is made. Complete data is obtained by combining the simulated values for unobserved variables with the values of those variables that have already been observed.
6. Crop Yield Forecast and Corresponding MCMC Credible Intervals
For a given CAR, the posterior distribution of crop yield is obtained by evaluating the fitted regression models (sampled from the joint posterior distribution of the model parameters) on the simulated complete data. The median of the posterior distribution is used as a point estimate for the crop yield forecast, while the Monte Carlo standard error is used to construct a credible interval.
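Step 2 of the list above can be illustrated with a minimal sketch of an all-subsets search scored by cross-validated median absolute error. This is not the CCYF implementation: ordinary lm() fits and leave-one-out cross validation are used as simple stand-ins for the robust procedure, the function names cv_median_abs_error and select_by_cv are hypothetical, and the data are synthetic.

```r
# Leave-one-out cross-validation score: median absolute prediction error.
# lm() is a stand-in here; the CCYF method uses robust fits instead.
cv_median_abs_error <- function(dat, response, vars) {
  f <- reformulate(vars, response = response)
  errs <- vapply(seq_len(nrow(dat)), function(i) {
    fit <- lm(f, data = dat[-i, , drop = FALSE])
    abs(dat[[response]][i] - predict(fit, newdata = dat[i, , drop = FALSE]))
  }, numeric(1))
  median(errs)
}

# Score every non-empty subset of the candidate variables and keep the one
# with the smallest cross-validated median absolute error.
select_by_cv <- function(dat, response, candidates) {
  subsets <- unlist(lapply(seq_along(candidates), function(k) {
    combn(candidates, k, simplify = FALSE)
  }), recursive = FALSE)
  scores <- vapply(subsets,
                   function(v) cv_median_abs_error(dat, response, v),
                   numeric(1))
  list(vars = subsets[[which.min(scores)]], score = min(scores))
}

# Synthetic example (not CCYF data): Yield depends on x1 only.
set.seed(1)
toy <- data.frame(x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))
toy$Yield <- 50 + 5 * toy$x1 + rnorm(30, sd = 0.5)
best <- select_by_cv(toy, "Yield", c("x1", "x2", "x3"))
```

With 5 candidate variables, as passed on from step 1, the search covers 31 non-empty subsets, which is why an exhaustive search remains practical at this stage.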
Design and Implementation
The CCYF model is designed to forecast crop yield by taking into account as much relevant information as possible. For this reason, knowledge from several sources of information, such as agroclimate, remote sensing and even plant phenology, may be used as input to the CCYF model. The CCYF model is meant to generate forecasts for pre-specified spatial regions, such as the Census Agricultural Regions (CARs) of Canada. The current implementation assumes that the data are at the CAR level; however, other census divisions may be provided. Historic agroclimate and crop yield data for each CAR must be given as input. Crop yield data should be given for each year, whereas the agroclimate and other data (i.e. remote sensing or phenology) may be provided as monthly averages over the growing season. Figure 1 is a flowchart of the CCYF model. It illustrates the processing of information as it flows through the four core stages (input data, model selection, analysis, inference and results) of the CCYF model algorithm. The rectangular red boxes connect each algorithm stage with the R function that was designed to implement it.
Figure 1: Flowchart of the CCYF Model
Technical
This section provides detailed documentation of the CCYF model R code. Each implemented method is included along with a description of its use, input variables required and any output returned.
External Methods
The methods shown here are found in the “externalMethodsCCYF.R” file of the CCYF project folder. These methods may be called directly by the end user.
Function: LoadData
Function that is used to load the historic crop yield and agroclimate input dataset.
Arguments:
1. filename: A csv file including the crop yield data and agroclimate data for each CAR. The data for each year and for each CAR should appear on a separate row. The first row must have the column names. The first four columns must be labeled "Year", "CARUID", "Yield" and “Acres”.
2. na.action: How to handle observations with missing data (default is to omit them). Can be one of "na.omit", "na.pass", or "na.fail".
Returns:
1. dat: A dataframe object containing the data from the read-in file.
2. years: A vector of (unique) sorted years.
3. carIDs: A vector of (unique) CAR IDs.
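The documented interface can be sketched as follows. This is a hypothetical stand-in written only from the description above, not the actual LoadData code: the name load_data_sketch, the use of read.csv, and the recoding of -999 as NA (taken from the Input Datasets section) are all assumptions, and the sample rows are illustrative.

```r
# Hypothetical stand-in for LoadData, written only from its description.
load_data_sketch <- function(filename,
                             na.action = c("na.omit", "na.pass", "na.fail")) {
  na.action <- match.arg(na.action)
  # Assumption: missing values are coded as -999 in the input file.
  dat <- read.csv(filename, na.strings = "-999")
  # Apply the chosen missing-data handler (stats::na.omit, na.pass, na.fail).
  dat <- match.fun(na.action)(dat)
  list(dat    = dat,
       years  = sort(unique(dat$Year)),
       carIDs = unique(dat$CARUID))
}

# Usage with a temporary file; the -999 entry is made up for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Year,CARUID,Yield,Acres,SumP_5",
             "1988,1100,46.1,11500,78.42",
             "1987,1100,52.2,9000,-999",
             "2012,5908,48.3,71400,52.79"), tmp)
res <- load_data_sketch(tmp)
```

With the default "na.omit", the row containing the missing value is dropped before the years and CAR IDs are extracted; "na.pass" keeps it and "na.fail" stops with an error.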
Function: ccyf.model.selection
Function that performs model selection for each CAR.
Arguments:
1. historic.dat: The historic crop yield data (formatted and returned by the LoadData method) to be used for performing model selection.
2. resp.col: The column index or label corresponding to the response variable. Should be "Yield".
3. group.col: The column index or label of the grouping column. Should be "CARUID".
4. must.include.col: The column indices or labels of those variables that must be included in the model. For example, "Year".
5. trunc.at.zero.col: The column indices or labels of those variables that should not contain negative values (i.e. are all ≥ 0).
6. penalty.weights: The vector of weights for the variables. Weights must not be given for the response, group or variables that must be included.
7. rank: The maximum rank of the models to be selected (not including the must.include variables).
Returns:
1. car.models: A list containing the selected model for each CAR. An lmrob object is returned for each CAR. For information about lmrob objects, load the robustbase package in R and type ?lmrob in the open R session. In particular, see the Value section of the lmrob documentation.
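The rule that penalty weights cover only the candidate variables can be sketched in base R as follows. The column names follow the Table 2 layout; treating "Area" as a candidate predictor, using equal weights, and naming the weight vector are assumptions of this sketch rather than documented behavior of ccyf.model.selection.

```r
# Column roles as described in the argument list above.
resp.col         <- "Yield"
group.col        <- "CARUID"
must.include.col <- "Year"

# Illustrative column names in the Table 2 layout.
all.cols <- c("Year", "CARUID", "Yield", "Area",
              "Seeding_JDay", "SumP_5", "SumGDD_5", "AvgSI_5")

# Candidate predictors: every column except the response, the grouping
# column and the must-include variables. Keeping "Area" is an assumption.
candidates <- setdiff(all.cols, c(resp.col, group.col, must.include.col))

# One weight per candidate variable; equal weights are only a placeholder.
penalty.weights <- setNames(rep(1, length(candidates)), candidates)
```

Building the weights from the candidate list, rather than typing them by hand, helps keep the vector length in step with the dataset when columns are added or removed.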
Function: ccyf.forecast
Function that generates forecasts based on the mcmc chain generated by the ccyf.mcmc method and a dataset of currently observed values of the explanatory variables.