Canadian Crop Yield Forecaster

Monday, June 16, 2014

(Version V3.0)

Software Documentation

David S. Zamar &

Yinsuo Zhang

Table of Contents

Table of Figures

Preface

Requirements

System Requirements

Input Datasets

Architecture and Design

The CCYF Method

Design and Implementation

Technical

External Methods

Function: LoadData

Function: ccyf.model.selection

Function: ccyf.mcmc

Function: ccyf.lm.model.plot

Function: computeLMPerformance

Internal Methods

Function: fitSelectedCarModel

Function: fitLinearModelPerCar

Function: findBestNeighbours

Function: generateNearestNeigbourBootStrapSamples

Function: dBetaPosterior

Function: rBetaKernel

Function: dBetaKernel

Function: dY

Function: theta.MCMC

Function: sim.x

Function: ryNext

Function: sim.y

Function: carPrediction

Function: rep.row

Function: compute.BPR2

Function: computeEI

Function: computeCRM

Function: computeRMSE

Function: computeRRMSE

Function: computeMRE

Function: appendAR1Term

Function: plotCARs

Function: cv1.rob.adj.noscale

Function: rcvModelSelection

Function: robR2w

User Guide

Installation of Package Dependencies and Setup

Setting up the Forecast Data

Model Selection

Forecasting the Crop Yield

Summarizing the Crop Yield Forecast Results

Appendix

Key Variables

Frequently Asked Questions and Troubleshoot

References

Table of Figures

Figure 1: Flowchart of the CCYF Model

Preface

This document is a reference manual for a simplified version of the Canadian Crop Yield Forecaster (CCYFR2.0S) model. This version excludes the input data generating modules and can therefore be applied only once all the required near-real-time data have been collected. The document is split into four Sections. Section 1 discusses the basic requirements of the CCYF model. Section 2 provides an overview of the software and its design. Section 3 contains the documentation of code, algorithms and interfaces. Section 4 is a manual designed for end users and provides step-by-step instructions on how to generate results for a crop yield outlook report. The Appendix contains answers to frequently asked questions and a listing of key input variables.

Requirements

This chapter describes the software/hardware requirements of the CCYF model as well as the format and content of required input datasets.

System Requirements

The CCYF model is implemented in R and can be run on any platform that supports R, such as Windows and Mac OS X. The software was built and tested using the 64-bit build of R version 3.0.3 and uses several R packages, which are listed in Table (1). These packages must be installed prior to running the module. In addition, the CCYF module is made up of several R files, listed in Table (1), which must be imported into R prior to use. Please refer to Section 4 for specific instructions on setting up and running the CCYF module in R.

The following system requirements are intended only as a guideline. Large datasets and long simulations may require more memory and a faster CPU.

· Windows XP SP3, Windows 7 or 8, OS X Lion

· Intel Core i7 Processor @ 2.40 GHz

· 2GB RAM

Table 1: Required R Files

No. / R File / Description
1 / CCYF_3_0S.R / Main program that controls parameter settings, inputs, outputs and all the modeling processes.
2 / InstallPackages_CCYF.R / Installs all the required packages. Only required the first time the model is run on a computer.
3 / LoadPackages_CCYF.R / Loads the packages required for a model run. Required each time a new R session is started.
4 / externalMethodsCCYF.R / External functions/modules that are called directly by “CCYF_3_0S.R”.
5 / internalMethodsCCYF.R / Internal functions/modules that are used by the external functions/modules.

Input Datasets

CCYF_V3.0S requires one input dataset, which contains both the model training (historical) data and the near-real-time forecast data. The historical data include crop yield, harvested area, monthly aggregated agroclimate indices, three-week average remote sensing NDVI data and any other variables of interest. Observations (one per year and CAR) correspond to rows and variables to columns. CAR refers to the Census Agricultural Region, the spatial unit for which Statistics Canada releases survey yields. The first row must contain the column labels. The first four columns of the dataset must be "YEAR", "CARUID", "Yield" and “Area”. The near-real-time forecast data (typically for the current year) have the same format as the historical data. All missing values are entered as “-999”. The first few rows of a sample input dataset are shown in Table (2).

The number or numbers trailing each agroclimate variable name give the month or months that the variable covers; e.g., SumP_5 and SumP_58 represent total precipitation for May and for May to August, respectively. Each NDVI-related variable is prefixed with “NDVI”, and the numbers or letters trailing it give the Julian week numbers or the identity of the NDVI value; e.g., the suffix “28_30” denotes the average NDVI value over Julian weeks 28, 29 and 30, while “NDVI_Max” represents the maximum NDVI value of the growing season.

Table 2: Input Dataset

Year / CARUID / Yield / Area / Seeding_JDay / SumP_5 / SumGDD_5 / AvgSI_5 / ….
1987 / 1100 / 52.2 / 9000 / 158.2 / 57.97 / 136.72 / 0.57 / …
1988 / 1100 / 46.1 / 11500 / 160 / 78.42 / 176.58 / 0.56 / ….
1989 / 1100 / 53.7 / 9500 / 153.9 / 86.57 / 226.82 / 0.57 / ….
… / …. / … / … / …. / … / … / …
2012 / 5908 / 48.3 / 71400 / 130 / 52.79 / 144.51 / 0.65 / …

The number of observations required by the CCYF method depends strongly on the number of explanatory variables included in the analysis. Table (4) may be used to estimate the number of observations required per CAR when fitting a chosen model with p explanatory variables and a multiple correlation coefficient R. The values in Table (4) were computed assuming a significance level of 0.05 and a power of 0.8.

Table 4: Minimum Sample Size for Multiple Regression

R / p=1 / p=2 / p=3 / p=4 / p=5 / p=6 / p=7 / p=8 / p=9 / p=10
0.60 / 16 / 21 / 24 / 27 / 29 / 32 / 33 / 36 / 37 / 38
0.65 / 12 / 17 / 20 / 23 / 24 / 27 / 28 / 31 / 32 / 33
0.70 / 10 / 13 / 16 / 19 / 20 / 23 / 24 / 25 / 28 / 29
0.80 / 8 / 11 / 14 / 15 / 16 / 19 / 20 / 21 / 24 / 25
0.85 / 6 / 9 / 10 / 13 / 14 / 15 / 16 / 17 / 20 / 21
0.90 / 4 / 7 / 8 / 9 / 10 / 11 / 12 / 13 / 14 / 15
0.95 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / 11 / 12 / 15
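As a rough illustration of how Table (4) might be used in practice, the following R sketch transcribes the table into a matrix and flags CARs that have fewer observations than the table suggests. The helper names `required_obs` and `cars_below_minimum` are hypothetical and are not part of the CCYF code; the transcription assumes that the first two data rows of the table belong to R = 0.60 and R = 0.65.

```r
# Table 4 transcribed as a matrix: rows are R, columns are p = 1..10.
# Assumption: the first two data rows belong to R = 0.60 and R = 0.65.
min_n <- matrix(
  c(16, 21, 24, 27, 29, 32, 33, 36, 37, 38,
    12, 17, 20, 23, 24, 27, 28, 31, 32, 33,
    10, 13, 16, 19, 20, 23, 24, 25, 28, 29,
     8, 11, 14, 15, 16, 19, 20, 21, 24, 25,
     6,  9, 10, 13, 14, 15, 16, 17, 20, 21,
     4,  7,  8,  9, 10, 11, 12, 13, 14, 15,
     4,  5,  6,  7,  8,  9, 10, 11, 12, 15),
  nrow = 7, byrow = TRUE,
  dimnames = list(R = c("0.60", "0.65", "0.70", "0.80", "0.85", "0.90", "0.95"),
                  p = as.character(1:10))
)

# Hypothetical helper: minimum observations per CAR for a tabulated R and p.
required_obs <- function(R, p) {
  min_n[sprintf("%.2f", R), as.character(p)]
}

# Hypothetical helper: IDs of CARs with fewer rows than the tabulated minimum.
cars_below_minimum <- function(dat, R, p) {
  counts <- table(dat$CARUID)
  names(counts)[as.vector(counts) < required_obs(R, p)]
}
```

Only the R values that appear in the table can be looked up; intermediate values would need interpolation or a conservative choice of the next lower R.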

Architecture and Design

This Section provides a statistical overview of the CCYF method and gives a detailed description of how the software was designed and implemented.

The CCYF Method

The CCYF algorithm is made up of a sequence of steps. Within each step, a specific set of statistical techniques is used to accomplish a given task. The following is a breakdown of the steps that make up the CCYF algorithm and the statistical techniques employed in each.

1. Robust Least Angle Regression

Robust least angle regression (R-LARS) is used to produce an initial ranking of the explanatory variables for each CAR [4]. Currently, the top 5 R-LARS ranked variables are passed on to the next step, which performs robust cross validation to obtain a final regression model for each CAR.

2. Robust Cross Validation

Robust cross validation is iteratively performed on all subsets of the top R-LARS ranked variables selected in the previous step [2]. The combination of variables that minimizes the median absolute error (MAE) is chosen as the best-fit model.
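The subset search described in this step can be sketched in base R as follows. This is a simplified illustration, not the CCYF implementation: it assumes ordinary least squares (`lm`) in place of a robust fit and leave-one-out cross validation in place of the robust cross validation of [2], and the function name `select_by_cv` is hypothetical.

```r
# Sketch: score every non-empty subset of candidate variables by its
# leave-one-out median absolute prediction error and keep the best one.
# Assumption: lm() stands in for the robust regression used by CCYF.
select_by_cv <- function(dat, response, candidates) {
  # Enumerate every non-empty subset of the candidate variables.
  subsets <- unlist(
    lapply(seq_along(candidates),
           function(k) combn(candidates, k, simplify = FALSE)),
    recursive = FALSE
  )
  # Leave-one-out median absolute prediction error for one subset.
  cv_error <- function(vars) {
    f <- reformulate(vars, response = response)
    errs <- vapply(seq_len(nrow(dat)), function(i) {
      fit <- lm(f, data = dat[-i, , drop = FALSE])
      pred <- predict(fit, newdata = dat[i, , drop = FALSE])
      abs(dat[[response]][i] - unname(pred))
    }, numeric(1))
    median(errs)
  }
  scores <- vapply(subsets, cv_error, numeric(1))
  best <- which.min(scores)
  list(variables = subsets[[best]], cv_error = scores[best])
}
```

With 5 candidate variables there are 31 non-empty subsets, so an exhaustive search of this kind stays cheap at the CAR level.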

3. Identification of Neighboring CARs

An important property of the CCYF method is that it makes use of data (information) from neighboring CARs when conducting inference for a given CAR. The program identifies neighbors automatically. A CAR's neighbors do not necessarily need to be physically close, but are instead required to share similarities in the correlation structure of their data. The neighbors of a given CAR are identified by applying its chosen regression model to all potential neighbors. The candidates are then ranked according to the predictive performance of the chosen model when applied to their data. Currently, the top 3 neighbors are selected for each CAR.
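One way to picture the neighbor ranking is the sketch below. It assumes that "applying the chosen model" means refitting the target CAR's model formula to each candidate CAR's data and that predictive performance is measured by the ordinary R-squared of that fit; both are assumptions made for illustration, and `rank_neighbours` is a hypothetical name, not the CCYF function findBestNeighbours.

```r
# Sketch: rank candidate CARs by how well the target CAR's model formula
# fits their own data, and return the best few as neighbours.
# Assumptions: lm() as the fitting method, R-squared as the performance measure.
rank_neighbours <- function(dat, target_car, model_formula, n_neighbours = 3) {
  others <- setdiff(unique(dat$CARUID), target_car)
  r2 <- vapply(others, function(id) {
    fit <- lm(model_formula, data = dat[dat$CARUID == id, , drop = FALSE])
    summary(fit)$r.squared
  }, numeric(1))
  ranked <- others[order(r2, decreasing = TRUE)]
  head(ranked, n_neighbours)
}
```

Because the ranking depends only on model fit, two CARs that are far apart geographically can still be selected as neighbors, which matches the description above.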

4. Markov Chain Monte Carlo

An empirical joint prior distribution of the regression model parameters for a given CAR is obtained by residual sampling from its neighboring CARs [6]. A Markov Chain Monte Carlo (MCMC) algorithm is used to sample from the posterior distribution of the regression model parameters for each CAR.
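The idea of building an empirical prior by residual sampling can be illustrated with a residual bootstrap. The sketch below is one plausible reading of that step, not the CCYF implementation: it refits an ordinary `lm` to resampled residuals from a single neighboring CAR and collects the coefficient vectors. The function name `bootstrap_coefficients` is hypothetical.

```r
# Sketch: residual bootstrap of regression coefficients for one neighbour.
# Assumption: lm() stands in for the regression method used by CCYF.
bootstrap_coefficients <- function(neighbour_dat, model_formula, n_boot = 200) {
  fit <- lm(model_formula, data = neighbour_dat)
  fitted_vals <- fitted(fit)
  res <- residuals(fit)
  response <- all.vars(model_formula)[1]
  draws <- replicate(n_boot, {
    # Rebuild the response from fitted values plus resampled residuals.
    boot_dat <- neighbour_dat
    boot_dat[[response]] <- fitted_vals + sample(res, replace = TRUE)
    coef(lm(model_formula, data = boot_dat))
  })
  t(draws)  # one row per bootstrap replicate, one column per coefficient
}
```

Stacking the rows returned for each of a CAR's neighbors would give an empirical sample of coefficient vectors that could serve as a prior for the MCMC step.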

5. Random Forest Algorithm

The random forest algorithm is used to simulate the values of any unobserved variables at the time the forecast is made. Complete data is obtained by combining the simulated values for unobserved variables with the values of those variables that have already been observed.

6. Crop Yield Forecast and Corresponding MCMC Credible Intervals

For a given CAR, the posterior distribution of crop yield is obtained by evaluating the fitted regression models (sampled from the joint posterior distribution of the model parameters) on the simulated complete data. The median of the posterior distribution is used as a point estimate for the crop yield forecast, while the Monte Carlo standard error is used to construct a credible interval.
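Given a vector of posterior yield draws, the point estimate and an interval can be summarized as in the sketch below. Note the substitution: the text above states that CCYF builds its credible interval from the Monte Carlo standard error, whereas this illustration uses the simpler equal-tailed quantile interval. The function name `summarise_forecast` and the default level of 0.90 are assumptions.

```r
# Sketch: posterior median and an equal-tailed quantile interval.
# Assumption: a quantile-based interval replaces the Monte Carlo
# standard error construction described in the text.
summarise_forecast <- function(yield_draws, level = 0.90) {
  alpha <- (1 - level) / 2
  c(median = median(yield_draws),
    lower  = unname(quantile(yield_draws, alpha)),
    upper  = unname(quantile(yield_draws, 1 - alpha)))
}
```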

Design and Implementation

The CCYF model is designed to forecast crop yield by taking into account as much relevant information as possible. For this reason, several sources of information, such as agroclimate, remote sensing and even plant phenology data, may be used as input to the CCYF model. The CCYF model is meant to generate forecasts for pre-specified spatial regions, such as the Census Agricultural Regions (CARs) of Canada. The current implementation assumes that the data are at the CAR level; however, other census divisions may be provided. Historic agroclimate and crop yield data for each CAR must be given as input. Crop yield data should be given for each year, whereas the agroclimate and other data (e.g. remote sensing or phenology) may be provided as monthly averages over the growing season. Figure 1 is a flowchart of the CCYF model. It illustrates the processing of information as it flows through the four core stages (input data, model selection, analysis, inference and results) of the CCYF model algorithm. The rectangular red boxes connect each algorithm stage with the R function that was designed to implement it.

Figure 1: Flowchart of the CCYF Model

Technical

This section provides detailed documentation of the CCYF model R code. Each implemented method is included along with a description of its use, input variables required and any output returned.

External Methods

The methods shown here are found in the “externalMethodsCCYF.R” file of the CCYF project folder. These methods may be called directly by the end user.

Function: LoadData

Function that is used to load the historic crop yield and agroclimate input dataset.

Arguments:

1. filename: A csv file including the crop yield data and agroclimate data for each CAR. The data for each year and for each CAR should appear on a separate row. The first row must have the column names. The first four columns must be labeled "Year", "CARUID", "Yield" and “Acres”.

2. na.action: How to handle observations with missing data (default is to omit them). Can be one of "na.omit", "na.pass", or "na.fail".

Returns:

1. dat: A dataframe object containing the data from the read-in file.

2. years: A vector of (unique) sorted years.

3. carIDs: A vector of (unique) CAR IDs.
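The documented contract of LoadData can be mimicked with the short sketch below. It is not the CCYF source: the name `load_data_sketch` is hypothetical, and the conversion of the missing-value code -999 to NA at read time is an assumption based on the input format described earlier.

```r
# Sketch of a loader that follows the documented arguments and return values.
# Assumptions: -999 is converted to NA on read; na.action names a standard
# R function ("na.omit", "na.pass" or "na.fail").
load_data_sketch <- function(filename, na.action = "na.omit") {
  # Read the CSV, treating the documented missing-value code -999 as NA.
  dat <- read.csv(filename, header = TRUE, na.strings = "-999")
  # Apply the requested missing-data policy by name.
  dat <- match.fun(na.action)(dat)
  list(dat = dat,
       years = sort(unique(dat$Year)),
       carIDs = unique(dat$CARUID))
}
```

With "na.omit", any row containing a missing value is dropped before the years and CAR IDs are extracted, so a CAR whose rows are all incomplete would not appear in carIDs.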

Function: ccyf.model.selection

Function that performs model selection for each CAR.

Arguments:

1. historic.dat: The historic crop yield data (formatted and returned by the LoadData method) to be used for performing model selection.

2. resp.col: The column index or label corresponding to the response variable. Should be "Yield".

3. group.col: The column index or label of the grouping column. Should be "CARUID".

4. must.include.col: The column indices or labels of those variables that must be included in the model. For example, "Year".

5. trunc.at.zero.col: The column indices or labels of those variables that should not contain negative values (i.e. are all >= 0).

6. penalty.weights: The vector of weights for the variables. Weights must not be given for the response, group or variables that must be included.

7. rank: The maximum rank of the models to be selected (not including the must.include variables).

Returns:

1. car.models: A list containing the selected model for each CAR. An lmrob object is returned for each CAR. For information about lmrob objects, load the robustbase package in R and type ?lmrob in the open R session. In particular, see the Value section of the lmrob documentation.

Function: ccyf.forecast

Function that generates forecasts based on the MCMC chain generated by the ccyf.mcmc method and a dataset of currently observed values of the explanatory variables.