The Statistical Evaluation of the ESPAM2 Model

Maxine Dakins

Environmental Science Program and Idaho Water Resources Research Institute

University of Idaho

1776 Science Center Drive, Suite 306

Idaho Falls, ID 83402

Phone: 208-282-7957

Fax: 208-282-7950

Introduction

This project is aimed at helping to evaluate the fit of the Eastern Snake Plain Aquifer Model 2 (ESPAM2). Its purpose is to identify appropriate statistical measures of model fit that will allow the two-year evaluation period (2009-2010) to be compared to the 29-year calibration period (1980-2008). This process is often known as model validation but is here called model evaluation; its goal is to assess the fit of the model to data that were not used in model calibration.

Root Mean Squared Error (RMSE)

A common statistical measure of model fit is the root mean squared error (RMSE) (Hill and Tiedeman 2007). The RMSE is calculated from the objective function for the calibration, the sum of the squared errors (SSE). If weighted residuals were used in the calibration, then the same weighting scheme should be used in calculating both the calibration and evaluation RMSEs. In that case they are weighted RMSEs, but they are referred to here simply as the RMSE.

SSE = Sum of the squared errors (weighted as done in calibration)

RMSE = Root mean squared error = sqrt(SSE/df)

where

df = n – p

n = number of data points and other fixed points used in the calibration

p = number of parameters fitted in the regression

The advantage of the RMSE over simply using the SSE is that taking the square root gives a value in the same units as the data rather than in squared units. Smaller RMSEs result from smaller SSEs and indicate a better model fit; larger RMSEs indicate a worse fit.
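As an illustration, here is a minimal sketch of the calculation in Python, assuming the residuals and calibration weights are available as numeric arrays (the names are hypothetical, not taken from the ESPAM2 workflow):

import numpy as np

def weighted_rmse(residuals, weights, n_params):
    # SSE = weighted sum of squared errors, matching the calibration objective function
    sse = np.sum(np.asarray(weights) * np.asarray(residuals) ** 2)
    # df = n - p: number of points minus number of fitted parameters
    df = len(residuals) - n_params
    return np.sqrt(sse / df)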

Since the evaluation period spans only two years while the calibration period spans 29, comparing the RMSE for the evaluation period against a single RMSE calculated from the entire calibration period may not be the best way to evaluate the fit of the model. One issue with that approach is that it yields only two points to compare; with two points, one will be higher than the other even if the underlying fit is equivalent. An evaluation-period RMSE that is higher than the calibration-period RMSE would be extremely difficult to interpret, since this would be expected to happen 50% of the time even when the two fits are truly equivalent. In addition, the RMSE of the model fit may vary through time if the fit is related to periods of wet and dry weather or to any other variable that changes through time.

The suggested approach is to calculate the RMSE for non-overlapping, two-year periods throughout the calibration period: for example, for 1980 and 1981, for 1982 and 1983, for 1984 and 1985, and so on. Using non-overlapping time frames keeps the RMSE values independent of each other, makes for the cleanest comparison, and is the recommended approach. It results in 14 RMSE values for the calibration period (since 29 years cannot be divided evenly into two-year windows, one year is left out).

If more values are desired, overlapping time periods could be used, defining the periods as 1980 and 1981, 1981 and 1982, 1982 and 1983, and so on. Values calculated in this way will not, of course, be independent of each other, as most years contribute to two RMSEs. If desired, RMSEs could be calculated both ways for comparison purposes. This approach results in 28 RMSE values for the calibration period.
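A sketch of both windowing schemes in Python, assuming the residuals and weights have been grouped by calendar year into dictionaries keyed by year (the names below are hypothetical; for simplicity the per-window degrees of freedom are taken as n rather than n - p):

import numpy as np

def period_rmse(years, residuals_by_year, weights_by_year):
    # Pool the weighted residuals over the given years and compute the RMSE
    r = np.concatenate([residuals_by_year[y] for y in years])
    w = np.concatenate([weights_by_year[y] for y in years])
    return np.sqrt(np.sum(w * r ** 2) / r.size)

# step=2 gives non-overlapping windows 1980-81, 1982-83, ..., 2006-07 (14 values);
# step=1 gives overlapping windows 1980-81, 1981-82, ..., 2007-08 (28 values)
def two_year_rmses(residuals_by_year, weights_by_year, step):
    return [period_rmse((y, y + 1), residuals_by_year, weights_by_year)
            for y in range(1980, 2008, step)]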

Once calculated, the RMSE for the evaluation period of 2009 and 2010 can be compared to the distribution of RMSE values from the calibration period. An evaluation RMSE that falls within the range of calibration RMSE values would be considered acceptable and would not invalidate the model. An evaluation RMSE falling outside the range of calibration RMSEs on the high side would be considered troubling and might point to problems with the model's fit to the evaluation data.
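Continuing the sketch above, the comparison itself reduces to a simple range check:

calib_rmses = two_year_rmses(residuals_by_year, weights_by_year, step=2)
eval_rmse = period_rmse((2009, 2010), residuals_by_year, weights_by_year)
acceptable = min(calib_rmses) <= eval_rmse <= max(calib_rmses)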

A positive characteristic of the RMSE is that it is consistent with the objective function: both are based on the weighted sum of squared errors. A downside of this statistical measure of model fit is that outliers, or extreme deviations between the model and the measurements, have a great deal of influence because they are squared before being summed. This means that a small number of large deviations can exert a strong influence on the outcome of the calculation.

Median Absolute Deviation

The median absolute deviation (MAD) is a robust measure of statistical dispersion (Helsel and Hirsch 1993). Robust means that it remains a good measure even if assumptions regarding the data distribution are not met, and that it is unaffected by a small number of outliers or extreme deviations between the model and the measurements.

The median of a set of data values is the 50th percentile of the data: the value that exceeds 50% of the values and is exceeded by 50% of the values. The simplest way to determine a median is to sort the list of values from smallest to largest. In a data set with an odd number of values, the median is the unique middle value; with an even number of values, it is the average of the two middle values.

The median absolute deviation is the median of the absolute values of the deviations from the data’s median.

MAD = median(|Xi – median(X)|)

where

Xi = the ith data point in the set

median(X) = the median of all of the data values in the set

| | = the absolute value of the deviation

Because the median is the middle value of a sorted list of data (or the average of the two middle values if the number of data points is even), it is unaffected by a few large values at the high end of the data set. So, if some unusually large deviations between the model and the measurement exist, the median of the absolute values of the deviations will not be affected.
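A minimal sketch of the MAD calculation, which would be applied to the model residuals:

import numpy as np

def median_abs_deviation(values):
    # MAD = median(|Xi - median(X)|)
    values = np.asarray(values, dtype=float)
    return np.median(np.abs(values - np.median(values)))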

Other Possible Measures to Consider

Coefficient of Determination (R2)

Another commonly used measure of model fit is the coefficient of determination, R2.

R2 = 1 – SSE/SST

where

SST = total sum of squares = sum of the squared deviations of each data point from the mean

The coefficient of determination R2 varies from 0 to 1 but is generally multiplied by 100 and expressed as a percentage. It is interpreted as the percent of the variability in the data that is explained by the model. Values closer to 100% are, of course, better than lower values.
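A sketch of the R2 calculation from paired observed and simulated values (hypothetical names; weights are omitted here but could be applied as in the RMSE):

import numpy as np

def r_squared(observed, simulated):
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    sse = np.sum((observed - simulated) ** 2)        # sum of squared errors
    sst = np.sum((observed - observed.mean()) ** 2)  # total sum of squares
    return 1.0 - sse / sst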

Values of R2 could be calculated for each time period as described in the section on the RMSE. This would result in 14 independent values for the calibration period and one value for the evaluation period. The comparison would proceed as described above: the evaluation R2 would be expected to lie within the range of calibration R2 values.

Interquartile Range

Another robust estimate of dispersion is the interquartile range (IQR), the difference between the 75th percentile of the data values and the 25th percentile (Helsel and Hirsch 1993).

IQR = P75 – P25

where

P75 = the upper quartile, the data value that exceeds 75% of the data and is exceeded by 25% of the data

P25 = the lower quartile, the data value that exceeds 25% of the data and is exceeded by 75% of the data

A strength of the IQR, like the median, is that it is not influenced by the smallest 25% of the values and, more importantly, not influenced by the largest 25% of the deviations between the model and the measurements. So, like the MAD, the IQR will not be affected if some unusually large deviations between the model and the measurements exist.
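A short sketch using numpy's percentile function:

import numpy as np

def interquartile_range(values):
    # IQR = P75 - P25
    p25, p75 = np.percentile(np.asarray(values, dtype=float), [25, 75])
    return p75 - p25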

Graphical Approaches to Evaluating Model Fit Using Residuals

In addition to the graphs already generated, a composite plot of all residuals versus time might be informative. It would help answer the question of how the fit of the model varies over the calibration period. If there are significant differences through time, that might raise a question about how the model fits in times of normal precipitation versus times of high or low precipitation.

Another useful model evaluation plot is the residuals against the model-simulated values. This plot helps determine whether heteroscedasticity exists, that is, whether the variance of the residuals is related to the simulated values.

Finally, the question raised above might be worth addressing directly: how does the model fit during times of differing precipitation? This could be explored by plotting the statistical measures of goodness of fit against a measure of precipitation such as total recharge (a model input) or an independent value such as the Surface Water Supply Index (SWSI). Such plots might help identify the causes behind model lack of fit.
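A sketch of the first two plots with matplotlib, assuming arrays of observation times, residuals, and model-simulated values (hypothetical names):

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Composite plot of all residuals versus time
ax1.scatter(times, residuals, s=10)
ax1.axhline(0.0, color="gray", linewidth=1)
ax1.set_xlabel("Time")
ax1.set_ylabel("Residual")

# Residuals versus simulated values; a fan shape suggests heteroscedasticity
ax2.scatter(simulated, residuals, s=10)
ax2.axhline(0.0, color="gray", linewidth=1)
ax2.set_xlabel("Simulated value")
ax2.set_ylabel("Residual")

fig.tight_layout()
plt.show()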

Recommended Approach

I recommend using the RMSE and the MAD as the measures of model fit, calculated with the same weighting scheme used in model calibration for successive, non-overlapping two-year periods. Once calculated, the values for the evaluation period should be compared to the set of values from the calibration period. An evaluation value that falls within the range of calibration values would be considered acceptable and would not invalidate the model. An evaluation value falling outside the range of calibration values on the high side would be considered troubling and would require additional work to resolve.

References

Helsel, D. R. and R. M. Hirsch. 1993. Statistical Methods in Water Resources. Elsevier, NY, NY.

Hill, M. C. and C. R. Tiedeman. 2007. Effective Groundwater Model Calibration: With Analysis of Data, Sensitivities, Predictions, and Uncertainty. Wiley-Interscience, Hoboken, NJ.