
STRATEGIES FOR THE VERIFICATION OF ENSEMBLE FORECASTS
(Laurence J. Wilson)
Environment Canada

Abstract: The subject of ensemble forecast verification naturally divides itself into two parts: verification of the ensemble distribution of deterministic forecasts and verification of probability forecasts derived from the ensemble. The verification of the ensemble distribution presents unique problems because it inevitably involves comparing the full ensemble of forecast values against a single observation at the verifying time. A strategy to accomplish this kind of evaluation of the ensemble distribution is described, along with other methods that have been used to evaluate aspects of the ensemble distribution.

By comparison, evaluation of probability forecasts extracted from the ensemble is simpler; any of the existing measures applied to probability forecasts can be used. Following the work of Murphy, the attributes of probability forecasts are described, along with a set of verification measures which are used to evaluate these attributes. Examples of the application of each technique to ensemble forecasts are shown, and the interpretation of the verification output is discussed. The verification methods described include the Brier and rank probability scores, skill scores based on these, reliability tables, and the relative operating characteristic (ROC), which has been widely used in the past few years to evaluate ensemble forecasts.

1. Introduction

The subject of evaluation of ensemble forecasts naturally divides into two parts, according to the use of ensemble forecasts. Since the output of an ensemble forecast system is a distribution of weather elements, valid at each time and place, it is necessary to consider those methods that apply to the evaluation of the ensemble distribution itself. These are discussed in section 2. Then, since ensemble forecasts are often used to estimate probabilities, it is also necessary to consider methods that are used to evaluate probability forecasts. These are discussed in section 3.

2. Verification of the ensemble distribution

Until the advent of ensemble prediction systems, verification of forecasts from numerical weather prediction models involved simply matching the model forecast in space and time with the corresponding observation. With a single model run, there could usually be a one-to-one match between forecast and observation, on which numerous quantitative verification measures could be computed. An ensemble system produces a distribution of forecast values for each point in time and space, but there is still only a single observation value. The challenge of ensemble verification is to devise a quantitative method to compare the distribution against specific observation values.

As a starting point, one might consider what constitutes an “accurate” forecast distribution. What characteristics should the forecast distribution possess in order to be considered of high quality as a forecast? Two desirable characteristics of the ensemble distribution, “consistency” and “non-triviality”, have been stated by Talagrand (1997). A forecast distribution is said to be consistent if, for each possible probability distribution f, the a posteriori verifying observations are distributed according to f in those circumstances when the system predicts the distribution f (Talagrand, 1997). In other words, if one could compile a sufficiently large set of similar cases, where similar distributions had been forecast by the ensemble system, then the distribution of observations for those cases should match the ensemble distribution. In practice, it is nearly impossible to compile a sufficiently large sample of cases that are similar enough, because of the large number of degrees of freedom in each distribution, so it would be very difficult to verify consistency directly.

The second desirable characteristic of ensemble forecasts stated by Talagrand is “non-triviality”. An ensemble forecast system is said to be non-trivial if it forecasts different distributions on different occasions. This is similar in some ways to the concept of sharpness in a forecast system: the system must give different forecasts for different times and locations. A system which always forecasts the climatological value of the weather element, or the climatological distribution, would be a trivial forecast system.

2.1 A probability-based score and skill score

Wilson et al. (1999) proposed a verification system for ensemble forecasts that attempts to evaluate the ensemble distribution as a basis for estimating the observation value. The concept is illustrated in Figure 1, which shows three hypothetical distributions: a relatively sharp distribution that might be associated with a short range ensemble forecast, a distribution with greater variance, as might be predicted at medium range by an ensemble system, and a broader distribution which might be the climatological distribution for that date and place. The verifying observation is indicated as -3 degrees C. If +/- 1 degree C is considered a sufficiently accurate forecast for temperature in this case, one can determine the forecast probability within one degree of the observation using each of the distributions (the shaded areas in the figure). This probability can then be used directly as a score for the forecast. The probability determined from the climatological distribution can represent the score value for a climatological forecast, and can be used to build a skill score in the usual format:

ss = (score_f - score_c) / (1 - score_c)        (1)

where score_f is the score value for the forecast and score_c is the score value for climatology. Figure 1 shows normal distribution curves. If the ensemble is small, better estimates of the probability can be obtained by first fitting a distribution (Wilson et al. (1999) suggest a normal distribution for temperature and a gamma distribution for precipitation and wind speed), then calculating probabilities from the fitted distribution. Experiments with data from the 51-member ECMWF ensemble system indicated that, for ensembles of this size, it is not necessary to fit a distribution; use of the empirical ensemble distribution gave similar results.
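To make the calculation concrete, the following sketch (in Python, assuming NumPy and SciPy are available) computes the probability score from a normal distribution fitted to an ensemble of temperature forecasts, and the skill score of equation (1). The ensemble values, the assumed climatological distribution and the observation are illustrative only and are not taken from the paper.

```python
# A minimal sketch (not the author's code) of the probability score and the
# skill score of equation (1), assuming a normal fit to the ensemble and a
# +/- 1 degree C window around the observation.
import numpy as np
from scipy import stats

def probability_score(ensemble, obs, half_width=1.0):
    """Probability assigned by a normal fit to the ensemble to the interval
    (obs - half_width, obs + half_width)."""
    mu, sigma = np.mean(ensemble), np.std(ensemble, ddof=1)
    fitted = stats.norm(mu, sigma)
    return fitted.cdf(obs + half_width) - fitted.cdf(obs - half_width)

def skill_score(score_f, score_c):
    """Skill relative to climatology, as in equation (1); a perfect score is 1."""
    return (score_f - score_c) / (1.0 - score_c)

# Illustrative usage: hypothetical ensemble of 2 m temperatures (deg C), an
# assumed climatological distribution for the date, and an observation of 11 C.
ensemble = np.array([9.5, 10.2, 10.8, 11.1, 11.6, 12.0, 12.4, 13.0, 13.5])
climatology = stats.norm(10.0, 4.5)   # assumed climate mean and spread
obs = 11.0
sf = probability_score(ensemble, obs)
sc = climatology.cdf(obs + 1.0) - climatology.cdf(obs - 1.0)
print(f"sf={sf:.2f}  sc={sc:.2f}  ss={skill_score(sf, sc):.2f}")
```

For a large ensemble such as the 51-member ECMWF system, the fitted distribution could simply be replaced by the fraction of members falling within the +/- 1 degree window, i.e. the empirical ensemble distribution mentioned above.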

Figure 1. Schematic representation of probability scoring system as it might be applied to ensemble temperature forecasts at a specific location. Example shows probabilities (hatched areas) for the observed value +/-1 degree C.

This type of scoring system evaluates the probability distribution in the vicinity of the observation only; the shape and spread of the distribution far from the observation value are not considered directly. The score is sensitive to both the spread and the location accuracy of the ensemble with respect to the observation: if the observation coincides with the ensemble mode, a relatively high score value can be obtained, especially if the spread (variance) of the ensemble distribution is small. On the other hand, if the forecast misses, in the sense that the observation does not lie near a mode of the ensemble, then the probability score will be low. In such cases, greater ensemble spread, corresponding to an indication of greater uncertainty, can lead to higher scores than would be the case for lower ensemble spread. A perfect score (probability = 1.0) occurs when all the ensemble members predict within the acceptable range; that is, the score is maximized when the ensemble predicts the observed value with confidence.

Figure 2 shows an example of the scoring system applied to a specific ensemble forecast for a specific station, Pearson International Airport, Toronto. The histogram shows the actual ensemble distribution of temperature forecasts for this location, valid May 17, 1996. A normal distribution has been fitted to the ensemble (crosses in the figure) and to the climatological temperature distribution for the valid date (circles). The verifying temperature, indicated by an X on the abscissa, was 11 degrees C. For the short range forecast (left side), the verifying temperature is close to the mean of the distribution, giving a probability score of 0.27 for occurrence within 1 degree C of the observation. The observed temperature was near normal; the corresponding score value for climatology is 0.19. The skill score in this case, obtained from (1), is 0.11. The forecast has achieved positive skill by forecasting a sharper distribution than the climatological distribution.

The right side of Figure 2 shows the fitted distributions and scores for a 7-day ensemble forecast verifying on the same day. This forecast is about as sharp as the shorter range one (i.e., the spread of the fitted and empirical distributions is about the same as for the short range forecast), but the ensemble has totally missed the observation: all the members forecast temperatures that are too low. As a result, the score value is 0.0 in this case, and the climatological score is 0.19, the same as before. The skill score is negative, -0.23, because the forecast has missed in a situation which is climatologically normal.

Figure 2. Verification of 72 h (left) and 168 h (right) ensemble 2 m temperature forecasts for Pearson International Airport. The histogram represents the actual ensemble distribution, the fitted normal distribution is represented by crosses, and the corresponding climatological distribution is represented by circles. The computed score (sf), skill score (ss) and climate score (sc) are all shown in the legend. The observed temperature is shown by the X on the abscissa.

For ensemble precipitation amount forecasts (QPF), it has been suggested that a gamma distribution might be more appropriate than the normal distribution (Wilson et al., 1999). Figure 3 shows an example of the probability score and skill score computed for a 72 h ensemble forecast of precipitation accumulation over 12 h. A gamma distribution has been fitted to both the ensemble forecast and the climatological distribution of 12 h precipitation amounts. The score and skill score were computed in this case using a geometric window for a correct forecast. The lower boundary of the correct range was set to 0.5 times the upper boundary, so that the forecast is considered correct if it lies between (observation/sqrt(2)) and (observation*sqrt(2)). This takes account of the fact that small differences in the predicted precipitation are more important for small amounts than for large amounts. Under this scheme, for example, ranges such as (2.0, 4.0), (5.0, 10.0) and (20.0, 40.0) might all be used, depending on the observed precipitation amount. A window factor of 2 seemed to be strict enough in tests, but other factors could be used to determine the window for a correct forecast. Tests using a smaller window factor of 1.5 indicated that the results are not strongly sensitive to the size of the window in this range. It is nevertheless important to report the selected window size with the results so that they can be interpreted.
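As a rough illustration of this windowing scheme, the sketch below (again assuming SciPy is available; the ensemble amounts are hypothetical) fits a gamma distribution to a set of 12 h precipitation amounts and scores it against an observation using the factor-of-2 geometric window described above.

```python
# A minimal sketch (hypothetical values) of the score for 12 h precipitation
# amounts: a gamma distribution is fitted to the ensemble, and the "correct"
# window is (obs/sqrt(2), obs*sqrt(2)), i.e. a window factor of 2.
import numpy as np
from scipy import stats

def precip_score(ensemble_mm, obs_mm, factor=2.0):
    """Probability assigned by the fitted gamma to the geometric window around obs."""
    # Fit a gamma distribution to the ensemble amounts, with location fixed at 0.
    shape, loc, scale = stats.gamma.fit(np.asarray(ensemble_mm), floc=0.0)
    fitted = stats.gamma(shape, loc=loc, scale=scale)
    lower, upper = obs_mm / np.sqrt(factor), obs_mm * np.sqrt(factor)
    return fitted.cdf(upper) - fitted.cdf(lower)

# Illustrative usage: for an observed amount of 2.0 mm the window is (1.414, 2.828) mm.
members_mm = [0.4, 0.8, 1.2, 1.5, 1.9, 2.3, 2.6, 3.4, 4.1]
print(f"score = {precip_score(members_mm, obs_mm=2.0):.2f}")
```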

In Fig. 3, the ensemble has indicated that some precipitation is likely, whereas the climatological distribution favours little or no precipitation. 2.0 mm of precipitation was observed on this occasion, and so the forecast shows positive skill of 0.36. The full resolution operational model (ECMWF) has also predicted precipitation, a higher amount than all the members of the ensemble. The window for a correct forecast is (1.414, 2.828), geometrically centered on the observed value of 2.0 mm. The deterministic forecast from the full resolution model lies outside this window, and would have to be assigned a score of 0.0. Thus the ensemble has provided a more accurate forecast than the full resolution model in this case.

Figure 3. Ensemble distribution (bars), fitted gamma distribution (crosses) and corresponding climate distribution (circles) for a 72 h quantitative precipitation forecast for Pearson International Airport. Score value (sf), climate score (sc) and skill score (ss) are given in the upper right corner, along with the deterministic forecast from the ECMWF T213 model and the verifying observation.

The examples shown so far have been for single ensemble forecasts at specific locations. Of course, the score and skill score can be computed over a set of ensemble forecasts and used to evaluate average performance over a period and for many locations. One such experiment used the probability score to compare the performance of the ECMWF ensemble and the Canadian ensemble over the same period in 1997. Table 1 summarizes the data used in the experiment.

Table 1. Sample used in comparison of the ECMWF 51-member and Canadian 9-member ensemble forecasts.

ECMWF Data / Canadian Data
Verification period / 151 days, Jan to May, 1997 / 148 days, Jan to May, 1997
Stations / 23 Canadian stations / 23 Canadian stations
Parameters / 2m temperature, 12h precipitation, 10m wind / 2m temperature, 12h precipitation, 10m wind
Ensemble size / 51 member ensembles / 9 member ensembles
Ensemble model / T106 model / T63 model
Projections / 10 days (12h) from 12 UTC / 10 days (12h) from 00 UTC

Figure 4 shows an example from this verification, again for Pearson International Airport. Figure 4a, for the Canadian ensemble, shows that the skill with respect to climatology remains positive to about day 6, and furthermore is asymptotic to 0 skill. One would expect this skill score to be asymptotic to 0 as the ensemble spread approaches the spread of the climatological distribution and the conditioning impact of the initial state is lost. It is also an indication that the model’s climatology approaches the observed climatology (i.e., the model’s temperature forecasts are unbiased). Score values and skill are higher for forecasts verifying at 12 UTC, which suggests the Canadian ensemble system forecasts early morning temperatures, near the time of the minimum temperature, more accurately than early evening temperatures. Figure 4b, for the 51-member ECMWF ensemble, shows positive skill with respect to climatology throughout the 10-day run, though the skill is near 0 by day 7 of the forecast. The ECMWF model also exhibits a diurnal variation in accuracy and skill, but in the opposite sense: forecasts in the evening are more accurate than in the early morning. This might be expected because the ECMWF model is tuned for the European area, which has a maritime climate. Toronto’s climate is more continental in nature, with stronger nighttime cooling than might be experienced in a maritime climate.

Figure 4c shows verification results using the score and skill score for the 60-member combined ECMWF and Canadian ensembles. Since the scores for the separate ensembles are similar in magnitude, combining the ensembles does not have a large effect overall on the scores. The combined result appears to be a weighted average of the two individual results. The diurnal effect has been mostly eliminated.


Figure 4. Verification of Canadian (a), ECMWF (b) and combined (c) ensemble 2m temperature forecasts for Pearson International Airport for January to May, 1997.

Using the score over a set of precipitation amount forecasts (12 h accumulation) indicated somewhat lower skill for precipitation forecasting than for temperature forecasting. Figure 5 shows three examples of score and skill score values averaged over the five-month test period, for three Canadian stations. The verification was carried out using a window factor of 2.0. At St. John’s, Newfoundland, on Canada’s east coast, the skill was slightly positive until about day 2 of the forecast, then slightly negative. Once again, the score values seemed to be asymptotic for longer projections, but tending towards a negative value rather than 0. For Toronto and Winnipeg, the skill was never positive at any projection, and the Winnipeg results are poorer than the Toronto results. Both show a strong diurnal variation, which can be attributed to differences in the observed frequency of precipitation in the 00 UTC and 12 UTC verifying samples. The performance differences among these stations are most likely related to differences in the climatology of precipitation occurrence. At St. John’s, the frequency of occurrence of precipitation is relatively high; this station has a maritime climate. Toronto and especially Winnipeg have more continental climates, with generally lower frequencies of precipitation occurrence. In terms of the ensemble forecast, the negative skill is likely caused by too many ensemble members forecasting small amounts of precipitation in situations when none occurs. These are the situations in which the climatological distribution, which favours the non-occurrence of precipitation, will have higher accuracy. In other words, it is possible that both models are biased towards forecasting too much precipitation, or towards forecasting a little precipitation too often, since the skill score is asymptotic to negative values. To check this, it would be worthwhile to compare the climatological distribution with the predicted distribution compiled from all the ensemble forecasts.

The above results show that the score and the skill score are quite sensitive to differences in performance of the ensemble. It is also relatively easy to interpret the score values, even for a single forecast. The score applies to any variable for which observations exist. Figures 6 to 8 illustrate its diagnostic use for verification of 500 mb height forecasts. For this experiment, the score was calculated at every model gridpoint, using the analysis to give the verifying heights. After some experimentation with the window width, 4 dm was chosen as the best compromise between smoothness of the result and strictness of the verification. At 2 dm, every small-scale deviation in the score values was visible, with the result that the spatial distribution of the score was too noisy, while at 6 dm, forecasts tended to be “perfect” well into the forecast period, and spatial variations in the score did not show up at all in the shorter range forecasts.
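A gridded version of the score is straightforward to compute from the empirical ensemble distribution. The sketch below (NumPy only; the array shapes and random fields are placeholders rather than real forecasts or analyses) evaluates, at every gridpoint, the fraction of members lying within 4 dm (decametres) of the verifying analysis.

```python
# A minimal sketch (synthetic fields, not real data) of a gridded score:
# the empirical probability, at each gridpoint, that the ensemble 500 mb
# height lies within +/- 4 dm of the verifying analysis.
import numpy as np

def gridded_score(members, analysis, half_width=4.0):
    """
    members  : array (n_members, ny, nx) of ensemble 500 mb heights (dm)
    analysis : array (ny, nx) of verifying analysed heights (dm)
    Returns the fraction of members within +/- half_width of the analysis
    at each gridpoint, i.e. the empirical probability score field.
    """
    hits = np.abs(members - analysis[None, :, :]) <= half_width
    return hits.mean(axis=0)

# Illustrative usage with random fields standing in for a 51-member ensemble.
rng = np.random.default_rng(0)
analysis = 552.0 + 5.0 * rng.standard_normal((73, 144))             # dm
members = analysis[None, :, :] + 6.0 * rng.standard_normal((51, 73, 144))
print(gridded_score(members, analysis).mean())
```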