The Big Data for Official Statistics Competition Results and Lessons Learned

Bogomil Kovachev[1], Martin Karlberg[2], Boro Nikic[3], Bogdan Oancea[4] and Paolo Righi[5]

Keywords: official statistics, big data, nowcasting, forecasting

1. Introduction

The Big Data for Official Statistics Competition (BDCOMP) was the first official statistics nowcasting competition at EU level with a big data focus. It was carried out under the framework of the European Statistical System (ESS[6]) Vision 2020 Project BIGD[7] and was organised by Eurostat in collaboration with the ESS Big Data Task Force. The BDCOMP scientific committee was composed of colleagues from various member and observer organisations of the ESS.

The term nowcasting is used to signify the forecasting of a statistical indicator with extremely tight timeliness – sometimes even before the reference period is over. The need for good nowcasting techniques is currently increasing given the constant demand for releasing statistics faster.

The main purpose of BDCOMP was to evaluate different methodologies with respect to their applicability to the nowcasting process. The task for participants in the competition was to nowcast official statistics for EU Member States. The accuracy of these nowcasts is the measure of success for each nowcasting approach.

The call for participation[8], which was circulated to various academic institutions and international organisations, contains many details of the competition that are beyond the scope of this abstract. The competition was additionally publicised on the Eurostat website and promoted at various events on related topics in which Eurostat participated.

After a relatively high number of expressions of interest, five participants[9] finally made it through the whole competition process:

P1 – Prof. George Djolov: University of Stellenbosch and Statistics South Africa

P2 – Team ETLAnow: Research Institute of the Finnish Economy

P3 – JRC Team: Joint Research Centre of the European Commission

P4 – Dr. Roland Weigand

P5 – University of Warwick Forecast Team

2. Methods

2.1. Design

The competition is designed to provide the opportunity for an out-of-sample objective evaluation of nowcasting methods for official statistics. This is achieved by requiring participants to submit a forecast (i.e. nowcast) before the official number is out.

The indicator set (described below) is the basis for dividing the competition into tracks (one track corresponds to one indicator) and then further into tasks – for each indicator there is a separate series for every EU Member State (and two more for the EU and euro area aggregates). The combination of an indicator and an EU Member State is called a task (e.g. “HICP for Luxembourg” is a task).

The competition is divided into rounds – one round per month for the year 2016. Since different official statistics have different release dates, each round has two submission deadlines.

2.2. Indicators used

The following indicators were part of the competition:

Track 1: Unemployment in levels

Track 2: Harmonised Index of Consumer Prices (HICP) all items

Track 3: HICP excluding energy

Track 4: Tourism – nights spent at tourist accommodation establishments

Track 5: Tourism – nights spent at hotels

Track 6: Volume of retail trade

Track 7: Volume of retail trade excluding automotive fuel

These seven indicators were chosen to fit the purposes of the competition. They can generally be considered economically important, so that there is sufficient interest in them. Also important for BDCOMP is the fact that they are monthly indicators: this ensures that within a year there is a sufficient number of data points for a reasonable evaluation.

2.3. Usage of big data and reproducible submissions

Despite the focus on big data, we allowed competitors to use more traditional techniques. This was done in order to enlarge participation and to provide a benchmark.

The main goal of the competition being the evaluation of methods, we encouraged the participants to make open submissions, from which official statistics and the scientific community at large benefit the most. Unfortunately, it is not always possible (or desirable) to disclose the data used for the competition, and in order to make a submission fully reproducible it is necessary to supply both the methods (source code) and the data. As a result, not all submissions complied with the request, although most did.

2.4. List of evaluation measures

The evaluation measures for BDCOMP were designed to provide an objective comparison of methods. As no single measure can capture all desirable properties of estimates several measures were used in the competition. The following are the official evaluation criteria for BDCOMP:

Criterion 1: Relative mean squared error

This is perhaps the most obvious measure to include in a competition of this kind. Intuitively it represents how far the submitted forecasts were from the actual outturn. The formula used is:

RMSE = (1/N) ∑_{i=1}^{N} ((F_i − R_i) / R_i)²

where F_i is the forecast for the i-th period and R_i is the official release for the same period (the benchmark).
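As an illustration only (this is not the official scoring code, just a direct transcription of the formula):

```python
def relative_mse(forecasts, releases):
    """Relative mean squared error of forecasts F_i against the first
    official releases R_i (the benchmark)."""
    if len(forecasts) != len(releases):
        raise ValueError("series must have the same length")
    return sum(((f - r) / r) ** 2 for f, r in zip(forecasts, releases)) / len(forecasts)
```

Dividing each error by the release makes the measure scale-free, so tasks of very different magnitude (e.g. unemployment in DE vs. LU) can be compared.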

Criterion 2: Directional accuracy

The measure is included in order to provide another look at the performance of a method – namely, whether it correctly predicts the direction of change in the indicator. In order to be able to draw meaningful conclusions, this measure is only applied to seasonally adjusted data or data that is officially not considered to have significant seasonality. Based on our results it seems that using directional accuracy as the sole discriminatory measure could lead to many ties, due to the limited size of the set of possible values.
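A sketch of how such a measure could be computed, under the assumption that the direction of change is taken relative to the previous official release (the precise convention is not spelled out in this abstract):

```python
def directional_accuracy(forecasts, releases):
    """Share of periods in which the forecast change (relative to the
    previous official release) has the same sign as the released change.
    The reference point is an assumption; the competition rules may differ."""
    hits = 0
    for i in range(1, len(forecasts)):
        forecast_change = forecasts[i] - releases[i - 1]
        actual_change = releases[i] - releases[i - 1]
        if forecast_change * actual_change > 0 or forecast_change == actual_change == 0:
            hits += 1
    return hits / (len(forecasts) - 1)
```

Because the score can only take a handful of values over twelve monthly rounds, ties between methods are common, as noted above.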

Criterion 3: Density estimate

The aim of including this measure was to allow participants to express a level of confidence in the prediction. The measure is designed to allow comparison between predictions of indicators that have different magnitude (e.g. unemployment levels in a big country and a small country). The formula used is:

L = (∏_{i=1}^{N} P_i)^{1/N}

where P_i is an appropriately modified likelihood of the official release for period i under the submitted distribution for the same period[10].
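Computationally, this score is a geometric mean of likelihoods, which is best evaluated in log space; a minimal sketch (the "appropriate modification" of the likelihoods mentioned above is not reproduced here):

```python
import math

def density_score(likelihoods):
    """Geometric mean of the likelihoods P_i of the official releases under
    the submitted predictive distributions, computed in log space to avoid
    numerical underflow over long series."""
    n = len(likelihoods)
    return math.exp(sum(math.log(p) for p in likelihoods) / n)
```

The geometric mean rewards consistently well-calibrated predictive distributions: a single period with near-zero likelihood drags the whole score down.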

2.5. Organisational challenges during the competition

Before starting with the organisation, we had already anticipated some of the challenges that lay ahead; naturally, we were not prepared for everything. We list here most of what we deem important to bear in mind for anyone preparing to run a similar event.

  • We did not make registration an official requirement due to the relatively short deadline between the announcement of the competition and the first submission deadline. This led to a situation where we had no established communication channel with the participants until the start of BDCOMP. Consequently, any updates and clarifications had to be made during the competition.
  • Strict deadlines need to be maintained, which calls for a reliable system of collecting participants’ forecasts. We required participants to use a special submission template for ease of automatic processing. This was complied with quite consistently, which allowed us to use scripts for processing[3]. We used email, which worked reasonably well with the participation that we had. A dedicated submission system would have offered some further advantages with respect to machine readability, but would have caused a major issue in case of a system failure shortly before a deadline.
  • It turned out that for two of our indicators – HICP and HICP excluding energy – there was a change of reference year planned for 2016. We discovered this shortly after launching the competition. For the headline figure (HICP) Eurostat continued to produce numbers with the old reference year, so the official target remained unchanged. For HICP excluding energy, however, this was not the case, and the benchmark (and in one case the participants’ submissions) needed to be re-referenced. This is not an ideal situation, since the re-referencing is done with officially published numbers only, so some precision is lost.
  • We set the rule that for any indicator it is the first official release that counts as the benchmark. Eurostat’s dissemination chain is not adapted to such a requirement: due to subsequent revisions of published numbers it is not easy to discover later what the initially released figure was, so we had to operate an automated daily download and in the end produced the benchmark series from there. During this process we discovered that the dissemination chain is not completely static, and adjustments needed to be made frequently.
  • Initially we did expect rounding issues to be a problem in certain areas[11]. What we did not foresee, however, was that for the density estimate measure our precision requirement would turn out to be inadequate. For example, HICP is reported by some countries with a precision of one decimal place; consequently we requested submissions to be made with this precision. For density estimation, though, this is inadequate: suppose that one considers[12] that a reasonable estimate for a particular figure is 100.05 with a standard deviation of 0.05 – in effect giving equal probability to 100.0 and 100.1. Under our original rules one would be forced to choose between 100.0 and 100.1, which in effect makes density estimation useless. We had to update the rules during the competition and remove the first month from the evaluation.
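The precision problem in the last point is easy to reproduce; a small sketch of the 100.05 example, assuming a normal predictive distribution:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal predictive distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# A forecaster whose point estimate is 100.05 with standard deviation 0.05
# assigns essentially the same density to the only two admissible
# one-decimal submissions, 100.0 and 100.1 (equal by symmetry):
at_100_0 = normal_pdf(100.0, 100.05, 0.05)
at_100_1 = normal_pdf(100.1, 100.05, 0.05)
```

Whichever of the two admissible values is submitted, it misrepresents the forecaster's actual belief, which peaks between them.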

3. Results[13]

3.1. Track 1 – Unemployment

Only P2 proposes a forecasting approach using a big data source (Google Trends). The team uses a seasonal AR(1) model with the contemporaneous value of the Google Index as an additional exogenous covariate. The model is closely related to the work shown in the seminal paper of Choi and Varian [1].
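A minimal sketch of an AR(1) model with a contemporaneous exogenous index, fitted by least squares on synthetic data (the coefficients, data and the omission of the seasonal part are all illustrative; this is not the actual ETLAnow specification):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic monthly series and a contemporaneous "search index" standing in
# for the Google Index (illustrative data only).
n = 60
index = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t - 1] + 0.5 * index[t] + rng.normal(scale=0.1)

# AR(1) with an exogenous contemporaneous covariate, fitted by least squares:
#   y_t = c + phi * y_{t-1} + beta * x_t + e_t
X = np.column_stack([np.ones(n - 1), y[:-1], index[1:]])
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
c, phi, beta = coef

# Nowcast for the next period: the search index for the target month is
# typically available before the official release; a placeholder is used here.
next_index = index[-1]
nowcast = c + phi * y[-1] + beta * next_index
```

The appeal for nowcasting is exactly the timing: the exogenous index for the target month is observable well before the official figure is released.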

P1 introduces the robust nowcasting algorithm (RNA). The emphasis of the RNA is on robustness, i.e. flexibility in the face of imperfect data conditions, to accommodate its possible uses across different time series. At BDCOMP the RNA is applied to nowcasting the monthly Irish unemployment levels in nominal (i.e. raw) and seasonally adjusted terms.

Here and in all other tracks P4 applies a series of univariate benchmark methods which were automatically applied to the time series of interest. They were taken from the open-source R package “forecast” [2] and based on the ets (Error-Trend-Seasonal or ExponenTial Smoothing) and auto.arima functions there. The model selection in both cases is done using the information criteria AIC and BIC. auto.arima chooses the best ARIMA model, while ets ([3] and [4]) represents a state space framework for automatic forecasting with exponential smoothing techniques which estimates the initial states and the smoothing parameters by optimising the likelihood function. The ets function was applied with the default parameters, which means an automatically selected error type, trend type and season type. Both functions were applied to all EU countries and separately to three different subsets of countries. A fifth prediction model was defined as a simple average of the predictions given by the first four models (auto.arima and ets with models chosen by AIC and BIC).
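As a much-reduced stand-in for the ets/auto.arima machinery (which this sketch does not reproduce), the idea of an automatic univariate benchmark can be illustrated with simple exponential smoothing whose smoothing parameter is chosen by in-sample one-step squared error rather than by likelihood:

```python
def ses_forecast(series, alpha):
    """One-step-ahead forecast from simple exponential smoothing."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def fit_ses(series):
    """Choose the smoothing parameter by minimising the in-sample
    one-step-ahead squared error (a crude analogue of the likelihood-based
    selection performed by ets)."""
    def sse(alpha):
        level, total = series[0], 0.0
        for x in series[1:]:
            total += (x - level) ** 2
            level = alpha * x + (1 - alpha) * level
        return total
    return min((a / 100 for a in range(1, 100)), key=sse)

series = [102.0, 101.5, 103.2, 104.0, 103.6, 104.8]  # illustrative data
alpha = fit_ses(series)
nowcast = ses_forecast(series, alpha)
```

The real ets function additionally selects error, trend and season types and estimates initial states; the sketch only conveys the "fit automatically, forecast one step ahead" workflow that made these methods usable across all tracks without manual tuning.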

Table 1. Track 1 results

Legend: BM = benchmark, BD = a big data approach, RNA = robust nowcasting

* - BD was more than 2σ ahead

** - only track where the RNA was fielded

† - only benchmark approaches

We observe univariate (benchmark) approaches performing quite well on the point estimate accuracy measure – models were retrained automatically (not changed) every month. The RNA was fielded for only one task (IE) and it performed best with respect to directional accuracy.

The big data method of P2 participated only in point estimate accuracy – it was the best performing method for one of the tasks, BG.

3.2. Tracks 2 and 3: HICP and HICP excluding energy

P3 developed 20 methods for HICP prediction, based on classical ARIMA modelling and on more elaborate techniques including econometric modelling using leading economic indicators, random forest models with data from Eurostat and the Billion Prices Project, and Xgboost models, and applied them to 8 EU Member States. For a description of the Xgboost (Extreme Gradient Boosting) algorithm the interested reader can consult [5], while the Billion Prices Project is described in [6]. The experiments carried out by P3 showed that ARIMA models gave the best results for HICP prediction. The best results for most of the countries were obtained with two models that are essentially weighted averages of five ARIMA models fitted on different lengths of the time series.

P5 used three basic models, along with combinations of them using equal and log score weights, and applied them to the euro area countries and to FR, DE, IT and UK. Combinations using many exogenous variables were used, rendering the approaches by P5 “almost big data”. The three basic models were a Bayesian vector autoregressive (BVAR) model described in [7], a conditional unobserved component stochastic volatility model (UC-SV) described in [8] and [9], and a univariate autoregressive model AR(p) with the optimum value of the lag chosen by BIC. For the first two models, different economic variables were used for each country as regressors.
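The combination step can be sketched as follows; the exact weight normalisation used by P5 is not given in this abstract, so the log-score weighting below is an assumption:

```python
import math

def combine_equal(forecasts):
    """Equal-weight combination of competing point forecasts."""
    return sum(forecasts) / len(forecasts)

def combine_log_score(forecasts, avg_log_scores):
    """Weight each model by exp(average past log score): models that assigned
    more density to past outcomes get more weight (assumed normalisation)."""
    weights = [math.exp(s) for s in avg_log_scores]
    total = sum(weights)
    return sum(w / total * f for w, f in zip(weights, forecasts))
```

With equal past log scores the two schemes coincide; as one model's past density forecasts improve, the log-score combination shifts towards it.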

Table 2. Track 2 results

Legend: BM = benchmark, BD = a big data approach, MV = multivariate approach

FF – Photo finish: second best within around σ/10

† - only benchmark approaches

For Track 2 the big data approach P3Approach12 performed well for several countries – UK, NL, IE, FR (point estimate accuracy). A model with many exogenous variables (“almost” big data) performed best for the EA (point estimate and density estimate accuracy) and for FR (density estimate accuracy). The P5 approach containing the oil price as exogenous variable performed particularly well only for IT.

For Track 3 there was an unforeseen re-referencing that was announced after the launch of the competition. Consequently, the submissions for the first month had to be discarded for evaluation purposes. Big economies (EA, EU, DE, FR, IT) seem to be easier to forecast than in the case of the HICP headline aggregate (Track 2). Most models with exogenous data outperformed the benchmark models for DE, EA, FR and UK; only for IT was the benchmark better (point estimate accuracy). For directional accuracy the picture is often reversed, suggesting the complementarity of the two measures.

3.3. Tracks 4 and 5: Tourism – nights spent at tourist accommodation and at hotels

A characteristic feature of these tracks is that the first official estimate (which is used as the benchmark in all BDCOMP tasks) is not very stable – data are revised often for some countries. Moreover, IE, EL, LU and the UK were not part of these tracks due to data quality issues.

P3 used Eurostat data for modelling the sub-aggregates total nights spent (B06), total nights spent by residents (B04) and total nights spent by non-residents (B05). The SABRE database, with information about the number of booked flights in future months, was also used. Forecasting was done via ARIMA, random-forest-based regression and Xgboost regression. Some of the ARIMA models used B06, B05 or B04 data, and some used a combination of B04 and B05 data.

Table 3. Track 4 results

Legend: BM = benchmark, BD = a big data approach

FF – Photo finish: second best within around σ/10

† - only benchmark approaches

For all countries big data approaches were used – they were best in 8 cases for Track 4 and in 3 cases for Track 5. A lot of variability is observable in the results: for HR, σ is around 50% and some methods are off by more than 100% on average. It also seems that for some countries March, which was the month of Easter in 2016, proved slightly harder to forecast.

3.4. Tracks 6 and 7: Volume of retail trade including and excluding automotive fuel

These could be regarded as the most unstable of all tracks. Data are often revised, and sometimes the revisions are big.

For these tracks only benchmark approaches (provided by P4) were fielded. An interesting observation is that for many countries the retail trade indicator excluding automotive fuel (Track 7) seems significantly harder to predict than the indicator including automotive fuel (Track 6). For example, the best performing approach for Track 6 for SI has an RRMSE of 3.1%, against 8.2% for Track 7; for MT the scores are 1.3% for Track 6 vs. 5.5% for Track 7.

4. Conclusions

Since the use of big data for official statistics is still a relatively new development, it was expected that participation would be considerably higher if the rules of the competition also allowed traditional forecasting approaches. This assumption seems to have been justified by the quite low proportion of approaches actually using big data. It can further be observed that big data methods did not outperform traditional methods. This can of course be explained by the fact that macroeconomic forecasting is an established discipline with a long tradition, while the introduction of big data into it is still an ongoing process.

Many technical challenges were identified in advance, allowing us to be adequately prepared; however, many remained, as explained in Section 2.5. None of them proved insurmountable, though, so the competition could be carried out to the end and the main objectives were achieved.

Concerning the practical aspects of organising a competition of this kind several remarks are due here:

  • Attracting participants who are dedicated enough to commit to twelve months of submitting with a strict deadline is a major challenge. In this respect, the five participants that BDCOMP actually managed to retain seem a reasonable number.
  • As one can observe in the results, each measure can be used to consider the results from a different angle, which speaks in favour of maintaining the variety in the evaluation. The fact that there is no single final winner does perhaps diminish the competitiveness aspect somewhat, but this seems inevitable due to the reasons described above.
  • From a scientific viewpoint it would have been much better to perform the evaluation against data of a more mature vintage than a first release. However this would have implied a big timing gap between the end of the competition and the evaluation. Alternatively the evaluation could have been made against the latest available release for each month. One approach could be to perform a study on revisions of the indicators one plans to include and to select appropriate vintages per indicator.
  • It is our belief that the official scoring and the analysis done here are only part of the insight that can be gained from this event. Since the data and a lot of the approaches are openly available, they are easy to analyse further. As mentioned above, one obvious possible extension would be to wait until final releases are published for all indicators for the whole of 2016 – a better ground truth – and do the scoring again. Clustering the approaches according to other criteria and evaluating the performance of the clusters is another possible extension.

References

[1] Choi, H. and Varian, H. R., “Predicting the Present with Google Trends”, Economic Record 88(s1) (2012), 2–9.