Nowcasting and model selection in the context of income inequality indicators
The indicators on poverty and income inequality based on the European Union Statistics on Income and Living Conditions (EU-SILC) are an important part of the toolkit for the European Semester. Currently, the indicators on income of year N are only available in the autumn of year N+2, which comes too late for the EU’s policy agenda. Therefore, an improvement in this area represents an important priority for monitoring the effectiveness of social policies at EU level.
The objective of the exercise is to produce flash estimates for the income and poverty indicators collected through EU-SILC about six months after the end of the reference period, well ahead of the release of the regular indicators. The starting point is the main income distribution indicators used in the framework of Europe 2020 and the European Semester: the at-risk-of-poverty rate (AROP) and the income quintile share ratio (QSR or S80/S20). However, an analysis of the trends in EU-SILC (2008-2014) showed that these indicators are rather stable and that the observed yearly changes are often not statistically significant. For early estimates, other indicators can be more sensitive: important shifts at different points of the income distribution throughout the crisis are captured by what we call "positional indicators", namely the at-risk-of-poverty threshold (ARPT) and the deciles. Therefore, our analysis includes both inequality indicators, such as AROP and QSR, and the evolution of income at different points of the income distribution.
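As an illustration of the indicators named above, the following Python sketch derives them from a vector of equivalised disposable incomes. It assumes the standard EU definitions (ARPT at 60% of the median income, AROP as the share of persons below the ARPT, S80/S20 as the ratio of the total income of the top and bottom quintile groups). For brevity it ignores the survey weights used in EU-SILC production; the function names and the income values are hypothetical.

```python
def quantile(sorted_values, p):
    """Linear-interpolation quantile of an already sorted list (0 <= p <= 1)."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def income_indicators(incomes):
    """Unweighted inequality and positional indicators (illustrative only)."""
    xs = sorted(incomes)
    n = len(xs)
    if n < 5:
        raise ValueError("need at least 5 incomes to form quintile groups")
    median = quantile(xs, 0.5)
    arpt = 0.6 * median                            # at-risk-of-poverty threshold
    arop = 100.0 * sum(x < arpt for x in xs) / n   # % of persons below the threshold
    k = n // 5                                     # size of one quintile group
    qsr = sum(xs[-k:]) / sum(xs[:k])               # S80/S20
    result = {"AROP": arop, "QSR": qsr, "ARPT": arpt, "MEDIAN": median}
    for d in (10, 30, 70, 90):
        result[f"D{d}"] = quantile(xs, d / 100)
    return result


# Hypothetical incomes, in thousands of euro.
example = income_indicators([4, 8, 10, 12, 14, 16, 18, 20, 30, 68])
print(example)
```

In this toy sample the median is 15, so the threshold is 9 and two of the ten persons fall below it.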
In general terms the flash estimates on income distribution and poverty have the following characteristics:
− Refer to a past yearly reference period (year N) and they should be available in June N+1.
− Refer to a set of distributional indicators. This implies the use of models that allow the estimation of the entire distribution and capture the complex interaction of a large number of various past and present events, such as shifts in macroeconomic circumstances and the effects of social policies.
− Are based on an information set that includes the latest income data available from EU-SILC (income N-1, or even N-2), which is the reference data source for EU poverty statistics. This is enhanced with more timely auxiliary information from the reference period (year N), such as the Labour Force Survey (LFS), National Accounts, etc.
− Are based on a set of statistical techniques, such as calibration, modelling and extrapolation, that differ substantially from the regular production process of EU-SILC and are not traditionally used in the calculation of social statistics indicators. Several methods have been tested, ranging from microsimulation techniques and dynamic factor models to more naïve forecasting techniques.
The focus of this paper is on the quality assessment framework (QAF) put in place in order to:
1) Select the best model(s) to be used for the production of flash estimates at European level. The model selection algorithms need to take several aspects into account: the historical empirical performance in terms of model out-of-sample errors (based on the simulation of a time series for the past), the uncertainty interval around the point estimate, the ability of the model to capture the inter-dynamics between indicators coming from the same distribution (coherence across indicators), as well as other "softer" criteria such as model interpretability, accessibility and clarity, and relevance for the users (e.g. microsimulation makes it possible to link expected changes to specific policies). The system put in place is a self-learning mechanism: at each evaluation period, the best model for each country is selected from a chosen subset of competing models.
2) Assess the quality of the final estimates in view of publication. The QAF aims to provide a common platform for assessing Eurostat and national estimates. The first experimental data were produced for Flash Estimates 2015 and discussed with the main users and Member States. One proposal for the publication of the figures in 2017 is to provide, at this experimental stage, only a magnitude-direction scale for the expected change rather than a point estimate. This would take into account the uncertainty around the point estimate arising from both the model and the sampling variance.
The paper is structured in three main parts. First, we describe the methodology, including the main nowcasting techniques, the performance metrics and the selection algorithms. Second, we present the results of two method selection strategies in the context of flash estimates on income indicators. Third, we conclude with further considerations on improving the framework, including criteria that go beyond empirical performance. Economic interpretation, clarity for the users and coherence of the indicators that characterise the distribution need to be considered if the flash estimates are to enter the toolkit of policy makers in the social field.
2.1. Methods used for nowcasting
Two main methodological approaches have been tested in the framework of the nowcasting exercise: (1) microsimulation techniques; (2) econometric modelling of the quantiles of the income distribution through the so-called parametric quantile method (PQM). In addition, a third approach consists of basic models used as benchmarks in the estimation.
The first approach is in line with current practices in several Member States and aims to micro-simulate income changes at individual/household level within EU-SILC microdata. The EU tax-benefit microsimulation model EUROMOD is used for this purpose, in combination with timelier macro-level statistics on changes in demographics, employment characteristics and income. For the purposes of the nowcasting exercise, standard EUROMOD policy simulation routines are enhanced with additional adjustments to the input data to take into account changes in the population structure and labour market characteristics. Microsimulation models are very powerful in accounting for the effect of changes in tax-benefit policies, which gives this family of models high explanatory power. Moreover, they make it possible to disentangle the effects on specific socio-economic groups, or to differentiate between policy effects and other factors such as changes in the population structure. However, several assumptions might affect the ability of the microsimulation model to predict actual changes in observed data. For a limited number of countries there are simulations of tax evasion and non-take-up of benefits, but in the large majority of cases EUROMOD does not include behavioural effects due to policy changes. For more details, see Leulescu et al. (2016) and Rastrigina et al. (2016).
The parametric quantile model (PQM) family relies on macro-level information provided by National Accounts figures such as GDP, consumption, and wages and salaries. It aims at producing flash estimates of a set of distributional parameters, which are assumed to encode the entire information contained in the income distribution, rather than of the income distribution itself. The estimate of the distribution is derived in a second step of the estimation procedure, through a mapping algorithm that reconstitutes the distribution from the set of parameters. The flash estimates of the distributional parameters are obtained on the basis of dynamic factor models. Hence, PQM models are designed to capture the impact of changes in macroeconomic circumstances on the income indicators. They capture changes in social and economic policies only if these changes are reflected in the National Accounts figures.
The aim of considering several approaches was to develop a toolkit of methods (models) that can be used in all countries and under different macroeconomic conditions. Ideally, one method would outperform the others, so that a common harmonised approach could be adopted. However, the three approaches differ in terms of model complexity, the way current information is used in the production of the flash estimates, their predictive power and their interpretability. During the consultation with our main users, DG and Member States, they expressed a preference for the microsimulation models for two reasons: 1) these models reconstruct the whole distribution from which the indicators are estimated, which ensures a certain degree of coherence across indicators; 2) they make it possible to answer questions about the plausibility of the estimate by linking changes in the income distribution to changes in policies. However, the most comprehensive model might not be the best choice in all countries and can have lower predictive power than simpler models. For the production of flash estimates 2016, the model selection will be based on microsimulation models and a simpler time-series model that better takes into account the need for simplicity and interpretability expressed by the users.
2.2. Performance measures
The main criterion used for quality assessment in this paper is the historical performance of the model used for calculating the flash estimates (FE), derived from the out-of-sample retropredicted estimates (three consecutive out-of-sample estimates for the years 2012-2014).
For illustration, we use two types of measures in this paper: one metric (derived from the actual values of the observation and of the nowcasting estimate), the other qualitative (derived from recoding the actual values of both the observation and the estimate into discrete, data-driven categories). Further criteria are being considered in the QAF, based on consultations with the main users and Member States.
2.2.1. Metric measures
The metric measures compare the observed (YoY.OBS) and the nowcast (YoY.EST) year-on-year percentage change in the target indicator. The nowcasting error is
$$ e_y = \mathrm{YoY.EST}_y - \mathrm{YoY.OBS}_y $$
where YoY.OBS_y and YoY.EST_y are derived from OBS_y, the observed (SILC) value of the indicator in income year y, and EST_y, the out-of-sample nowcasting estimate of OBS_y.
We have used two measures:
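(1) Mean Absolute Error (MAE), as an expression of a linear, symmetric loss function:
$$ \mathrm{MAE} = \frac{1}{n} \sum_{y} \left| e_y \right| $$
(2) Root Mean Square Error (RMSE), also an expression of a symmetric loss function, but a quadratic one, and therefore more sensitive to outliers than MAE:
$$ \mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{y} e_y^{2} } $$
where e_y is the nowcasting error in year y and the sums run over the n out-of-sample years.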
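The two metric measures can be sketched in Python as follows. The sketch assumes that both year-on-year changes are computed relative to the previous year's observed value, which is one plausible reading and not necessarily the convention used in production. The function names and the indicator values are hypothetical.

```python
from math import sqrt


def yoy(current, previous):
    """Year-on-year percentage change."""
    return 100.0 * (current - previous) / previous


def nowcast_errors(obs, est):
    """Error per year: YoY.EST minus YoY.OBS.

    obs maps year -> observed value, est maps year -> out-of-sample estimate.
    Assumption: both changes are taken relative to the observed value of y-1.
    """
    errors = {}
    for y in sorted(est):
        if y in obs and (y - 1) in obs:
            errors[y] = yoy(est[y], obs[y - 1]) - yoy(obs[y], obs[y - 1])
    return errors


def mae(errors):
    """Mean absolute error: linear, symmetric loss."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean square error: quadratic loss, more sensitive to outliers."""
    return sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical indicator values for one country.
obs = {2011: 100.0, 2012: 102.0, 2013: 101.0, 2014: 104.0}
est = {2012: 101.0, 2013: 103.0, 2014: 104.5}
errs = nowcast_errors(obs, est)
print(errs, mae(list(errs.values())), rmse(list(errs.values())))
```

Because the quadratic loss penalises large errors more heavily, RMSE is never smaller than MAE on the same set of errors.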
2.2.2. Qualitative measures
The qualitative performance measures focus on identifying the classification errors determined by comparing the observed and estimated YoY changes in the indicator, both expressed on the Significance-Direction (SD) scale.
The SD classes are defined based on the country- and year-specific standard deviation of the YoY change (StDdev). They are meant to take into account the statistical uncertainty of the target indicators and to disentangle it from the model error. In this context of decision making under uncertainty, model selection algorithms adopt a relativistic viewpoint by aiming to find the best model from a pre-defined set of candidate models.
Significance-Direction (SD) scale: observed class (OBS, rows) against estimated class (EST, columns)
OBS class / Definition / Symbol / EST [-] / EST [Ø] / EST [+]
Significant decrease / YoY < -2*StDdev / [-] / OK / SE / DE
Quasi-stable (non-significant change) / abs(YoY) ≤ 2*StDdev / [Ø] / SE / OK / SE
Significant increase / YoY > 2*StDdev / [+] / DE / SE / OK
OK = correct estimate; SE = Significance Error; DE = Direction Error
A Significance Error (SE) occurs when the YoY change derived from observed SILC data (YoY.OBS) and the one calculated from the flash estimates (YoY.EST) have different significance levels, e.g. one is non-significant but the other is significant.
We have further differentiated between two cases of significance errors:
(1) Sampling Significance Error, when YoY.EST is in the confidence interval of YoY.OBS, meaning that this is a value we could have obtained from the same population by drawing a different sample;
(2) Model Significance Error, when YoY.EST is outside the confidence interval of YoY.OBS; the error is most likely due to the estimation procedure.
A Direction Error (DE) occurs when both YoY.OBS and YoY.EST are significant but point in different directions, e.g. one indicates a significant increase while the other indicates a significant decrease.
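The classification rules above can be sketched in Python as follows. The sketch assumes that the confidence interval of YoY.OBS is YoY.OBS ± 2*StDdev, mirroring the significance threshold of the SD scale; the interval actually used may differ. The function names and labels are hypothetical.

```python
def sd_class(yoy, std):
    """Significance-Direction class of a YoY change: '-', 'Ø' or '+'."""
    if yoy < -2 * std:
        return "-"
    if yoy > 2 * std:
        return "+"
    return "Ø"


def classify(yoy_obs, yoy_est, std):
    """Compare observed and estimated YoY changes on the SD scale.

    Returns 'OK', 'DE', 'SE-sampling' or 'SE-model'.
    Assumption: the confidence interval of YoY.OBS is yoy_obs +/- 2*std.
    """
    obs_c = sd_class(yoy_obs, std)
    est_c = sd_class(yoy_est, std)
    if obs_c == est_c:
        return "OK"
    if obs_c != "Ø" and est_c != "Ø":
        return "DE"  # both significant, opposite directions
    # Significance error: exactly one of the two changes is significant.
    inside_ci = abs(yoy_est - yoy_obs) <= 2 * std
    return "SE-sampling" if inside_ci else "SE-model"


# Hypothetical values, with a standard deviation of 1 percentage point.
for pair in [(0.5, 0.3), (-3.0, 2.5), (1.5, 2.5), (0.0, 4.5)]:
    print(pair, classify(pair[0], pair[1], 1.0))
```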
2.3. Model selection algorithms
Model selection algorithms based strictly on empirical performance take the form of mathematical optimisation problems, which consist in choosing the optimal element from a set of available alternatives with regard to a specific performance metric. If the objective is to construct nowcasts with maximum predictive power, the optimality criterion takes the form of a performance metric such as those mentioned above.
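The optimal model is the candidate model with the lowest historical error, measured as the mean absolute error (MAE) of the year-on-year change of the target indicators:
$$ m^{*} = \arg\min_{m} \mathrm{MAE}_m, \qquad \mathrm{MAE}_m = \frac{1}{|T|\,|I|} \sum_{t \in T} \sum_{i \in I} \left| \mathrm{YoY.EST}_{m,i,t} - \mathrm{YoY.OBS}_{i,t} \right| $$
where T is the set of evaluation years, t = 2012, 2013, and I is the set of target indicators, i = AROP, QSR, ARPT, D10, D30, MEDIAN, D70, D90.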
Based on this, three different model selection algorithms are considered to estimate the target indicators:
1) Best by country (bbc): for each country, the model with the lowest MAE calculated over all target indicators;
2) Best by country and family of indicators (bbcfi):
a) Best by country for inequality indicators (bbc1): for each country, the model with the lowest MAE calculated for i = AROP, QSR; and
b) Best by country for positional indicators (bbc2): for each country, the model with the lowest MAE calculated for i = ARPT, D10, D30, MEDIAN, D70, D90.
The advantage of option 1 is that there is only one model per country, so that all estimates of the indicators come from the same distribution. The disadvantage is that different indicators seem to follow different growth patterns, so a 'best by country' model may be a sub-optimal choice for the inequality indicators. For that reason, we also test the 'best by country and family of indicators' approach.
The MAE is based on the historical performance of the flash estimates 2012-13. These approaches are tested for the model selection for flash estimates 2014.
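The three selection rules can be sketched in Python for a single country as follows. The sketch assumes that the MAE pools the absolute YoY errors over the chosen indicators and the evaluation years with equal weight. The data layout, the function names and the error values are hypothetical.

```python
INEQUALITY = ("AROP", "QSR")
POSITIONAL = ("ARPT", "D10", "D30", "MEDIAN", "D70", "D90")


def pooled_mae(errors_by_indicator, indicators):
    """MAE over the given indicators and all evaluation years (equal weights)."""
    values = [abs(e) for i in indicators for e in errors_by_indicator[i]]
    return sum(values) / len(values)


def best_model(errors_by_model, indicators):
    """Model with the lowest pooled MAE; ties are broken alphabetically."""
    return min(sorted(errors_by_model),
               key=lambda m: pooled_mae(errors_by_model[m], indicators))


def select_models(errors_by_model):
    """Apply the bbc, bbc1 and bbc2 rules to one country."""
    return {
        "bbc": best_model(errors_by_model, INEQUALITY + POSITIONAL),
        "bbc1": best_model(errors_by_model, INEQUALITY),
        "bbc2": best_model(errors_by_model, POSITIONAL),
    }


# Hypothetical YoY errors for 2012 and 2013, for one country and two models.
errors = {
    "P": {"AROP": [0.5, -0.4], "QSR": [0.6, 0.5],
          **{i: [2.0, -2.0] for i in POSITIONAL}},
    "MS": {"AROP": [1.0, -1.0], "QSR": [1.0, 1.0],
           **{i: [1.0, 1.5] for i in POSITIONAL}},
}
print(select_models(errors))
```

In this toy example the two families point to different models, which is the situation that motivates the bbcfi rules.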
2.3.1. Argument in favour of using different models for each family of indicators
The two families of indicators have very different dynamics. Figure 1 illustrates this: the inequality indicators (AROP and QSR) are mostly quasi-stable, whereas the positional indicators display much more significant variation, mostly upwards. These are year-on-year changes which, based on the standard deviation of SILC, can be grouped into three categories: [-] = decrease; [Ø] = minor change; [+] = increase.
Figure 1: Dynamics of indicators based on year on year changes
The differences are even more visible when data is aggregated by indicator.
Figure 2: % YoY changes across years
We should also note the different dynamics of the quintiles: the upper two are slightly less likely to decrease significantly and slightly more likely to increase; the median is also slightly more "active" than the rest.
Table 1 presents the approach selected, either 'P' (PQM) or 'MS' (microsimulation), for each country and model selection algorithm. The most common approach for bbc1 is 'P', which is selected for 18 countries in 2014, whereas 'MS' is the most common approach for bbc2, also with 18 countries. The distribution of approaches for bbc is balanced.
Table 1. Selected approach by model selection algorithm
Country / bbc / bbc1 / bbc2 / bbc vs bbc1 / bbc vs bbc2
AT / P / MS / P / x / =
BE / MS / P / MS / x / =
BG / MS / P / MS / x / =
CY / MS / P / MS / x / =
CZ / P / P / MS / x / =
DE / P / P / P / = / x
DK / MS / MS / MS / x / =
EE / MS / MS / MS / = / =
EL / P / P / MS / x / x
ES / P / P / P / x / =
FI / MS / MS / MS / = / =
FR / P / MS / P / x / =
HR / MS / MS / MS / = / =
HU / MS / P / MS / x / =
IT / MS / P / MS / x / =
LT / MS / MS / MS / = / =
LU / P / P / MS / x / x
LV / MS / MS / MS / x / =
MT / MS / MS / MS / = / =
NL / MS / MS / MS / = / x
PL / MS / P / MS / x / x
PT / P / P / P / = / =
RO / P / P / P / x / x
SE / P / P / MS / x / x
SI / P / P / P / x / x
SK / P / P / P / x / =
UK / P / P / P / = / x
The charts below present the proportion and type of errors made (in terms of significance and direction), based on the simulated flash estimates 2012-14.
Figure 3: Historical performance in terms of significance-direction error
We are more likely to estimate correctly the inequality indicators when they are quasi-stable, and the positional indicators when they increase significantly.
We are more likely to make direction errors when the indicators decrease, and the further we move towards the right-hand end of the distribution.
3.1.1. Performance of the two models
The table below presents the proportion and type of errors made when fitting the model to the data for 2012-13; it also includes the performance of the 2014 estimate produced by the model selected on the basis of the average performance in 2012-13.
Table 2: Performance in the target year for the two models selected (bbc/bbcfi )
When comparing the two method selection algorithms (BBC and BBCFI) over all three years, they generally give a similar number of errors; the exceptions are AROP, where BBCFI is slightly better, and D10, where the situation is reversed.
If we focus on 2014, BBC and BBCFI are virtually identical for the inequality indicators; BBC is slightly better for the positional indicators that describe the left-hand side of the distribution (D10, D30, MEDIAN), but has no lead for D70 and D90.
Table 3. Performance in the target year (2014)
4. Discussion and further developments
The current model selection developed in Eurostat is still under testing and includes several metrics that take into account both the model and the statistical uncertainty of the target indicators. However, given the various sources of uncertainty and the short time series, the model selection algorithms are unstable: they result in different performances by country and indicator, and larger changes are more difficult to capture. It also seems that some parts of the distribution are more difficult to estimate, which makes it harder to give a comprehensive message about the dynamics at different points of the distribution. In this context of decision making under uncertainty, the model selection will take into account two further empirical elements: the prediction interval and plausibility. Current work focuses on these two elements:
- the use of bootstrapping procedures to build a prediction interval that takes into account the two sources of uncertainty;
- the analysis of trends and of the links between the expected variation in the indicator and variations in macro variables or in policies, for the plausibility analysis.
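As a rough illustration of the first element, the Python sketch below builds a bootstrap prediction interval for a YoY flash estimate by combining the two sources of uncertainty. It is not the procedure under development. It assumes that model uncertainty can be represented by resampling historical nowcasting errors (defined as YoY.EST minus YoY.OBS) and that sampling uncertainty is normal with a known standard deviation. All names and values are hypothetical.

```python
import random


def bootstrap_interval(point_yoy, model_errors, sampling_std,
                       n_draws=5000, level=0.95, seed=0):
    """Percentile prediction interval for the observed YoY change.

    Each draw subtracts a resampled historical model error from the point
    estimate (model uncertainty) and adds normal noise with the sampling
    standard deviation (sample variance). Both choices are assumptions.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        model_part = rng.choice(model_errors)
        sampling_part = rng.gauss(0.0, sampling_std)
        draws.append(point_yoy - model_part + sampling_part)
    draws.sort()
    lower = draws[int((1 - level) / 2 * (n_draws - 1))]
    upper = draws[int((1 + level) / 2 * (n_draws - 1))]
    return lower, upper


# Hypothetical flash estimate of +1.8% with three historical errors.
print(bootstrap_interval(1.8, [-1.0, 1.96, 0.5], sampling_std=0.8))
```

With only three out-of-sample years, the resampled error distribution is very coarse, so an interval built this way should be read as indicative at best.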
Beyond empirical performance, and based on the consultation of the users, we consider other important quality criteria. These relate to the advantages of microsimulation in building counterfactual scenarios and linking the expected change to specific policies, and to the use of methodologies that are more straightforward and easier to communicate and explain. Therefore, the macro-level models are being revised to make them interpretable from an economic point of view. While in a first phase, based on a strict empirical performance assessment, we used a linear combination of a subset of models for the model selection, we are currently focused on choosing a best-by-country model.