Title: Guidelines on detecting and treating outliers for the S-DWH
WP: 2
Deliverable: 2.8
Version: 1.0
Date: 27-2-2013
Author: Gary Brown
NSI: ONS (UK)
ESSnet on Micro Data Linking and Data Warehousing in Production of Business Statistics
1. Introduction
The European Statistical System Network project on Micro Data Linking and Data Warehousing (ESSnet DWH) is a joint venture between seven countries: Estonia, Italy, Lithuania, Netherlands, Portugal, Sweden and the UK.
From October 2009 to October 2011, the first stage of the ESSnet DWH gathered data on use of data warehouses and integration between data warehouses and business registers across Europe. From October 2011 to October 2013, the second stage of the ESSnet DWH aims to identify best practice and write guidelines. The second stage is split into three separate work packages (WPs): WP1 on Metadata (led by Estonia); WP2 on Methodology (led by the UK); WP3 on Architecture (led by Italy).
This report constitutes one of the eight deliverables of WP2:
2.1 Metadata
2.2 Business Register
2.3 Architecture
2.4 Data Linking
2.5 Confidentiality
2.6 Selective Editing
2.7 Mapping
2.8 Outliers
2. Definition of an outlier
An outlier is defined in the Eurostat “Statistics Explained Glossary” and the Organisation for Economic Co-operation and Development (OECD) “Glossary of Statistical Terms” as:
“A data value that lies in the tail of the statistical distribution of a set of data values”
The UK Office for National Statistics “ONS Glossary” defines an outlier as:
“A correct response, usually an extreme value isolated from the bulk of the responses,
or has a large sample weight that would have an undue influence on the estimate”
The differences between these two definitions illustrate that whether a data value is an outlier depends on the context. In the context of a data warehouse there are three types of outliers – outliers in survey data, outliers in administrative data, and outliers in modelling. Each of these types is explained below, but first a clear distinction is made between outliers and errors.
2.1 Outliers and errors
Making a clear distinction between an outlier and an error is vital. Errors are erroneous values, and are dealt with via editing rules. The OECD Glossary defines an edit rule as:
“A logical condition or a restriction which must be met if the data is to be considered correct”
whereas the ONS Glossary definition is:
“A rule designed to detect specific errors in data for potential subsequent correction”
The common theme here is the requirement ‘to correct’ – errors are assumed to be incorrect, outliers are assumed to be correct. This assumption is the foundation for the rest of this report.
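To make the distinction concrete, the sketch below illustrates edit rules as logical conditions in the sense of the definitions above. It is a minimal illustration only: the field names and rules are assumed, not taken from any particular collection.

```python
# Minimal illustration of edit rules as logical conditions (field names and
# rules are assumed). A record failing a rule is flagged as a potential error,
# to be corrected - unlike an outlier, which is assumed to be correct.

def turnover_non_negative(record: dict) -> bool:
    """Edit rule: reported turnover must be non-negative to be considered correct."""
    return record["turnover"] >= 0

def employment_consistent(record: dict) -> bool:
    """Edit rule: full-time plus part-time staff must not exceed total employment."""
    return record["full_time"] + record["part_time"] <= record["total_employment"]

record = {"turnover": -1200, "full_time": 8, "part_time": 3, "total_employment": 12}

failed = [rule.__name__ for rule in (turnover_non_negative, employment_consistent)
          if not rule(record)]
print("Failed edit rules:", failed)   # ['turnover_non_negative']
```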
2.2 Outliers in survey data
In the survey context, an outlier is an unrepresentative value.
A survey selects a sample of a target population, collects information from this sample, and uses weighting to estimate population values. The quality of the population estimates depends on the representativeness of the sample – ie whether it includes all types of respondents found in the population; the size of the sample – ie whether enough units are sampled for an accurate estimate; and the appropriateness of the weighting – ie whether the population weighting represents the unselected population. Common sampling methods are: systematic random sampling for individuals; probability proportional to size for households; simple random sampling for businesses.
Using simple random sampling as an example, an outlier occurs when a sampled unit is assumed to represent N/n population units – where N is the number of units of this type assumed to be in the population, n is the number sampled, and N/n is large – but the sampled unit is actually unique (or nearly unique) in the population.
This type of non-representative outlier “would have an undue influence on the estimate” (ONS Glossary). A “data value that lies in the tail” (OECD Glossary) will only be an outlier in the survey context if the population weight is different from 1.
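As a minimal numerical sketch of this (all values assumed for illustration), the snippet below shows how one genuinely unique value, weighted up by N/n under simple random sampling, can dominate an expansion estimate of a population total.

```python
# Minimal sketch (illustrative values): how one unrepresentative sampled unit
# with weight N/n can dominate a simple expansion estimate of a population total.

N = 1000          # assumed population size
n = 50            # assumed sample size
weight = N / n    # each sampled unit represents N/n population units

typical_sample = [100.0] * (n - 1)      # typical returned values
unique_unit = 50_000.0                  # a genuine but (nearly) unique value

estimate_without = weight * sum(typical_sample + [100.0])
estimate_with = weight * sum(typical_sample + [unique_unit])

print(f"Estimate if the last unit were typical: {estimate_without:,.0f}")
print(f"Estimate when the unique unit is weighted up by {weight:.0f}: {estimate_with:,.0f}")
# The unique unit contributes weight * 50,000 = 1,000,000 on its own,
# ie it is assumed to represent 19 similar unsampled units that do not exist.
```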
2.3 Outliers in administrative data
In the context of administrative data, an outlier is an atypical value.
Administrative data represent a census of a target population, so weighting to estimate population values is not required. Each unit is treated as unique in the population.
This type of outlier is simply extreme: a “data value that lies in the tail” (OECD Glossary); and “an extreme value isolated from the bulk of the responses” (ONS Glossary). An outlier in administrative data would not be treated, as it would be assumed to be a genuine, correct value.
2.4 Outliers in modelling[1]
In the context of modelling, an outlier is an influential value.
In the ONS Glossary, influence is defined as:
“The amount of effect a particular point has on the parameters of a regression equation”
An atypical “data value that lies in the tail” (OECD Glossary) is nearly always an outlier in modelling. For example, Figure 1 shows an outlying value on the horizontal ‘x’ axis – note the same unit is not an outlier on the vertical ‘y’ axis. Figure 2 shows the same dataset with an OLS[2] regression line fitted. Whether the outlier is influential (on the regression line) depends on which regression line (Figure 3) would be fitted if the outlier were removed: red = no, green = yes. The reality is likely to be less extreme – red is unlikely by chance, and green is unlikely if the outlier is not an error, as it should exhibit the same relationship between y and x as the other points. The expected influence of the outlier comes down to its high ‘leverage’ on the regression line, which increases with the distance of its x value from the centre of the x values.
Figure 1: an outlier in x
Figure 2: regression line with outlier
Figure 3: regression line without outlier
If WLS[3] regression was fitted to the dataset in Figure 1, then even data values in the cloud with “a large sample weight” (ONS Glossary) could be influential.
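The sketch below illustrates leverage and influence on simulated data (all values assumed): it fits OLS to a cloud of points plus one value outlying in x, computes hat values and Cook’s distance (Cook, 1977), and refits without the outlier to show the ‘red’ case in which the outlier follows the same y–x relationship as the other points.

```python
# Minimal sketch with simulated data: how an outlying x value gains leverage on an
# OLS fit, measured by hat values and Cook's distance (Cook, 1977). numpy only.

import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.uniform(0, 10, 30), 40.0)       # cloud of points plus one outlier in x
y = 2.0 + 0.5 * x + rng.normal(0, 1, 31)          # same y-x relationship for all points

X = np.column_stack([np.ones_like(x), x])         # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # OLS coefficients
resid = y - X @ beta
p, n = X.shape[1], len(y)
s2 = resid @ resid / (n - p)                      # residual variance estimate

H = X @ np.linalg.inv(X.T @ X) @ X.T              # hat matrix
h = np.diag(H)                                    # leverage of each point
cooks = (resid**2 / (p * s2)) * h / (1 - h)**2    # Cook's distance

print("Leverage of the outlying point:", round(h[-1], 3))
print("Cook's distance of the outlying point:", round(cooks[-1], 3))

# Refit without the outlier: if its y follows the same relationship, the slope
# barely moves (the 'red' case); a broken relationship would move it ('green').
beta_wo, *_ = np.linalg.lstsq(X[:-1], y[:-1], rcond=None)
print("Slope with outlier:", round(beta[1], 3), "without:", round(beta_wo[1], 3))
```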
3. Identification of outliers in a data warehouse
A data warehouse stores survey and administrative data for repeated use in estimation and modelling. The data may not be pre-cleaned before storage, so editing might be required as part of the pre-processing of data. Similarly, the data may not be complete, so imputation of missing data may also be required as part of pre-processing. However, both editing and imputation will be complete before outliers are identified and dealt with, so whether these processes take place inside or outside the data warehouse is immaterial for this report.
Before outliers are discussed, units in a data warehouse need to be defined:
· each unit represents an individual (a person or a business) or an aggregate (a household, an enterprise group, or a domain)
· each unit has multiple data fields (variables) linked to metadata fields (for example, dates)
The permutations are challenging:
· each unit answers Q survey questions for P periods and has A administrative records for T periods – ie QP + AT entries
· each entry can be used in estimation for M domains at N periods – ie (QP + AT)MN uses
· for modelling, any combination of fields can be used – ie QPATM combinations
As any entry could be an outlier, in the context of its use in estimation, and any combination of fields may result in outliers influential in modelling, it is impossible to attach a meaningful outlier designation to any unit. The only statement that can be made with certainty is:
every unit in a data warehouse is a potential outlier
It is not even possible to attach an outlier designation to any entry – as it would have to record the use – ie the domain and period for estimation, and the fields combined and model used.
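To illustrate the scale of these counts, the snippet below evaluates the expressions above for assumed illustrative values of Q, P, A, T, M and N; the specific numbers carry no significance.

```python
# Illustrative counts only (all values assumed): the number of contexts in which a
# single unit's data could be judged an outlier, following the expressions above.

Q, P = 20, 12      # survey questions and survey periods
A, T = 5, 36       # administrative records and administrative periods
M, N = 10, 12      # estimation domains and estimation periods

entries = Q * P + A * T            # data entries held for one unit
uses = entries * M * N             # uses of those entries in estimation
combinations = Q * P * A * T * M   # field combinations available for modelling

print(f"Entries per unit: {entries}")                # 420
print(f"Estimation uses per unit: {uses:,}")         # 50,400
print(f"Modelling combinations: {combinations:,}")   # 432,000
```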
Given that neither the units in a data warehouse, nor the specific entries of units, can be identified as outliers per se, identification is context-dependent. This means that outliers are identified during processing of source data, and reported as a quality indicator of the output – if the output itself is stored in the data warehouse, the outliers will become part of the metadata accompanying the output, but will not be identified as outliers at source. The expected uses of a data warehouse and proposed identification methods are listed in table 1.
Table 1: Identification methods for outliers in a data warehouse
Data use / Outlier identification methods
Processing (edit rules and imputation models) / Plotting/sorting/testing/setting cut-offs
Examples
· checking observed against expected numbers of edit failures of a rule
· sorting ratios to identify extremes when calculating an imputation link
Updating the business register / Edit rules
Detail
Business registers are updated from multiple sources – for example surveys and administrative data. Priority rules decide between sources – for example, use administrative data unless it is 3 years older than the survey data. If the difference between sources is above a limit (an edit rule), an outlier is identified.
Estimation of survey variables / One-sided winsorisation
Detail
Returned survey values $y_i > K_i$ are identified as outliers:
$$K_i = \mu_i + \frac{M}{w_i - 1}$$
where:
· $\mu_i$ is the fitted value for $y_i$ – often the stratum mean
· $w_i$ is the population weight for $y_i$
· $M$ minimises the mean squared error – based on previous survey data
Estimating calibration weights / Set an acceptable range for calibration weights (Hedlin et al, 2001)
Detail
Any calibration weights outside the acceptable range are identified as ‘outliers’.
Modelling the relationship between survey and administrative variables / Plotting/sorting/setting cut-offs
Examples
· Cook’s distance (Cook, 1977) calculates the influence of each datum on regression parameters – plotting these (against index) identifies outliers
· Least trimmed squares (Rousseeuw, 1984) sums the smallest k squared residuals – a sudden increase in the sum identifies the (n-k) largest as outliers
Estimating missing survey variables from administrative variables / One-sided winsorisation (see above) of the estimated variables
Detail
Treat the model-based survey estimates as returned survey values (as above)
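As an illustration of the one-sided winsorisation identification in Table 1, the sketch below applies the threshold $K_i = \mu_i + M/(w_i - 1)$ to assumed stratum data; the returned values, weights and the constant M are all invented for illustration, whereas in practice M would be chosen to minimise the mean squared error using previous survey data.

```python
# Minimal sketch of one-sided winsorisation identification from Table 1.
# Values of y, w and M are assumed for illustration; in practice M is tuned to
# minimise the mean squared error using previous survey data.

import numpy as np

y = np.array([110.0, 95.0, 130.0, 105.0, 4000.0])   # returned survey values in a stratum
w = np.array([20.0, 20.0, 20.0, 20.0, 20.0])        # population (design) weights
M = 500.0                                           # assumed tuning constant

mu = np.full_like(y, y.mean())        # fitted value mu_i taken as the stratum mean
K = mu + M / (w - 1)                  # threshold K_i = mu_i + M / (w_i - 1)

outliers = y > K                      # values above the threshold are outliers
print("Thresholds K_i:", np.round(K, 1))
print("Identified outliers:", y[outliers])
```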
4. Treatment of outliers in a data warehouse
All identification methods fundamentally do the same thing: they set a limit above which any value is an outlier. Once identified, outliers need to be treated to prevent distortion of estimates/models.
Treatment fundamentally consists of weight adjustment:
· an adjustment to 0 percent (of original) equates to deleting the outlier
· an adjustment to P percent (of original), where 0 < P < 100, equates to reducing the impact of the outlier
· an adjustment to 100 percent (of original) equates to no adjustment of the outlier
Treatment will reduce variance but introduce bias, so the challenge is to find the P that minimises the mean squared error. Table 2 lists methods against expected uses of a data warehouse.
Table 2: Treatment methods for identified outliers in a data warehouse
Data use / Outlier treatment methods
Processing (edit rules and imputation models) / Delete outliers, impute or robustify rules or models (see examples)
Examples
· Use medians rather than means
· Use quartiles/quantiles rather than maxima/minima
· Use ratio of means rather than mean of ratios
Updating the business register / Decide which source is an outlier, and update the register based on the other
Examples
· Check coherence of both sources with similar units on register
· Recontact data sources (if possible) to gather extra information
· Prioritise future collection of repeat data from survey source
Estimation of survey variables / Winsorisation automatically replaces an outlier $y_i$ by $\frac{y_i + (w_i - 1)K_i}{w_i}$
Estimating calibration weights / Restrict outlying calibration weights to the nearest limit of the acceptable range – the upper limit if they are above it, the lower limit if they are below it
Modelling the relationship between survey and administrative variables / Exclude outliers from modelling process (ie deletion)
Estimating missing survey variables from administrative variables / Winsorisation (as above)
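Continuing the identification sketch after Table 1, the snippet below applies the treatment step from Table 2, replacing each identified outlier $y_i$ by $(y_i + (w_i - 1)K_i)/w_i$; all values are assumed for illustration.

```python
# Minimal sketch of the winsorisation treatment from Table 2, continuing the
# identification example after Table 1 (all values assumed for illustration).

import numpy as np

y = np.array([110.0, 95.0, 130.0, 105.0, 4000.0])   # returned survey values
w = np.array([20.0, 20.0, 20.0, 20.0, 20.0])        # population weights
K = np.array([914.3, 914.3, 914.3, 914.3, 914.3])   # thresholds from the identification step

# Replace outliers y_i > K_i by (y_i + (w_i - 1) K_i) / w_i; leave other values unchanged.
y_star = np.where(y > K, (y + (w - 1) * K) / w, y)

print("Treated values:", np.round(y_star, 1))
# The estimated total changes from sum(w*y) to sum(w*y_star):
print("Total before:", (w * y).sum(), "after:", round((w * y_star).sum(), 1))
```

Because the replacement equals $K_i + (y_i - K_i)/w_i$, the excess above the threshold effectively receives a weight of 1 rather than $w_i$, which is the weight-adjustment view of treatment described at the start of this section.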
5. Recommendations
1. Neither data units nor their entries in a data warehouse should be labelled as outliers
2. Identification and treatment of outliers should be specific to each instance in which the data are used
3. Metadata on outliers should only be included in a data warehouse alongside outputs
4. Refer to tables 1 and 2 for guidance on use-specific identification and treatment of outliers
References
Cook, R. D. (1977). Detection of Influential Observations in Linear Regression. Technometrics, 19, 15-18.
Hedlin, D., Falvey, H., Chambers, R. and Kokic, P. (2001). Does the Model Matter for GREG Estimation? A Business Survey Example. Journal of Official Statistics, 17, 527-544.
Rousseeuw, P. J. (1984). Least Median of Squares Regression. Journal of the American Statistical Association, 79, 871-880.
[1] Modelling includes setting processing rules (for example, editing/imputation), as well as statistical modelling
[2] Ordinary Least Squares
[3] Weighted Least Squares