EUROPEAN COMMISSION
EUROSTAT
Directorate B: Quality, methodology and information systems
Unit B-2: Methodology and research

ESTAT/B2/EGSDC(11)1
Doc. 5 – V.2011.03.21
Available in EN only

1st meeting of the
Expert Group on SDC

Luxembourg, 5 and 6 April 2011

BECH building, Room B2/464

Starting: 13:00

Item 5
Anonymisation of CSIS


Anonymisation of CSIS

The method for anonymisation of CSIS has been developed under the framework contract on methodology (the contractor is Sogeti, a private company located in Luxembourg). The author of the anonymisation method is Mr Philip Lowthian from the Office for National Statistics (UK).

The Expert Group is asked to discuss the proposed approach to the anonymisation.


Methodology for Anonymising ISS Individuals and Household Microdata

Introduction

General details on the background to this survey, along with general details on how to protect any released microdata, were presented at the Working Group in March 2010. This included details on intruder scenarios and key variables, while potential disclosure risk was considered by examining the frequencies of combinations of key variables.

Methods of determining disclosure risk in a microdata file were also considered. In brief these were:

·  The ONS approach, based on population uniques and sample uniques and implemented in the SUDA package.

·  The CBS approach, based on sample counts and implemented in Mu-Argus.

·  The ISTAT approach, based on individual and household measures of disclosure risk.

·  The Poisson model / loglinear model, developed by Elamir and Skinner at Southampton University.

Finally, approaches to reducing the number of risky records in a file were documented. At that time there was no formal indication of which method of determining disclosure risk was to be applied, and there was little statistical detail about the methods.

Following on from this, a comparison of approaches to quantifying disclosure risk was made in very general terms. Some theoretical background was given on Mu-Argus, SUDA and the Poisson / loglinear methods, and a scoring system was developed in order to determine the most appropriate method. An approach using Mu-Argus was recommended, most likely using the ISTAT scenario, with a match being given when the combination of key variables is the same in both the sample and the intruder's database.

This is a detailed paper on anonymisation methodology for research microdata. It aims to produce a framework for anonymising the microdata and contains background on the selected method, along with an algorithm for applying it.

Quantifying the overall risk and impact of releasing ISS individuals and household microdata

This is a general section where a non-theoretical attempt is made to quantify the ISS individual and household data in terms of risk and impact if released.

How are Impact and Risk defined?

Low Risk: The record level data is at a high geography level (NUTS1 for example) or contains very few variables and cannot be linked to other individual or business records.

Low Impact: The data values are not sensitive or likely to cause embarrassment if released. For example, these could be the results of a survey on a relatively non-sensitive matter.

High Risk: Record level data at a low geographical level (or a low-level industry classification), containing a large number of variables with detailed information.

High Impact: The data values obtained for each individual or business are very sensitive. For example, there may be detail on family illness or relationships, or a large amount of personal opinion on sensitive topics.

The relationship between risk and impact is shown in the table below.

A crude overall measure for each microdata file could involve determining a value for both risk and impact then multiplying these together to give an overall score for the file.

                          Impact (1 = low … 5 = high)
Risk     1                                        →    5
1        High level, non-sensitive data           →    High level data but containing
↓                                                      sensitive information
5        Low level data containing largely        →    Low level data containing
         non-sensitive records                         sensitive records
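The crude scoring described above can be sketched in a few lines. This is a hypothetical illustration of the idea (rate risk and impact each on a 1–5 scale, then multiply), not part of any formal tool:

```python
# Crude overall score for a microdata file: risk and impact are each
# rated on a 1-5 scale and multiplied together (illustrative sketch).
def file_score(risk: int, impact: int) -> int:
    if not (1 <= risk <= 5 and 1 <= impact <= 5):
        raise ValueError("risk and impact must be rated 1-5")
    return risk * impact

# The ISS file is rated risk = 2 and impact = 2 in the text:
print(file_score(2, 2))  # -> 4
```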

The ISS individual and household survey asks questions concerning internet use by members of the household being surveyed. This includes topics such as internet use for shopping for goods and services. Here the categories include ‘Food or groceries’ and ‘Films, music’ but they are not broken down further.

My opinion is that this survey is of relatively low risk and impact. The data are collected at a high geographical level (NUTS1, or NUTS2 in some cases) and can be regarded as not especially sensitive, as the questions do not probe in detail what has been purchased over the internet or which websites have been visited.

In my view both risk and impact can be given a rating equal to 2 giving an overall score of 4. This is based on the assumption that data are released at NUTS1 level.

The next step is to consider the theoretical approach to disclosure risk.

Background to estimating disclosure risk

The Mu-Argus manual and the ESSNet documentation give a good deal of detail on how Mu-Argus aims to quantify disclosure risk.

Individual Risk and Re-identification

Consider the case where a combination of key variables in a dataset is chosen (for example age x marital status x occupation). A frequency table produced from these variables has K cells, with each individual cell indexed by k.

For the population, the count in cell k equals Fk, and for the sample the count in cell k equals fk.

Summing Fk over all K cells produces the size of the population (= N). Fk is usually unknown and is estimated from the sample.

Summing fk over all K cells produces the size of the sample (= n).

The theory behind the Mu-Argus model assumes the model below for combinations of key variables. The posterior distribution of Fk|fk is a negative binomial with success probability pk and number of successes fk; that is, Fk|fk is a negative binomial variable counting the number of trials up to the fk-th success, each trial having success probability pk.

The individual risk of disclosure is measured as the posterior mean of 1/Fk with respect to the distribution of Fk|fk.

To summarise the above model, NB(fk, pk) is the negative binomial distribution taking integer values {0, 1, 2, …} which counts the number of failures until fk successes are observed, each success having a probability of pk.

The next step is to find an estimate for pk, since the individual risk of disclosure is a function of fk and pk.

From the Negative Binomial distribution, the expected population count in cell k is:

E(Fk | fk) = fk / pk

Fk is unknown and can be estimated from the sample weights: F̂k = Σi∈k wi, the sum of the weights of the sample records falling in cell k.

The estimate of pk is then p̂k = fk / F̂k.

This can be extended to produce an estimate for individual risk for a specified cell k in the frequency table.

The expected value of 1/Fk given the sample data can be calculated analytically (not shown), and it can be shown that the individual risk measure for fk = 1 equals:

r̂k = (p̂k / (1 − p̂k)) log(1 / p̂k)

When fk = 1 there is a single record in the cell defined by the key variables (i.e. a sample unique).

The individual risk measure above is the individual risk of re-identification of a particular record in the sample, i.e. the risk that this record is correctly identified by an intruder using the information in cell k, which is a sample unique with respect to the combination of variables selected.

Values of rk for fk = 2 (sample pairs), fk = 3 (sample triples), etc. can also be calculated by applying an expansion of the hypergeometric function (not shown).
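A minimal sketch of the sample-unique case, assuming the closed form r̂k = (p̂k / (1 − p̂k)) log(1 / p̂k) for fk = 1 and p̂k estimated as fk divided by the sum of the sample weights in the cell; the weight values are hypothetical:

```python
import math

def pk_hat(fk: int, cell_weights: list[float]) -> float:
    """Estimate p_k as f_k / F_k_hat, where F_k_hat is the sum of the
    sample weights of the records falling in cell k."""
    return fk / sum(cell_weights)

def individual_risk_unique(pk: float) -> float:
    """Individual risk for a sample unique (f_k = 1):
    r_k = p_k / (1 - p_k) * ln(1 / p_k)."""
    return pk / (1.0 - pk) * math.log(1.0 / pk)

# Hypothetical sample-unique cell: one record (f_k = 1) whose sample
# weight is 50, so F_k_hat = 50 and p_k_hat = 1/50 = 0.02.
p = pk_hat(1, [50.0])
print(round(individual_risk_unique(p), 4))  # -> 0.0798
```

Note that the risk is well below 1/50 only because a low p̂k makes a population-unique cell unlikely; as p̂k approaches 1 the risk approaches 1.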

The estimate of rk is obtained by replacing pk with its estimate. For the global risk measures τ1 (the number of sample uniques that are population uniques) and τ2 (the expected number of correct matches of sample uniques to the population) the following are obtained:

τ̂1 = Σk∈SU P̂(Fk = 1 | fk = 1),  τ̂2 = Σk∈SU Ê(1/Fk | fk = 1) = Σk∈SU r̂k,

where SU is the set of sample uniques.

Global measures can also be expressed slightly differently by using the expected number of re-identifications and the re-identification rate.

The expected number of re-identifications (ER) can be shown to equal Σfkrk, summed over cells k = 1, …, K.

From this the re-identification rate can be defined as ξ = (1/n)·ER = (1/n) Σfkrk. This is sometimes multiplied by 100 and expressed as the percentage of expected re-identifications.
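The global measures just defined reduce to simple sums over the cells; a minimal sketch with hypothetical cell counts and per-cell risks:

```python
def expected_reidentifications(fk: list[int], rk: list[float]) -> float:
    """ER = sum over cells k of f_k * r_k."""
    return sum(f * r for f, r in zip(fk, rk))

def reidentification_rate(fk: list[int], rk: list[float]) -> float:
    """xi = (1/n) * ER, where n = sum of f_k is the sample size."""
    return expected_reidentifications(fk, rk) / sum(fk)

# Hypothetical three-cell frequency table:
fk = [1, 2, 5]             # sample counts per cell
rk = [0.08, 0.03, 0.01]    # individual risk per cell
print(round(expected_reidentifications(fk, rk), 2))   # -> 0.19
print(round(100 * reidentification_rate(fk, rk), 3))  # percentage form
```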

Threshold and re-identification setting

Members of a cell k all contribute the same number of re-identifications (fkrk) to the re-identification rate defined above. As members of a particular cell k have the same individual risk, the cells can be arranged in ascending order of risk rk.

Typically the risk threshold for an individual is denoted r*. Unsafe cells are those where rk ≥ r*. The cells can be indexed by K* (relative to r*), which distinguishes between safe and unsafe cells.

Define K* such that r(K*) < r* and r(K*+1) ≥ r*.

Unsafe cells are those indexed by (k) = K*+1, …, K. Once protection is applied to the data, cells that are currently unsafe will become safe, i.e. rk < r*.

An upper bound ξ* on the re-identification rate can be obtained by substituting r*f(k) for rkfk for each unsafe cell k in the calculation of ξ.

It is also possible to set a threshold τ on the re-identification rate ξ. This requires a cell index K* to be found that keeps the re-identification rate ξ of the released file below τ; K* can be found by an iterative algorithm.
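One way the iterative search for K* could work is sketched below. This is a hypothetical implementation, not the Mu-Argus algorithm itself: cells are sorted in ascending order of risk, cells above the cut-off are treated as unsafe and assumed to be protected down to the risk of the highest safe cell, and the cut-off is lowered until the bounded re-identification rate falls below the target τ:

```python
def find_kstar(fk: list[int], rk: list[float], tau: float) -> int:
    """Return K*, the number of 'safe' cells when cells are sorted in
    ascending order of risk.  Contributions of unsafe cells are bounded
    by the risk of the highest safe cell; K* is the largest value for
    which the bounded re-identification rate stays below tau.
    Hypothetical sketch of the iterative algorithm."""
    order = sorted(range(len(rk)), key=lambda k: rk[k])  # ascending risk
    n = sum(fk)
    for kstar in range(len(order), -1, -1):
        safe, unsafe = order[:kstar], order[kstar:]
        bound = rk[order[kstar - 1]] if kstar > 0 else 0.0
        xi = (sum(fk[k] * rk[k] for k in safe)
              + bound * sum(fk[k] for k in unsafe)) / n
        if xi < tau:
            return kstar
    return 0

# Hypothetical cells: with tau = 0.02 one cell must be treated as unsafe,
# leaving K* = 2 safe cells.
print(find_kstar([1, 2, 5], [0.08, 0.03, 0.01], tau=0.02))  # -> 2
```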

Household Risk

Household risk is defined as the probability that at least one individual in the household is re-identified.

As all the individuals in a household have the same household risk (but not necessarily the same individual risk), the expected number of re-identified records in household g equals |g|rgh, where |g| is the household size and rgh the household risk.

The re-identification rate will then be:

ξh = (1/n) Σg |g| rgh

This re-identification rate can be used to define a threshold rh* on the household risk. Alternatively the user can set a threshold value in order to see how many unsafe households are present.
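A sketch of the household-level quantities: the household risk as the probability that at least one member is re-identified (computed here under an independence assumption between members, which is a simplification), and the household re-identification rate ξh = (1/n) Σg |g| rgh. All numerical values are hypothetical:

```python
def household_risk(member_risks: list[float]) -> float:
    """Probability that at least one household member is re-identified:
    r_g^h = 1 - prod(1 - r_i).  Independence between members is assumed
    here for simplicity."""
    p_no_reid = 1.0
    for r in member_risks:
        p_no_reid *= (1.0 - r)
    return 1.0 - p_no_reid

def household_reid_rate(sizes: list[int], risks: list[float]) -> float:
    """xi_h = (1/n) * sum over households g of |g| * r_g^h, where n is
    the total number of individuals in the sample."""
    return sum(s * r for s, r in zip(sizes, risks)) / sum(sizes)

# Hypothetical two-person household whose members carry individual
# risks 0.1 and 0.2:
print(round(household_risk([0.1, 0.2]), 4))  # -> 0.28
```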

The next stage is to consider how the above approach at both an individual and household level can be applied in Mu-Argus to produce a confidentialised ISS individual and household microdata file.

Methodology

Using Mu-Argus with the ISTAT approach

The ISTAT approach is based on calculating individual and household measures of disclosure risk. It has previously been discussed at Eurostat/UNECE conferences. In summary:

·  Base individual risk (BIR) – disclosure risk based on a probabilistic estimation of the individual re-identification risk, treating records in the file as independent of each other.

·  Base household risk (BHR) – This is calculated when the data have a hierarchical structure. This can be calculated when a household identifier is shown with each record.

This follows the database scenario with a match being given when the combination of key variables is the same in both the sample and the intruder's database.

The disclosure risk is defined as the probability that a match is correct (the record in the sample directly corresponds to that in the attacker's database). The risk is estimated in Mu-Argus, taking the sampling design into account.

Following this estimation of the risk a threshold needs to be decided upon.

Setting the Threshold

A risk threshold value and corresponding re-identification rate or maximum number of re-identifications (for both individuals and households) needs to be determined early in the process. This determines an acceptable level of risk in the released data. In general terms the threshold is chosen depending on the assumed scenario and on how the released data is to be used (research or public use).

Here we are looking at a file to be released for research by Eurostat on CD-ROMs. As the data are not to be made public, the threshold here could be set relatively high, allowing for greater data utility. A subjective judgement of the availability of the external archive can also affect the threshold level.

Other factors include:

·  Availability of intruder’s database providing identifiers and key variables

·  Database level of completeness and quality; lower quality data offers greater initial protection

·  Comparability between database and microdata to be released

·  Sampling fraction

An example of individual risk thresholds:

·  If the threshold = 0.01, an individual is ‘safe’ if there are at least 100 individuals with the same combination of key variables.

·  If the threshold = 0.05, an individual is ‘safe’ if there are at least 20 individuals with the same combination of key variables.

An example of re-identification rate:

·  If the sample size = 2000 and the number of expected re-identifications = 50 then the global re-identification rate (the proportion of correct re-identifications which can be obtained by an intruder when comparing target individuals in a sample with a population register containing key variables) equals (50/2000)* 100 = 2.5%.

·  If the sample size = 2000 and the number of expected re-identifications = 150 then the global re-identification rate equals (150/2000)* 100 = 7.5%.
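The worked examples above reduce to simple arithmetic; a sketch (the function names are illustrative only):

```python
def global_reid_rate_pct(expected_reids: float, n: int) -> float:
    """Global re-identification rate as a percentage: (ER / n) * 100."""
    return expected_reids * 100.0 / n

def min_safe_cell_count(threshold: float) -> float:
    """With an individual risk threshold t, a record is 'safe' when at
    least 1/t individuals share its combination of key variables."""
    return 1.0 / threshold

print(global_reid_rate_pct(50, 2000))   # -> 2.5
print(global_reid_rate_pct(150, 2000))  # -> 7.5
print(min_safe_cell_count(0.05))        # -> 20.0
```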

Mu-Argus: A brief summary

Mu-Argus is an interactive program which allows the user to:

·  Select a combination of key variables to tabulate.

·  Create a risk chart of this table. This displays on a logarithmic scale a histogram of risky records present at different risk thresholds.

·  Set the individual risk threshold manually by moving the cursor along the risk chart. A pre-specified risk threshold (or re-identification rate threshold) can also be entered.

·  Alter the individual risk threshold; the (global) threshold re-identification rate and the number of unsafe records are updated automatically.

·  Once a required individual risk threshold is determined, carry out confidentiality techniques such as global recoding in order to reduce the number of risky records.