A/72/43103

Supporting documents

I. Understanding history: de-identification tools and controversies

A. Aggregation

B. De-identifying unit-record level data

C. De-identification techniques

D. Does de-identification work?

E. Differential privacy

II. Engagements by the Special Rapporteur in Africa, America, Asia and Europe

III. Background on the open letter to the Government of Japan

IV. Activities of the Task Force Privacy and Personality

V. Description of the process for the draft legal instrument on surveillance

VI. Acknowledging assistance

VII. Procedural clarifications on the thematic report on Big Data and Open Data

I. Understanding history: de-identification tools and controversies

A. Aggregation

1. Producing de-identified aggregate statistics makes sense. The idea dates from the early days of data analysis: the 1910 United States census could tell the public how many people of each sex, age and marital status could read and write, but it did not identify individuals.

2. Population statistics or aggregates can often be safely published without significant risk to individual privacy. However, even this apparently simple approach raises algorithmic problems that are in general very hard to solve. The easiest way to see why is to consider the differencing attack. One query asks how many legislators have HIV; the other asks how many apart from the Speaker do. Neither query alone is particularly sensitive, but the combination of answers reveals sensitive information about one individual. In general, computing whether a combination of questions exposes an individual's data is a very hard computational problem.[1]
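The differencing attack can be illustrated with a minimal Python sketch; the dataset, names and values below are invented for illustration only.

```python
# Hypothetical toy dataset: one record per legislator, with a sensitive flag.
records = [
    ("Speaker", True),
    ("Member A", False),
    ("Member B", True),
    ("Member C", False),
]

def count_positive(dataset, exclude=None):
    """Aggregate query: how many people have the sensitive attribute,
    optionally excluding one named person."""
    return sum(1 for name, flag in dataset if flag and name != exclude)

# Two individually innocuous aggregate answers...
total = count_positive(records)                              # 2
total_without_speaker = count_positive(records, "Speaker")   # 1

# ...whose difference exposes one person's sensitive attribute.
print(total - total_without_speaker == 1)  # True: the Speaker has the attribute
```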

3. In aggregated form, de-identified personal information has numerous uses, such as tracking disease vectors, public health outcomes, population patterns or the effectiveness of emergency or humanitarian relief interventions.

4. By contrast, “unit-record level” data means that one individual’s record becomes one record in the published dataset.

B. De-identifying unit-record level data

5. Records generated from each individual’s use of telecommunications, web browsing, medical services and so on contain a great deal of information about that person. This raises the question of whether the personal information can ever be separated from the rest of the data. The debate about whether sensitive unit-record level data can be securely de-identified has become one of the most contentious issues in international information privacy law.

6. For example, a dataset listing the mobility data of many people might include each person's location (within their cellphone-tower range) at intervals of time (for example, every five minutes). The concern with mobility data is that if an attacker knows a person's precise location at a particular time, this might be enough to identify their record and hence retrieve all the rest of their movement data. The dataset might therefore be de-identified by randomly perturbing the times (by adding or subtracting a random time from 0 to 30 minutes, for example) or by aggregating the phone towers (gathering them into groups to give a much wider set of possible locations). The idea is to hide each individual in a large "crowd" so that an attacker cannot isolate one person’s record even given their precise location at particular times.
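A minimal sketch of the two ideas mentioned above (generalizing towers into coarser regions and randomly perturbing timestamps); the trace, tower identifiers and region mapping are invented for illustration.

```python
import random

# Hypothetical mobility trace: (tower_id, minutes since midnight).
trace = [(101, 480), (102, 485), (205, 490)]

# Assumed mapping from individual towers to coarser location groups.
tower_to_region = {101: "north", 102: "north", 205: "east"}

def deidentify(trace, max_shift_minutes=30):
    """Generalize each tower to its region and randomly shift each timestamp."""
    blurred = []
    for tower, minutes in trace:
        region = tower_to_region[tower]
        shifted = minutes + random.randint(-max_shift_minutes, max_shift_minutes)
        blurred.append((region, shifted))
    return blurred

print(deidentify(trace))
```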

7. Unfortunately, even the term "de-identified" is used in two quite different senses. One refers to an algorithm or process for removing obvious identifiers; the other refers to reaching a state in which the record cannot be re-identified. Without a precise definition, a number of heuristic approaches have been tried in practice. When the resulting datasets have been made public, some records have been re-identified.

C. De-identification techniques

8. At the outset, it is important to note that the use of terminology and definitions in this complex field of study is not always consistent. For example, distinctions can be made between terms such as ‘de-identification’, ‘anonymisation’ and ‘pseudonymisation’.[2] However, for the purposes of this report it is not necessary to distinguish between them. ‘De-identification’ is used as a general term covering any process used to separate identity from information in order to prevent identity from being inferred.

9. There are a number of commonly used de-identification techniques, which can be used singly or in combination. These tend to be developed to meet particular standards or definitions.

1. US Health Insurance Portability and Accountability Act 1996

10. One of the most notable, and often cited, de-identification standards is contained in the United States Health Insurance Portability and Accountability Act 1996. The Act defines two options for performing de-identification. The expert determination approach requires an expert with appropriate knowledge of, and experience with, methods for rendering information not individually identifiable. The Safe Harbor pathway is far more formulaic: it defines 18 types of identifiers, such as name, phone number and social security number, which must be suppressed or modified.
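A minimal sketch of the Safe Harbor idea of suppressing listed identifier fields; the record, field names and the small subset of identifier types shown are invented for illustration and are not the full list of 18.

```python
# Hypothetical record; field names are illustrative only.
record = {
    "name": "Jane Doe",
    "phone_number": "+1-555-0100",
    "social_security_number": "000-00-0000",
    "diagnosis": "asthma",
    "year_of_birth": 1970,
}

# Illustrative subset of identifier types to be suppressed or modified.
SUPPRESSED_FIELDS = {"name", "phone_number", "social_security_number"}

def safe_harbor_suppress(record):
    """Drop the fields corresponding to listed identifier types."""
    return {k: v for k, v in record.items() if k not in SUPPRESSED_FIELDS}

print(safe_harbor_suppress(record))  # only non-identifier fields remain
```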

2. k-anonymity

11. An early rigorous definition of privacy is k-anonymity.[3] It is designed to start with a dataset of individual records, such as medical records combined with demographic data. It assumes that the types of data that could be used for linking are known in advance; these are called “quasi-identifiers”. De-identification techniques are then applied to the quasi-identifiers, on the assumption that the rest of the data cannot be linked. The aim is that, with respect to the quasi-identifiers, each individual should be part of a set of at least k indistinguishable people.

12. For example, consider a database that includes the gender, address and date of birth of a few hundred people. Most people will be unique according to those three attributes, thus exposing any other information in their record. The simplest way to reduce uniqueness is generalization: simply broaden the values in each category. Instead of including whole addresses, they could be listed only as a region or province, for example, and the dates of birth given only by year or month. This substantially reduces the rate of uniqueness, and some people (such as those who live in large regions) will probably be part of a large set of identically valued individuals. However, some extraordinarily old people in small regions might still be unique. For these cases suppression is employed, meaning that their dates of birth are completely omitted. The treatment achieves k-anonymity if every combination of quasi-identifier values that occurs in the database appears at least k times.
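A minimal sketch of checking k-anonymity over the generalized quasi-identifiers; the records and the values of k are invented for illustration.

```python
from collections import Counter

# Hypothetical records after generalization: (gender, region, year of birth).
quasi_identifiers = [
    ("F", "north", 1980),
    ("F", "north", 1980),
    ("M", "east", 1975),
    ("M", "east", 1975),
    ("M", "east", 1975),
]

def is_k_anonymous(rows, k):
    """Every combination of quasi-identifier values must occur at least k times."""
    counts = Counter(rows)
    return all(count >= k for count in counts.values())

print(is_k_anonymous(quasi_identifiers, k=2))  # True
print(is_k_anonymous(quasi_identifiers, k=3))  # False: the first group has only 2 members
```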

13. This approach still leaves at least two possible sources of privacy breaches: the possibility that all k similar individuals carry the same sensitive trait, and the possibility that data other than the quasi-identifiers could be used to re-identify records. These observations have led to stronger definitions such as l-diversity[4] and t-closeness.[5]

14. The most important limitation of this approach is the assumption that only the quasi-identifiers can be used for re-identification.

3. Statistical perturbation

15. Another approach is to add some random noise to the data points. For example, a date of birth could have a randomly chosen number of days added or subtracted. This makes sense for numerical data.
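A minimal sketch of perturbing a date of birth by a random number of days; the range of the noise is an arbitrary choice for illustration.

```python
import random
from datetime import date, timedelta

def perturb_date_of_birth(dob, max_days=14):
    """Shift a date of birth by a uniformly random number of days."""
    return dob + timedelta(days=random.randint(-max_days, max_days))

print(perturb_date_of_birth(date(1980, 6, 15)))
```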

D. Does de-identification work?

16. Simply put, “de-identified” data is not, and the culprit is auxiliary information.[6]

17. If proper de-identification techniques and re-identification risk measurement procedures are used, re-identification remains a relatively difficult task.[7]

18. Does de-identification work? Are there techniques for de-identification that “work”, in the sense that they protect the privacy of sensitive unit-record level data while preserving most of the scientific value of the data?

19. The clearest and simplest demonstration that de-identification does not work is re-identification: showing that the identity of the person can be inferred from the data. A narrower perspective requires the de-identified data to be turned back into personal information that relates to an identifiable person.

20. The possibility of re-identification is greater in the Big Data and Open Data world. Success can depend on auxiliary information: extra information about the person (their age, place of work, medical history, etc.) that can be used to identify their record in the dataset. If an attacker trying to re-identify individuals does not know much about them, re-identification is unlikely to succeed. If the attacker has a vast dataset (with names) that closely mirrors enough information in the de-identified records, re-identification is assured.

21. For a particular collection of auxiliary information, we can ask a well-defined mathematical question: can the adversary identify someone uniquely on the basis of that auxiliary information alone?

22. A person’s record can be linked between two different datasets by identifying a combination of features that is unique in both. If one dataset has names, linking is called re-identification. A study of a de-identified mobility dataset showed that combinations of a few data points are still highly likely to be unique.[8] If the attacker knows a few different locations where the target person was at certain times, this is highly likely to be enough to find their record. The researchers showed that aggregating locations into larger geographical areas, or perturbing the times by larger amounts, makes only a small difference. When a person's record contains many independent points of data (precisely what makes it interesting for research), it is likely to remain unique even when a substantial amount of the information is removed.
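A minimal sketch, with invented traces, of the uniqueness test underlying such linkage: counting how many records in a dataset are consistent with a handful of known (time, location) points about the target.

```python
# Hypothetical de-identified traces: record id -> set of (hour, region) points.
traces = {
    "r1": {(8, "north"), (13, "east"), (19, "north")},
    "r2": {(8, "north"), (13, "west"), (19, "south")},
    "r3": {(9, "east"), (13, "east"), (19, "north")},
}

# Auxiliary information: a few places the target is known to have been.
known_points = {(8, "north"), (13, "east")}

# Records whose trace contains every known point about the target.
matches = [rid for rid, points in traces.items() if known_points <= points]
print(matches)  # ['r1'] -- two known points already isolate a single record
```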

23. Computer scientists have used linkage attacks to re-identify data from various sources, including telephone metadata, mobility data,[9] social network connections,[10] health data,[11] credit card transactions[12] and online ratings.[13] These attacks work by identifying a “digital fingerprint” in the data, meaning a combination of features that uniquely identifies a person. If two datasets have related records, one person’s digital fingerprint should be the same in both. This allows a person’s data to be linked across the two different datasets: if the additional dataset has names, then the "de-identified" dataset can be re-identified. This is not necessarily sophisticated: re-identification based simply on linking with information publicly available online has also been reported.[14] It is not surprising that there are disasters when "de-identified" data is not produced in a rigorous way.

24. Re-identification is, of course, the strongest possible form of attack on a de-identified dataset. Significant information may still leak even if an attacker cannot identify individual records. For example, the attacker may be able to narrow the possible matches down to two records; if both share the same sensitive attribute, that attribute is exposed even though neither record is individually identified. This has led to more sophisticated definitions of privacy breaches that emphasize the information leaked, rather than re-identification alone.

E. Differential privacy

25. Differential privacy[15] is a rigorous definition of how much (or how little) information about an individual is leaked when aggregate statistics of a sensitive dataset are released to untrusted third parties. It assumes a very strong attacker who knows everything about the dataset and the release mechanism, but not whether a particular individual's record is in the dataset. In this setting, utility is measured as the deviation of the private responses from the ideal non-private responses.

26. Informally, an algorithm is differentially private if its outputs do not change very much when one person (i.e. one record) is added to or removed from the dataset. A key requirement for differential privacy is randomization: deterministic algorithms cannot be differentially private, because the response distribution of a differentially private algorithm must not change too much when any individual record is removed. The major generic approaches to privatizing a non-private target function involve adding noise to the target's input (the original data), to elements of the target's transformation, or to the target's output. These techniques have been applied to mobile device telemetry, census data, recommendation systems and more. A surprisingly small perturbation can introduce a large uncertainty about individual people. Typically, the more sensitive the non-private responses are to input perturbation, the more randomization is required to maintain differential privacy, at a cost to utility. A calculus of composition delivers rigorous results on the combined effects of answering multiple queries or combining different sensitive datasets.
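A minimal sketch of the output-perturbation approach, using the Laplace mechanism for a simple counting query; the dataset and the value of epsilon are arbitrary illustrations.

```python
import numpy as np

def private_count(data, predicate, epsilon=1.0):
    """Epsilon-differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one record changes
    the true count by at most 1), so Laplace noise of scale 1/epsilon suffices.
    """
    true_count = sum(1 for record in data if predicate(record))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical records with a sensitive attribute.
records = [{"sensitive": True}, {"sensitive": False}, {"sensitive": True}]
print(private_count(records, lambda r: r["sensitive"], epsilon=0.5))
```

A smaller epsilon means more noise and therefore stronger privacy but lower accuracy, reflecting the trade-off described above.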

27. Importantly, differential privacy is not a guarantee of perfect privacy, which would be impossible in a setting whose very purpose is to achieve utility by providing some information about the dataset as a whole. Many important results in this field are lower bounds: proofs that only a limited number of queries can be answered with reasonable accuracy without exposing information about individuals.[16] This is a fundamental limit on the trade-off between the accuracy and privacy of randomly perturbed data.

II. Engagements by the Special Rapporteur in Africa, America, Asia and Europe

1. 71st Session of the United Nations General Assembly, New York, United States, 24 October 2016

2. MAPPING project second General Assembly, Prague, November 2016

3. 46th Asia Pacific Privacy Authorities (APPA) Forum, Mexico City, 30 November to 2 December 2016

4. Cyberspace Conference 2016, Brno, Czech Republic, 25-26 November 2016

5. Irish Council for Civil Liberties, Dublin, 7 December 2016

6. Human Rights Commission Annual Statement Launch, Belfast, Northern Ireland, 8 December 2016

7. 10th International Conference on Computers, Privacy and Data Protection (CPDP2017), Brussels, 25-27 January 2017

8. Madrid, Spain, 1-2 February 2017

9. 58th Annual Convention of the International Studies Association, Baltimore, United States, 22-25 February 2017

10. Internet Corporation for Assigned Names and Numbers (ICANN), ICANN 58 Community Forum, 11-16 March 2017

11. 2017 European Broadcasting Union Big Data Conference, Geneva, Switzerland, 21 March 2017

12. Malta IT Law Association Cybercrime Conference, St Julian’s, Malta, 28 March 2017

13. UNICEF Panel Discussion on Child Online Rights, Privacy and Freedom of Expression, RightsCon 2017, Brussels, 29 March 2017

14. GIG-ARTS (Global Internet Governance Actors, Regulations, Transactions and Strategies) Conference, Paris, 30-31 March 2017

15. Barcelona, Spain, 3 April 2017

16. Dublin, Ireland, 4 April 2017

17. Northern Ireland, United Kingdom, 5 April 2017

18. 2017 Annual Conference of the British and Irish Law, Education and Technology Association, Braga, Portugal, 20-21 April 2017

19. Indonesia, 23 April to 4 May 2017

20. Privacy, Personality & Flows of Information in the MENA region, Tunis, 23-25 May 2017

21. Data Summit, Dublin, 15-16 June 2017

22. Official country visit to the United States of America, Washington, New York, Chicago, Sacramento, San Francisco, 19-28 June 2017

23. Meetings with stakeholders, Geneva, Switzerland, 28-30 June 2017

24. Privacy Laws & Business 30th Annual International Conference, Cambridge, United Kingdom, 3-5 July 2017

25. MAPPING Workshop on the Surveillance Legal Instrument, Paris, 13-14 September 2017

26. MAPPING Law Enforcement Workshop on the Surveillance Legal Instrument, INTERPOL, Lyon, France, 15 September 2017

27. International Conference of Data Protection & Privacy Commissioners, Hong Kong, China, 25-29 September 2017

28. Privacy, Personality & Flows of Information in Asia, Hong Kong, 29-30 September 2017

29. Keynote speech at the National Symposium on Surveillance, Japan Civil Liberties Union; keynote speech and panel discussion at the “Privacy and Personality in Japan: past, present and future” seminar, Waseda University; keynote speech at the Seminar on Surveillance and Safeguards organized by the Japan Federation of Bar Associations; Tokyo, 1-3 October 2017

This list does not include remote participation in events in Ghana (20 April 2017) and Japan (multiple events, May-June 2017).

III. Background on the open letter to the Government of Japan

1. On 18 May 2017, the Special Rapporteur took the unusual step of publishing an open letter to the Government of Japan on the Special Rapporteur’s mandate page on the OHCHR website, two hours after OHCHR had faxed the letter to the Permanent Mission of Japan to the United Nations Office and other international organizations in Geneva. The letter was addressed directly to the Prime Minister of Japan in response to the latter’s insistence on introducing a law commonly dubbed “the anti-conspiracy bill”, ostensibly to permit Japan to ratify the 2000 United Nations Convention against Transnational Organized Crime.

2. It is important to emphasize that the Special Rapporteur was compelled to write an open letter to the Japanese Government because of the extremely short time frame which the Government had set itself for having the law passed by the National Diet. Under normal circumstances, the Special Rapporteur would have proceeded with a whole series of actions out of the public eye, in direct dialogue with the Government. This was not possible in a situation where the Government set itself a deadline of less than 90 days for getting the bill through both chambers of the National Diet, a process it initiated after two previous failed attempts to pass such a bill over a period of ten years.

3. The method chosen and the extremely short time frame pursued by the Japanese Government raised suspicions. These suspicions were further reinforced by the argumentation publicly presented by the Japanese Government, i.e. that the new legislation was needed to enable Japan to accede to the 2000 United Nations Convention against Transnational Organized Crime in order to be better able to prevent terrorism ahead of the 2020 Summer Olympic Games in Tokyo. The argument does not stand, as the treaty was never designed to counter terrorism but rather organized crime, money laundering and drug trafficking. As a number of other experts have testified, including some in the Japanese Diet, most of the provisions contained in the anti-conspiracy bill were unnecessary for Japan to be able to accede to that convention, which only required Japan to criminalize the forming of a conspiracy. Moreover, the need to introduce new legislation would not justify the absence of the privacy safeguards that the Special Rapporteur indicated in his open letter of 18 May 2017. The Japanese Government’s argumentation is based on the type of political rhetoric which the Special Rapporteur categorically criticized and rejected in his report to the Human Rights Council in March 2017, in which he took political leaders worldwide to task for using the psychology of fear to push through legislation that infringes on the right to privacy. This has not happened only in Japan: the Special Rapporteur has also consistently criticized other Governments that have used the fear of terrorism to push through legislation which falls short of their international human rights law obligations. The Government of Japan deposited its instrument of ratification of the Convention on 11 July 2017, directly after the law was enacted. However, the extent of the powers created by the law has not yet been shown to be necessary and proportionate in a democratic society.