August 19, 2009

Has there been a failure of anonymization?

Paul Ohm recently put out an article in which he makes the dramatic claim that de-identification has failed. I have heard that argument before, and the argument’s primary weakness is amplified in this article, so I feel compelled to comment.

Paul Ohm’s argument about the failure of anonymization is based on evidence that does not actually support his point. Therefore, his overall argument about de-identification is very questionable. Below I will explain why.

Ohm’s key point is that existing re-identification successes demonstrate that de-identification does not work. This, of course, assumes that the datasets that were re-identified were properly anonymized; they were not. One example that Ohm uses to make his case is the insurance database released in Massachusetts more than a decade ago (pre-HIPAA). That database was not properly anonymized, and no professional working in this field would say that it was; the Group Insurance Commission did a lousy job. The second example is AOL, which is again a database that was not properly anonymized; AOL did a lousy job of anonymizing it. In fact, the examples he cites are cases where the custodian neither used existing re-identification risk measurement techniques nor the de-identification techniques available in the literature. We know how to de-identify datasets properly (up to a pre-specified threshold), and in none of those examples was this done. There is no example of a properly de-identified database being re-identified.

So I want to make a distinction between lousy practice and good practice. Having been a software engineer in a previous life, I will use a software example. There are different levels of maturity in software development, and we measure project risk using a maturity scale. If I see a couple of software projects that produce buggy software and do not deliver on time, I would not conclude that all software development is lousy and that software engineering is therefore dead and should be abandoned, which is where Ohm’s reasoning would lead me. It just happened that I selected a couple of low-maturity projects; if I had selected high-maturity projects I would have a very different picture.

Ohm has taken examples of poorly de-identified datasets that were re-identified and drawn broad conclusions from them. A truly sophisticated custodian would measure the risk of re-identification and, if it is too high, use a contemporary de-identification technique to de-identify the data; examples of both are available in the literature and the references therein.
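To make that concrete, here is a minimal sketch of what measuring re-identification risk against a pre-specified threshold can look like. The data, field names, and threshold below are entirely hypothetical, and this is only an illustration, not a description of any particular tool or of our own methodology: records are grouped by their quasi-identifiers and the worst-case risk (one over the smallest group size) is compared to the threshold.

    # Illustrative sketch only; the dataset, fields, and threshold are made up.
    from collections import Counter

    def max_reidentification_risk(records, quasi_identifiers):
        """Worst-case probability of correctly re-identifying any record,
        assuming an intruder knows only the quasi-identifier values."""
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        smallest_group = min(groups.values())
        return 1.0 / smallest_group

    dataset = [
        {"age": "30-39", "zip3": "021", "diagnosis": "asthma"},
        {"age": "30-39", "zip3": "021", "diagnosis": "diabetes"},
        {"age": "40-49", "zip3": "021", "diagnosis": "asthma"},
        {"age": "40-49", "zip3": "021", "diagnosis": "asthma"},
        {"age": "40-49", "zip3": "021", "diagnosis": "flu"},
    ]

    risk = max_reidentification_risk(dataset, ["age", "zip3"])
    threshold = 0.2  # i.e., every record hidden among at least 5 others (k = 5)
    print(f"max risk = {risk:.2f}; release OK: {risk <= threshold}")

If the measured risk exceeds the threshold, the custodian would then apply a de-identification technique (generalization, suppression, and so on) and re-measure before releasing the data.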

If a custodian discloses a dataset that has proper identity and attribute disclosure control (i.e., the risk of re-identification is below a threshold), and an intruder demonstrates that the risk of re-identification is higher than the threshold, then there should be concern. This article does not demonstrate that at all. However, a valid conclusion from the article would be that if you do lousy de-identification then the data is easy to re-identify.

Therefore, extreme caution is advised before accepting the conclusion that anonymization has failed.

August 19, 2009 | [Originally posted on the ehip blog]

Comments

Thank you for reading the paper and for your helpful comments. I must respectfully disagree with your conclusions. Of the five or six examples I cite in my paper, you cite only two, and you neglect to mention the studies by Vitaly Shmatikov, Arvind Narayanan, and Justin Brickell, all of which demonstrate reidentification attacks on what I believe you would call "properly de-identified" data.

Second, you neglect to mention at all the theoretical work I cite by Shmatikov, Brickell, and Dwork that demonstrates general limits of de-identification.

Third, your failure to cite Shmatikov and Brickell is especially curious because you then point to a paper in JAMIA about k-anonymity and you refer to this as an example of what a "truly sophisticated custodian" would do. The research by Shmatikov and Brickell does a fairly good job showing the significant (some say fatal) limits of k-anonymity.

So thank you once again for the response. One of my most sincere desires is that I can use this paper as a vehicle to reach out to communities in health privacy and medical informatics, so I greatly appreciate the opportunity.

Posted by: Paul Ohm | August 20, 2009 at 09:59 AM

Paul raises a number of valid comments in his response, so I will address these below.

I did not comment on the other studies originally because that would bring a whole set of new issues into the discussion, but here goes. I must admit at the outset that I have a bias towards health information, and that is the lens I use.

In terms of other examples of re-identification not in Paul Ohm’s paper, there are several mentioned in our k-anonymity paper: the Chicago homicide database, the Illinois Department of Public Health, the Canadian adverse event database, and the CBC. There is also another one mentioned in our IEEE S&P paper, involving a prescription database.

So there are many examples of re-identification. All of these were due to inappropriate de-identification of the data before disclosure. So they are all consistent with my earlier point.

The Netflix example is a bit different because it pertains to a ‘transaction’ or very high dimensional dataset. This presents its own set of difficulties. There are datasets that look like that in a medical context as well (e.g., when one considers diagnoses and drugs). Recently there has been quite a bit of activity on developing techniques for assessing risk and de-identifying transactional datasets, and there is some unpublished work on this in the healthcare context that should be coming out within the next 12 months or so (by our group and others). Therefore, the main point about lousy de-identification before releasing data remains in that example as well.

So, I would not argue that any of these examples represent a “truly sophisticated custodian” at all (at least within the context of the examples).

The Brickell work is very interesting, but it is not the last word on the issue. Here are a few considerations. All of the examples of re-identification that I cited above are of identity disclosure, so one can argue that this is really what matters because that is what today’s intruders are doing. The Brickell paper measured risk in terms of attribute disclosure. De-identification criteria like k-anonymity do not address attribute disclosure, so naturally k-anonymity algorithms will not perform well on a criterion that they are not addressing. The most commonly used value is k = 5, and they did not really look at that. Of course, it would be nice to see the same results on multiple datasets rather than a single one. We have been doing this for a few years, and we have de-identified datasets that have been used by researchers, commercial entities, and policy makers. Also, one can quibble with the way the researcher vs. attacker variable selection / workload is defined.
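As an aside, the identity versus attribute disclosure distinction is easy to illustrate. The following sketch uses made-up data, field names, and a made-up k value; it simply shows that a group of records can satisfy k-anonymity (so no individual record can be singled out) while still leaking a sensitive attribute because everyone in the group shares the same value.

    # Illustrative sketch only; the records and k value are hypothetical.
    from collections import defaultdict

    k = 2
    records = [
        ("30-39", "021", "asthma"),
        ("30-39", "021", "asthma"),   # k-anonymous group, but homogeneous
        ("40-49", "021", "asthma"),
        ("40-49", "021", "diabetes"),
    ]

    groups = defaultdict(list)
    for age, zip3, diagnosis in records:
        groups[(age, zip3)].append(diagnosis)

    for qi, diagnoses in groups.items():
        k_anonymous = len(diagnoses) >= k          # identity disclosure control
        attribute_leak = len(set(diagnoses)) == 1  # attribute disclosure anyway
        print(qi, "k-anonymous:", k_anonymous, "attribute leak:", attribute_leak)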

In practice, de-identification has to be included in a more general risk assessment framework, which is similar to the conclusion that Paul Ohm reaches (albeit I would keep de-identification in as part of the framework). We have developed such a framework, which includes motive, invasion of privacy, and the security and privacy practices of the recipient. This can serve as a starting point for a discussion.

Posted by: kelemam | August 20, 2009 at 01:15 PM

Khaled,

Thank you for the detailed response. As I read it, your response lends much more support than refutation for my paper. Let me try to summarize what you have said:

1. I could have cited three other examples of reidentification in my paper, which would have piled on support for my point.

2. The Netflix study involves transaction data, which is directly applicable to medical privacy questions...

3. ...but unpublished studies will demonstrate that we might have new techniques to protect these. Maybe.

4. So although I have pointed to a half-dozen examples of sophisticated, well-resourced companies and government agencies performing woeful anonymization, this demonstrates only that lots of people do a lousy job anonymizing, most of the time.

5. The theoretical work demonstrates that k-anonymity does not work well on attribute disclosure. k-anonymity does work well against identity disclosure.

[My response to this one: why should this distinction matter to policymakers? If attribute disclosure reduces entropy which can be used to destroy identity, shouldn't we consider this a huge flaw that deserves regulatory response? Isn't it as if you are arguing, "Who cares that we can destroy privacy by looking through the windows, aren't you impressed by how hard it is to look through the door?"]

6. You agree completely with my prescription: a nuanced risk assessment.

So, in summary, we agree about almost everything, and on the small things on which we might disagree, you principally point to unpublished studies. I'm very happy that we agree about so much.

Why then on another blog do you say, "The case is not as strong as it initially seems"?

Posted by: Paul Ohm | August 22, 2009 at 12:18 PM

And thank you for the link to your framework. I had not seen it, and I will be sure to incorporate it into my next draft. It is impressive, and it once again shows how close we are on this topic.

Posted by: Paul Ohm | August 22, 2009 at 12:21 PM

Paul, I do apologize if I was not very clear.

I think my main point remains that the examples you mention in your paper, plus the ones that I mention, do not support the main title of your paper (and the main thesis of the first half or so). They are not examples of anonymized data being re-identified. They are examples of data that have not been anonymized in any meaningful way being re-identified, which I think leads to a very different conclusion. That is really the key point.

One alternative conclusion from these half a dozen examples would be the need to have better guidelines and enforcement of best practices for de-identification because clearly many people are disclosing data without proper de-identification and when they do that it is easy to re-identify.

Regarding the de-identification of transaction data (such as Netflix), there are at least half a dozen different techniques published in the computer science literature on how to de-identify such data already. The unpublished work pertains to applying some of those ideas to health data.
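To give a flavour of the simplest of those ideas, here is a deliberately stripped-down sketch with made-up data; it is not any specific published algorithm. One building block is to suppress items that appear in fewer than k transactions, since an intruder who knows that someone was prescribed a rare drug could otherwise single that person out. The published techniques go much further, handling combinations of items and generalizing rather than only suppressing.

    # Deliberately simplified sketch; item names, data, and k are hypothetical.
    from collections import Counter

    k = 3  # every remaining item must appear in at least k transactions

    transactions = [
        {"drugA", "drugB"},
        {"drugA", "drugC"},
        {"drugA", "drugB", "drugD"},
        {"drugB", "drugC"},
    ]

    support = Counter(item for t in transactions for item in t)
    rare_items = {item for item, count in support.items() if count < k}
    released = [t - rare_items for t in transactions]

    print("suppressed items:", rare_items)
    print("released transactions:", released)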

I cannot tell you how many times a custodian has told me that they have anonymized a dataset when, after close inspection, it turns out that they basically just removed the names. So claims that a dataset is anonymized, without concrete evidence that this was properly done and how, are, to me at least, meaningless.

Re your response to point 5 – fair enough. But the issue around this point is which technique is best – and we can have a much longer conversation on that one at some future point. It still does not support the argument about anonymization failures as there are techniques to protect against attribute disclosure as well.

(I am using “de-identification” and “anonymization” interchangeably – it is Saturday and I should really be out playing with the kids, so I am taking shortcuts today)

Posted by: kelemam | August 22, 2009 at 03:16 PM

Thank you, Khaled. This has been a very fruitful exchange, and I hope it helps others understand the nature of this debate better. I have the feeling you and I will have many other chances to discuss and debate this topic in the coming months and years. I look forward to it!

I, too, should be playing with my kids on this Saturday, but unfortunately, I'm stuck in my office instead!

Posted by: Paul Ohm | August 22, 2009 at 03:24 PM

[Reformatted from the old EHIL blog into a Word document on August 13, 2014]