The concept of identifiability is critical to managing privacy risks when collecting, using, and disclosing personal information. In this article we therefore present a framework for reasoning about this concept, and then discuss some important implications.
It is important to distinguish between directly identifying variables, quasi-identifiers, and sensitive variables. This knowledgebase article provides some definitions.
If the disclosed database contains directly identifying variables then it is clearly personal information. However, if a database contains quasi-identifiers it can still be personal information. This is a very important point when managing risks from holding sensitive health information. The real examples provided in the table below show how individuals’ identities could be determined by using only the quasi-identifiers. In none of these examples was any directly identifying information included in the database, but it was still possible to determine the identity of at least one individual (and in some cases most of the individuals in the database).
| Example | Details |
|---|---|
| AOL search data | AOL put anonymized Internet search data (including health-related searches) on its web site. New York Times reporters were able to re-identify an individual from her search records within a few days. |
| Chicago homicide database | Students were able to re-identify a significant percentage of individuals in the Chicago homicide database by linking with the social security death index. |
| Netflix movie recommendations | Individuals in an anonymized publicly available database of customer movie recommendations from Netflix are re-identified by linking their ratings with ratings in a publicly available Internet movie rating web site. |
| Re-identification of the medical record of the governor of Massachusetts | Data from the Group Insurance Commission, which purchases health insurance for state employees, was matched against the voter list for Cambridge, re-identifying the governor’s health insurance records. |
| Southern Illinoisan vs. The Department of Public Health | An expert witness was able to re-identify with certainty 18 out of 20 individuals in a neuroblastoma data set from the Illinois cancer registry, and was able to suggest one of two alternative names for the remaining two individuals. |
| Canadian Adverse Drug Event Database | A national broadcaster aired a report on the death of a 26 year-old student taking a particular drug who was re-identified from the adverse drug reaction database released by Health Canada. The national broadcaster matched the information from the report to the publicly available obituaries for that area of Ontario. |
| Prescription and diagnosis records of a patient re-identified | A neighbour re-identifies the records of a hospital patient by knowing her approximate age, gender, approximate admission date, and postal code. |
The implication from this observation is that data may still be personal information even after the removal or obfuscation of the directly identifying variables.
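The kind of linking used in the examples above can be sketched in a few lines. This is only an illustration under stated assumptions: the records, the names, the register, and the choice of quasi-identifiers (`postal_code`, `dob`, `sex`) are all made up, and real attacks typically need fuzzier matching than the exact join shown here.

```python
from collections import defaultdict

# Hypothetical disclosed records: no direct identifiers, only
# quasi-identifiers plus a sensitive variable.
disclosed = [
    {"postal_code": "K1A0B1", "dob": "1984-03-12", "sex": "F", "diagnosis": "asthma"},
    {"postal_code": "K1A0B1", "dob": "1990-07-30", "sex": "M", "diagnosis": "diabetes"},
    {"postal_code": "M5V2T6", "dob": "1975-11-02", "sex": "F", "diagnosis": "hypertension"},
]

# Hypothetical public register (for example a voter list) that carries
# names alongside the same quasi-identifiers.
public_register = [
    {"name": "Alice Example", "postal_code": "K1A0B1", "dob": "1984-03-12", "sex": "F"},
    {"name": "Bob Example", "postal_code": "K1A0B1", "dob": "1990-07-30", "sex": "M"},
    {"name": "Carol Example", "postal_code": "M5V2T6", "dob": "1975-11-02", "sex": "F"},
    {"name": "Dana Example", "postal_code": "M5V2T6", "dob": "1975-11-02", "sex": "F"},
]

QUASI_IDENTIFIERS = ("postal_code", "dob", "sex")


def link_records(disclosed_rows, register_rows, keys=QUASI_IDENTIFIERS):
    """Return (disclosed record, candidate names) pairs from an exact match on keys."""
    index = defaultdict(list)
    for person in register_rows:
        index[tuple(person[k] for k in keys)].append(person["name"])
    return [
        (row, index.get(tuple(row[k] for k in keys), []))
        for row in disclosed_rows
    ]


results = link_records(disclosed, public_register)

# A record is re-identified with certainty only when exactly one name matches.
unique_matches = [(row["diagnosis"], names[0]) for row, names in results if len(names) == 1]
```

In this toy data the first two records each match exactly one person in the register, while the third matches two people and so cannot be attributed with certainty.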
Five Level Model
In a recent article we elaborated on a five-level model of data identifiability: http://www.computer.org/portal/web/csdl/doi/10.1109/MSP.2010.103. Here we provide a brief summary of that model - please see the full paper for the details.
We can use this model to understand different types of data and the risks they imply.
Level 1 pertains to information that is clearly identifiable. For example, a database containing names, SSNs and financial transaction information about individuals would be clearly Level 1 on our identifiability scale. At this level no real effort is needed to re-identify an individual. If we have someone's name and address, we know who they are.
Level 5 pertains to information that is clearly not identifiable, namely aggregate information, which consists of counts. For example, a table showing that 25 people died from H1N1 pandemic influenza in Canada in November would be considered aggregate data.
Masked data (Level 2) has had some manipulations done to the identifying variables. However, with masked data nothing has been done to obfuscate the quasi-identifiers. A more detailed discussion of masking methods is provided here: http://www.ehealthinformation.ca/documents/DeidTechniques.pdf. Because the quasi-identifiers are not touched in Level 2 data, this is effectively still personal information.
The difference between Level 2 and Level 3 is that in the latter the organization attempts to obfuscate the quasi-identifiers as well as the identifying variables. Level 3 data is very common. This level exists because many organizations do not use sound means to de-identify the quasi-identifiers, and therefore they do a poor job at it. For example, in a Canadian context, the date of birth and full postal code uniquely identify many Canadians living in urban areas, making that combination of quasi-identifiers very high risk for re-identifying individuals. Reducing the precision of the postal code to a five character postal code does not actually reduce the risk of re-identification, but is quite a common practice. We termed this level “Exposed” because the organization may believe that they have de-identified the data, but in fact the risk of re-identification is still high. Therefore, the data custodian has a high risk exposure and may not know it (i.e., the data custodian will not have put in place any controls to mitigate those risks).
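One way to check whether a generalization actually helps is to count how many records remain unique on their quasi-identifiers before and after the change. The sketch below does this on made-up data; the postal codes and dates are hypothetical and were chosen so that every person has a distinct date of birth, which is the situation where truncating the postal code by one character changes nothing. Results on real data would have to be measured.

```python
from collections import Counter

# Hypothetical (postal_code, date_of_birth) pairs.
records = [
    ("K1A0B1", "1984-03-12"),
    ("K1A0B2", "1984-07-30"),
    ("K1A0B3", "1984-11-02"),
    ("M5V2T6", "1962-01-19"),
    ("M5V2T7", "1962-09-05"),
]


def generalize(rows, postal_len, dob_len):
    """Keep only the first postal_len / dob_len characters of each field."""
    return [(postal[:postal_len], dob[:dob_len]) for postal, dob in rows]


def fraction_unique(rows):
    """Share of records whose quasi-identifier combination appears exactly once."""
    counts = Counter(rows)
    return sum(1 for row in rows if counts[row] == 1) / len(rows)


unique_six_char = fraction_unique(records)
# Five character postal code, full date of birth.
unique_five_char = fraction_unique(generalize(records, 5, 10))
# One character postal code, year of birth only.
unique_coarse = fraction_unique(generalize(records, 1, 4))
```

Here every record is still unique after the postal code is cut to five characters, and uniqueness only drops once both the postal code and the date of birth are generalized much more aggressively.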
It should also be noted that a successful re-identification of "Exposed" data says nothing about the effectiveness of sound de-identification methods. In fact, one would expect "Exposed" data to be quite easy to re-identify.
With level 4 data an objective assessment of the re-identification risk is performed and a data custodian can substantiate claims that the data is properly de-identified. Level 4 data can be microdata or in tabular form; the same point about objective risk assessment applies for tabular data. It is only at level 4 that data moves from being personal information to not being personal information. Level 4 data is called "Managed" because the risk of re-identification is managed by the data custodian. The risk of re-identification may vary across organizations that disclose Level 4 data, but in all cases it is managed. This means that the custodian knows what the risk of re-identification is objectively (i.e., the custodian has measured it), and has taken reasonable actions to manage that risk.
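As a rough illustration of what measuring the risk can look like, the sketch below groups records by their quasi-identifier values and derives two figures from the group sizes. The metric (one divided by the size of the group a record falls in), the threshold value, and the data are all assumptions made for this example; the article does not prescribe a particular metric or threshold.

```python
from collections import Counter


def reidentification_risk(rows):
    """Return (maximum risk, average risk) over records.

    Each record's risk is taken to be 1 / (size of its equivalence class),
    where an equivalence class is the set of records sharing the same
    quasi-identifier values. This is an assumed metric for illustration.
    """
    class_sizes = Counter(rows)
    max_risk = 1 / min(class_sizes.values())
    # Averaging 1/size over all records is the same as
    # (number of classes) / (number of records).
    avg_risk = len(class_sizes) / len(rows)
    return max_risk, avg_risk


# Hypothetical generalized records: (postal code prefix, age, sex).
rows = [("K", 36, "F")] * 5 + [("K", 36, "M")] * 4 + [("M", 58, "F")] * 10

THRESHOLD = 0.2  # hypothetical, pre-specified by the data custodian

max_risk, avg_risk = reidentification_risk(rows)
below_threshold = max_risk <= THRESHOLD
```

With these made-up rows the smallest group has four records, so the maximum risk is 0.25 and the data set would not meet a 0.2 threshold without further generalization or suppression.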
Data at level 4 and level 5 present the least risk to the organization because a strong case can be made that this is not personal information any more.
Re-identification, Effort, and Skill
As data moves up the scale, more effort is needed to re-identify it. Therefore, even though level 2 and level 3 data are still considered personal information and the risk of re-identification would be considered quite high, the amount of effort needed to re-identify an individual increases at each level.

For example, consider a data set with full name and address, and the date of birth. This would be a level 1 data set. The level 2 version of this data set has had its identifying variables masked, so that all we have left is a postal code and date of birth. With some effort those two demographic variables can be linked to a person (e.g., a full name and address). This re-identification takes more effort than with the level 1 version of the data, where we already had the name and address. Similarly, level 3 data requires more effort to re-identify than level 2 data.
As a corollary, higher level data also takes more expertise and skill to re-identify than lower level data. Therefore, a lay person can re-identify level 1 data, but it is quite likely that an intruder skilled at re-identification would be needed to re-identify data at levels 2 and 3. However, with effort and skill the probability of successful re-identification is high.
For data at level 4 and level 5, even an intruder with significant effort and skill will have a lower probability of re-identifying individuals in the disclosed data set. This is represented in the graph above as a disproportionate increase in resources and skills needed by an intruder to re-identify that kind of data. It is not impossible, but the pre-requisites have increased dramatically.
Furthermore, level 4 and 5 data have a built-in disincentive for an intruder in that it is not worth it for them to invest their skills and time into re-identification if the probability of a successful match is low. The value of the re-identified data, whether economic, political, notoriety, or based on some other criterion, would have to be quite high for an intruder to invest the necessary resources to attack a level 4 or level 5 data set. One consequence of this is that few people would be able and willing to re-identify a level 4 or 5 data set. If a level 4 or level 5 data set is lost or stolen, it is not an automatic consequence that an attempt will be made to re-identify the data.
Another important consideration when we talk about re-identification effort is whether we are concerned about the re-identification of a single individual or many individuals in the disclosed data. Of course, the re-identification of many individuals will require more effort than a single individual. Therefore, when we talk about re-identification effort, we mean a normalized effort (i.e., effort to re-identify a single individual) so that the five level model applies irrespective of the data set size.
Example
As a concrete example, let us consider a hypothetical clinical data set with the following variables: first name, last name, health insurance number, street address, six character postal code, date of birth, date of doctor’s visit, and whether the individual has a sexually transmitted disease. In the five level framework, the data sets would be:
Level 1. The full data set as is.
Level 2. The names are replaced by fake names, the health insurance number is replaced with a fake number, and the street address field is removed altogether.
Level 3. The data set at Level 2 also has the postal code generalized from six characters to five characters. The risk at Level 3 is the same as Level 2, but the organization believes it has de-identified the data and discloses it. Therefore, the organization is exposed.
Level 4. The data set at Level 3 is further modified by replacing the five character postal code with a single character postal code, the date of birth is replaced by age, and the date of visit is replaced by the month of the visit. A re-identification risk assessment is then performed on this data set and the risk is found to be below a pre-specified threshold.
Level 5. The number of individuals with a sexually transmitted disease.
In this example, the data in the last two levels would be considered de-identified data, but the first three would still be personal information.
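The transformations in this example can be sketched as a few small functions. Everything concrete here is hypothetical: the field names, the sample record, the fake-value scheme, computing age as of the visit date, and keeping the year together with the month of the visit. Note also that the last record-level step only produces a candidate for Level 4; it counts as Level 4 only once a risk assessment shows the risk is below the pre-specified threshold.

```python
from datetime import date

# Hypothetical Level 1 record.
record = {
    "first_name": "Alice",
    "last_name": "Example",
    "health_insurance_number": "1234567890",
    "street_address": "1 Example Street",
    "postal_code": "K1A0B1",
    "date_of_birth": date(1984, 3, 12),
    "visit_date": date(2010, 11, 5),
    "has_std": True,
}


def to_level2(rec, pseudonym_id):
    """Mask the directly identifying variables; quasi-identifiers are untouched."""
    out = dict(rec)
    out["first_name"] = f"Person{pseudonym_id}"
    out["last_name"] = "Masked"
    out["health_insurance_number"] = f"FAKE{pseudonym_id:06d}"
    del out["street_address"]
    return out


def to_level3(level2_rec):
    """Generalize the postal code from six to five characters."""
    out = dict(level2_rec)
    out["postal_code"] = out["postal_code"][:5]
    return out


def to_level4_candidate(level3_rec):
    """Coarsen the quasi-identifiers; a risk assessment must still follow."""
    out = dict(level3_rec)
    out["postal_code"] = out["postal_code"][:1]
    dob = out.pop("date_of_birth")
    visit = out.pop("visit_date")
    # Age in completed years as of the visit date (an assumption).
    had_birthday = (visit.month, visit.day) >= (dob.month, dob.day)
    out["age"] = visit.year - dob.year - (0 if had_birthday else 1)
    out["visit_month"] = visit.strftime("%Y-%m")
    return out


def to_level5(records):
    """Aggregate: the number of individuals with a sexually transmitted disease."""
    return sum(1 for rec in records if rec["has_std"])


level2 = to_level2(record, pseudonym_id=1)
level3 = to_level3(level2)
level4_candidate = to_level4_candidate(level3)
level5 = to_level5([record])
```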
Data Protection
Arguably, it would require less investment and resources to protect levels 4-5 data compared to levels 1-3 data. This is one of the key advantages, from an organizational perspective, of de-identification.
Acknowledgements: the original idea for this model came out of discussions with Craig Earle of the Ontario Institute for Cancer Research, and this is an adaptation & extension of a model he presented at the PHIPA conference in Toronto in the Fall of 2009.
The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.