What We Do

What does the Electronic Health Information Laboratory do?

The Electronic Health Information Laboratory (EHIL) was established in 2005 at the CHEO Research Institute, in collaboration with the University of Ottawa, by its founding director, Dr. Khaled El Emam. The objective of EHIL is to facilitate the sharing of health information for secondary purposes. These secondary purposes include research, public health, data science, comparative effectiveness evaluation, and pharmacovigilance.

The digitization of health information has meant that large amounts of data are readily available for secondary purposes. There is increasing demand for that data from different types of users, including researchers and public health professionals. Privacy legislation and regulations in most jurisdictions allow the disclosure of health information for secondary purposes only if it is non-personal or if patient consent is obtained. In addition to the practical difficulty of obtaining patient consent, there is compelling evidence that consenters and non-consenters differ systematically on important demographic and socio-economic characteristics. Therefore, sharing non-personal information is the most practical way to allow such disclosures.

EHIL develops technology to facilitate health data sharing, including data synthesis methods, de-identification methods, and secure computation methods that allow surveillance or analysis without compromising privacy. The different methods are suitable under different circumstances and constraints, ranging from individual-level data release and ongoing surveillance to interactive remote analysis. EHIL’s research spans:

  • Theoretical work (which consists of developing mathematical models and metrics of re-identification risk),
  • Empirical work (evaluations of our models and metrics through simulations and controlled studies),
  • Applied work (evaluations on large data sets), and
  • Knowledge translation (building software tools, instruments, and education).

Specifically, our work in this area has consisted of:

  • Performing empirical risk assessments on health information “leaks”,
  • Developing metrics to evaluate re-identification risk for clinical and geospatial data,
  • Developing methods and models for the generation of synthetic data,
  • Developing algorithms to de-identify large data sets, including longitudinal, cross-sectional, and free-form text data,
  • Developing secure computation methods and tools that allow sophisticated analysis, disease surveillance and linking of registries without sharing personal information.

Disclosure and Identification

What are the different types of disclosure risks?

There are two general kinds of re-identification risks that are of concern.

The first is when an intruder can assign an identity to any record in the disclosed database. For example, the intruder would be able to determine that record number 7 in the disclosed database belongs to patient Alice Smith. This is called identity disclosure.

The second type of re-identification is when an intruder learns something new about a patient in the database without knowing which specific record belongs to that patient. For example, if all patients from a particular area in an emergency database had a certain test result, then an intruder does not need to know which record belongs to Alice Smith; if she lives in that area, the intruder still discovers sensitive information about her. This is called attribute disclosure.

All known examples of re-identification are identity disclosures. Therefore, a strong case can be made that this is the type of disclosure risk we should focus on and try to manage first.

There are three sub-types of risk under identity-disclosure: (a) prosecutor risk, (b) journalist risk, and (c) marketer risk.

What is a quasi-identifier?

The primary type of disclosure risk that needs to be focused on is identity disclosure. An underlying assumption for this type of risk is that there is an intruder who has two pieces of information: (a) the actual data set that has been disclosed and (b) some background information about one or more people in this data set. The background information is described by a set of variables. These variables are the quasi-identifiers.

Examples of common quasi-identifiers in the context of health information are:

  • dates (such as, birth, death, admission, discharge, visit, and specimen collection),
  • locations (such as, postal codes, hospital names, and regions),
  • race and ethnicity,
  • languages spoken,
  • aboriginal status,
  • profession,
  • gender.

This set of variables may expand if more data are collected about the public and more public registries are made available. For example, in many jurisdictions in the US the voter lists are publicly available. This means that the basic demographics (such as full ZIP code, date of birth, gender, all included in the voter list) are automatically considered quasi-identifiers.

Individuals may also post personal information about themselves on web sites or announce that information to their friends. For example, many new parents announce the exact birth weight of their new child. When we examined birth registries, we found that weight, hospital of birth, and age of mother make most births unique. Therefore, weight is quite a powerful identifier.
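
The birth-registry observation above can be illustrated with a small sketch. The records below are hypothetical toy values, not the registry data: we count how many records share each combination of weight, hospital, and mother's age, and flag the combinations that occur only once.

```python
from collections import Counter

# Hypothetical toy records: (birth_weight_g, hospital, mother_age).
# Illustrative values only -- not the registry data described above.
births = [
    (3250, "Hospital A", 31),
    (3250, "Hospital A", 31),  # shares its combination with the record above
    (2980, "Hospital A", 24),
    (4100, "Hospital B", 29),
    (3410, "Hospital C", 35),
]

# Count how many records share each quasi-identifier combination.
counts = Counter(births)

# A record is unique (and hence most at risk) if its combination occurs once.
unique = sum(1 for r in births if counts[r] == 1)
print(f"{unique}/{len(births)} records are unique on (weight, hospital, mother's age)")
```

In a real registry the same counting is done over millions of records; the higher the proportion of unique combinations, the more powerful the quasi-identifiers.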

It is simply wrong to make a general statement that all variables in a disclosed data set are quasi-identifiers. The reason is that it is often not plausible for an intruder to gain background information about all variables in a data set. For example, using legitimate means in Canada it would be very difficult for an intruder to obtain all of the diagnosis codes for a patient, or a complete set of lab test result values, and use these for re-identification.

One of the skills required in managing re-identification risk is deciding which quasi-identifiers are plausible and need to be considered. This analysis looks at the jurisdiction and the types of information that are publicly available in it.

Therefore, to summarize, a quasi-identifier is a piece of information that an intruder can get hold of about a specific target individual or about a large number of people through the following means:

  • Personal knowledge of the specific target person (e.g., a neighbor, co-worker, ex-spouse).
  • The specific target person is famous and there is information publicly available about them.
  • Publicly available registries (e.g., voter lists and court records) or the media (e.g., obituaries published in newspapers or on-line).
  • Information that individuals post about themselves on the Internet (e.g., information they post on social networking sites).
  • Information that individuals often disclose to a large number of people (e.g., their baby’s birth weight or birth date).

It is also important to remember that it is possible to predict a quasi-identifier from another variable. In this case, both variables must be considered quasi-identifiers. There is no point protecting against a variable A but not a variable B if the intruder can easily predict A from B. Therefore, it is important to search for correlated variables in a data set. Examples of correlated variables are:

  • Date of birth of a baby and date of discharge from a hospital.
  • Date of death and date of an autopsy.
  • Weight at birth and weight of baby at discharge from a hospital.
  • Age and date of graduation.
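
A simple way to search for such pairs is to scan candidate variables for high pairwise correlation. Below is a minimal sketch with hypothetical birth-weight and discharge-weight values; the 0.8 cut-off is illustrative, not a standard.

```python
# Sketch: flag pairs of candidate variables whose values are strongly
# correlated, so that both can be treated as quasi-identifiers.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical values (grams): weight at discharge is roughly birth weight
# minus a small physiological loss, so the two track each other closely.
birth_weight = [3250, 2980, 4100, 3410, 3600]
discharge_weight = [3100, 2850, 3950, 3260, 3440]

r = pearson(birth_weight, discharge_weight)
if abs(r) > 0.8:  # illustrative threshold
    print(f"highly correlated (r = {r:.2f}): treat both as quasi-identifiers")
```

In practice the scan would run over all candidate pairs, and categorical pairs (e.g., date of birth vs. date of discharge for newborns) need an association measure suited to their type rather than Pearson correlation.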

What is the difference between consenters and non-consenters?

We have just completed two large systematic reviews looking at the differences between consenters and non-consenters. The reviews considered clinical trials and observational studies, with primary data collection and secondary use of existing databases. For the clinical trials there were already many systematic reviews, so ours was a systematic review of systematic reviews.

The main question was whether participants who agree to be part of a research study differ in any systematic way from those who refuse to participate. There was compelling evidence that requiring explicit consent for participation in different forms of health research can negatively affect both the process and the outcomes of the research itself:

  • Recruitment rates decline significantly when individuals are asked to consent (opt-in vs. opt-out consent, or explicit consent vs. implied consent).
  • In the context of explicit consent, those who consent tend to differ from those who decline on a number of variables (age, sex, race/ethnicity, marital status, rural versus urban location, education level, socio-economic status and employment, physical and mental functioning, language, religiosity, lifestyle factors, level of social support, and health/disease factors such as diagnosis, disease stage/severity, and mortality), hence potentially introducing bias into the results.
  • Consent requirements increase the cost of conducting the research, and often these additional costs are not covered.
  • The research projects take longer to complete, both because of the additional time and effort needed to obtain consent and because recruitment targets take longer to reach when recruitment rates drop.

The review itself is here. This was an appendix to the following article, which provides the citation as well.

In practice this means that ethics boards need to consider bias as another factor in deciding whether it is appropriate to waive consent. Even if it is possible to obtain consent, would the impact on study quality be sufficient to allow a consent waiver? For example, in a syndromic surveillance study of H1N1 influenza, a bias against, say, pregnant women would be devastating because they are an important group of affected patients.

In the context of secondary use or database studies, one option that should then be considered is to de-identify the data if consent is waived. There is no regulatory requirement to obtain consent for de-identified data and many research ethics boards will not require consent if the data is appropriately de-identified.

What is the difference between prosecutor and journalist risk?

Disclosure risk can be characterized as prosecutor risk or journalist risk. These are just colorful names for two common types of risks. They are similar in that they both pertain to the risk of an intruder re-identifying a single individual in the data set that is being disclosed. But they also make different assumptions about the nature of the attack. More definitional information and examples are provided in the attached document.

With prosecutor risk we assume that the intruder is trying to re-identify the record belonging to a specific person. This specific person is known to the intruder. For example, this specific target person may be the intruder’s neighbor or a famous person. The intruder has some background information about that target, and then uses this background information to search for a matching record in the disclosed database.

If any of the following three conditions is true, then prosecutor risk is a threat:

1. The disclosed dataset represents the whole population (e.g., a population registry) or has a large sampling fraction. If the whole population is being disclosed then the intruder can be certain that the target is in the disclosed data set. A large sampling fraction likewise means that the target is very likely to be in the disclosed data set.

2. The dataset is a sample from a population, but it can be easily determined who is in it. For example, the sample may be a data set from an interview survey conducted in a company, and it is generally known who participated in these interviews because the participants missed half a day of work. In such a case it is known within the company, and to an internal intruder, who is in the disclosed data set.

3. The individuals in the disclosed data set self-reveal that they are part of the sample. For example, subjects in clinical trials do generally inform their family, friends, and even acquaintances that they are participating in a trial. One of those acquaintances may attempt to re-identify one of these self-revealing subjects. Individuals may also disclose information about themselves on their blogs and social networking pages that reveals they are part of a study or a registry. However, it is not always the case that individuals know that their data is in a data set. For example, in studies where consent has been waived, or where patients provide broad authorization for their data or tissue samples to be used in research, the patients may not know that their data is in a specific data set, leaving no opportunity for self-revealing their inclusion.

In the above conditions the term population is used loosely. It does not mean the population of a geographic area, but the group of people who have a specific known characteristic. For example, a data set of all patients with renal cancer for a province would be a renal cancer population registry since everyone with renal cancer would be in that registry. A data set of all patients with a particular disease within a geographic boundary, or that have a particular demographic (e.g., ethnicity, language spoken at home, age group) would be considered a population and therefore the data set would meet criterion (1) above.

If a data set does not meet the above criteria, then you should be concerned about journalist risk and not prosecutor risk (i.e., it is either one or the other, not really both). The distinction between the two types of risk is quite important because the way risk is measured or estimated does differ and there can be a big difference in the risk assessment results based on which type applies.

It is often the case that custodians hold a certain type of data, and therefore, once they have decided that their data falls under prosecutor or journalist risk, they can apply that type of risk assessment moving forward.

What is the relationship between prosecutor, journalist, and marketer risk?

In the context of identity disclosure, we are concerned with managing three kinds of risks: prosecutor risk, journalist risk, and marketer risk. These three types of risks can be measured objectively.

When the risk of re-identification is measured, the following conditions hold true numerically:

  • prosecutor risk will be equal to or larger than journalist risk
  • journalist risk will be equal to or larger than marketer risk
  • prosecutor risk will be equal to or larger than marketer risk

These relationships are important because they mean that if a data custodian manages prosecutor risk (i.e., ensures that prosecutor risk is below some pre-defined threshold), then the journalist risk and the marketer risk are also, by definition, below that threshold. In other words, managing prosecutor risk also manages journalist and marketer risk at the same time, and managing journalist risk also manages marketer risk.

Practically, this means that if, say, prosecutor risk and marketer risk are both relevant for you, all you really need to do is manage prosecutor risk, and marketer risk comes along automatically. This holds as long as the same threshold is used for both kinds of risk. If different thresholds are used, then managing one risk does not necessarily mean that the other is managed.
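
These three risks are commonly measured from the sizes of the equivalence classes formed by the quasi-identifiers. The sketch below uses one standard formulation (maximum risk for prosecutor and journalist, expected match rate for marketer) on toy class sizes to show the ordering described above; the exact estimators used in practice vary.

```python
# Toy equivalence classes on the quasi-identifiers (sex/birth year/region).
# f[j] = class size in the disclosed sample; F[j] = class size in the
# population. Hypothetical values for illustration only.
f = {"male/1975/K1A": 2, "female/1980/K2B": 5, "male/1990/K1C": 3}
F = {"male/1975/K1A": 4, "female/1980/K2B": 20, "male/1990/K1C": 10}

n = sum(f.values())  # number of records in the disclosed sample

# Prosecutor risk (max): the target is known to be in the sample,
# so the smallest sample class drives the risk.
prosecutor = 1 / min(f.values())

# Journalist risk (max): matching is against the population, so the
# smallest *population* class among the disclosed classes drives the risk.
journalist = 1 / min(F[j] for j in f)

# Marketer risk: expected proportion of sample records correctly matched
# when every record in class j is matched correctly with probability 1/F[j].
marketer = sum(f[j] / F[j] for j in f) / n

assert prosecutor >= journalist >= marketer  # the ordering stated above
print(prosecutor, journalist, marketer)
```

Since every population class is at least as large as its sample class (F[j] >= f[j]), and a maximum is at least as large as a weighted average, the ordering holds for any class sizes, not just these toy values.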

What quasi-identifiers should I use for managing journalist risk?

With journalist risk the intruder is not looking for a specific person in the disclosed data set; re-identifying any person will achieve the goal. A classic example is the reporter who is going through a leaked medical database to find someone with a sensitive disease or condition. Once the reporter finds that person, then he’ll try to re-identify that person. The reporter did not have a specific person in mind to start off with.

We assume that the reporter has access to another database that can be used for matching. For example, in the US this database may be a voter list. The voter list will have basic demographics along with the voter’s name and address. The intruder matches the records in this database against the disclosed data set, and correct matches re-identify individuals. The database used for matching is called an identification database.

The quasi-identifiers under journalist risk are those that can be found in the identification database. There are three different types of identification databases:

  • A public database that is available freely and without conditions.
  • A semi-public database that may require a fee to access or that has conditions on its uses.
  • A private database, which is in the possession of the intruder or can be bought from a data broker for a fee.

It is difficult to predict all of the private databases that can exist; therefore we often focus on the public and semi-public ones. Our research has documented the quasi-identifiers that are available in Canada through public sources and that can be used for re-identification. These variables include:

  • Date of birth and date of death.
  • Profession.
  • Home address and telephone number.
  • Type of dwelling. This information can be obtained in aggregate from Statistics Canada or by looking at Google Maps.
  • Gender: If an identification database does not have that, genderizing software can predict it from the first name.
  • Ethnicity: If an identification database does not have that, ethnically sensitive genderizing software can predict it from the first and last names.
  • Incomes for highly paid civil servants.

In Canada the voter lists are not publicly available. Therefore only specific people are really at risk of re-identification through a journalist type attack using public sources of information: (a) homeowners, (b) members of a profession which publishes its membership lists, and (c) civil servants. Examples of specific public sources are:

  • Obituaries. These are available from newspapers, funeral homes, specialized tombstone sites, and obituary aggregation sites.
  • The Personal Property Security Registration (PPSR) database. This is available from provincial governments, either directly on-line or through a local agent.
  • Land Registry. This contains information on house ownership.
  • Professional membership lists, for example, for doctors and lawyers.
  • Salary disclosure reports from governments.
  • White pages. These do not include cell phone numbers (yet).
  • On-line CVs. Job sites provide a lot of very detailed information. Individuals also post CVs on their personal sites and pages.
  • Donations. These include donations to political parties which are disclosed by Elections Canada on their web site.

Some of the above data sources become useful as identification databases when they are linked together rather than used on their own.

It should also be kept in mind that many public data sources charge a fee, and creating a meaningful identification database can be quite expensive. This is a deterrent against journalist-type re-identification unless the re-identification promises even higher returns. As an example, searching the PPSR in Ontario costs $8 per search. Therefore, building an identification database for active Ontario physicians (~23,000) would cost $184,000. This makes it important to consider the plausibility of creating certain types of identification databases when evaluating journalist risk. For many intruders, this kind of expense would not be worth it.
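
The deterrence arithmetic behind that estimate is simply the per-search fee multiplied by the number of target records:

```python
# Cost of building an identification database from per-search fees,
# using the figures cited above.
fee_per_search = 8          # dollars per PPSR search in Ontario
active_physicians = 23_000  # approximate number of active Ontario physicians
total = fee_per_search * active_physicians
print(f"${total:,}")  # $184,000
```

The same calculation applies to any fee-based source: the plausibility of an attack scales inversely with the number of records the intruder must pay to look up.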

What quasi-identifiers should I use for managing prosecutor risk?

If you are trying to manage prosecutor risk, then you assume that the intruder has a specific target person in mind and is trying to re-identify that person’s records in the disclosed data set. The intruder is also able to get some background information about that target person. This type of background information represents the quasi-identifiers you are interested in. The type of background information we assume depends on the intruder. The types of intruder we usually consider when managing prosecutor risk are: (a) a neighbor, (b) an ex-spouse, (c) an employer or colleague at work, (d) a relative, and (e) a stalker.

In a worst case scenario, a neighbor would know:

  • Address and telephone information about the target individual
  • Household and dwelling information (number of children, value of property, type of property)
  • Key dates (births, deaths, weddings, admissions, discharges)
  • Visible characteristics: gender, race, ethnicity, language spoken at home, weight, height, physical disabilities
  • Profession

Of course, not all neighbors are friendly or nosy, and therefore, a particular neighbor may not know all of the above things. But these are plausible things that a neighbor would know by observing and casually interacting with the target individual and their family.

What an ex-spouse would know includes:

  • The same things that a neighbor would know
  • Basic medical history (allergies, chronic diseases)
  • Income, years of schooling

An employer or relative would generally know less than the above two.

A stalker could be after a famous person or an estranged spouse or boy/girlfriend. The quasi-identifiers that a stalker would know are the same as an ex-spouse’s in the latter case, and whatever information is publicly available about the famous person in the former case. We generally assume that an ex-spouse has the most background information.

If any of the above variables exist in the disclosed data set, then you should take them into account in the re-identification risk analysis.

But we also have to be pragmatic. For example, there are no easy ways to de-identify diagnosis codes. Therefore, if they exist in the disclosed data set and represent a high re-identification risk, then that risk may be better mitigated using a data sharing agreement and audits (see our risk assessment methodology) rather than through de-identification.

Which variables can be used to identify an individual?

It is important to be able to categorize variables in a data set according to their role in re-identification because it helps us reason about risk. Below is one categorization that we have found useful.

The specific scenario we are looking at is that of a data custodian disclosing health information for secondary purposes. The data custodian needs to be able to assess the identifiability of the data. We assume that the data in question has variables and that each record pertains to an individual. This individual can be a customer, an employee, or a patient. The variables can be divided into the following groups:

Directly Identifying Variables. One or more of these variables can be used to uniquely identify an individual directly either by themselves or in combination with other readily available information.

For example, a person’s name is a directly identifying variable. A person’s name does not necessarily identify an individual uniquely, however. There are more than 280 men named “John Smith” in Ontario, so in this case the full name by itself cannot identify the individual. By combining the name with another directly identifying variable, such as a phone number, an individual can be identified uniquely. There will be exceptions, of course, because some names are very rare.

As another example, consider an email address. An email address uniquely identifies a person because it is almost always the case that a person is assigned a unique email address. An email address “john.smith@myco.com” will pertain to a single individual working at MyCo. Again, there will be exceptions: an email such as “banana55@hotmail.com” does not tell you much about an individual’s identity.

Some directly identifying variables can only be used for re-identification in conjunction with other information. For a variable to be directly identifying, this other information has to be readily available, for example, the telephone white pages, a Google search, or access to a company database. For example, a Social Security Number (SSN) is directly identifying if an intruder has a database containing SSNs. If an organization uses SSNs as unique identifiers for their clients and the intruder has access to that database of SSNs, then in that case an SSN is a directly identifying variable.

Other examples of directly identifying variables include telephone number, health insurance card number, credit card number, and social insurance number.

Therefore, whether a variable is a directly identifying variable will depend on the context. It will depend on what other information an intruder would plausibly have ready access to and it will vary by record. Since we need to make statements about the data, we will say that if for some of the records a variable is a directly identifying variable then we will treat it as a directly identifying variable for all of the records.

Quasi-identifiers. One or more quasi-identifiers can be used to probabilistically identify an individual, either by themselves or in combination with other available information. This definition sounds quite similar to the one above, but there are two differences: quasi-identifiers do not necessarily make an individual unique, and the auxiliary information needed for re-identification may not be readily available.

For example, it is more difficult for an intruder to know an individual’s exact date of birth or exact date of admission to a hospital compared to knowing an individual’s name. Put another way, people are more willing to divulge their name and email address, say, than they are willing to divulge their date of birth and postal code. Therefore, it takes more effort, time, money, and skill to develop the background information to re-identify individuals using quasi-identifiers compared to direct identifiers.

A key condition for a variable to be a quasi-identifier is that it must be plausible for an intruder to obtain background information, on that variable, about the individuals in the disclosed data set.

Examples of quasi-identifiers include sex, date of birth or age, geocodes (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, aboriginal identity, total years of schooling, marital status, criminal history, total income, visible minority status, activity difficulties/reductions, profession, event dates (such as admission, discharge, procedure, death, specimen collection, visit/encounter), codes (such as diagnosis codes, procedure codes, and adverse event codes), country of birth, birth weight, and birth plurality.
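
Individually, none of these variables identifies a person, but combinations shrink the groups of records that share the same values. Below is a minimal sketch on hypothetical records, using the maximum risk (one over the smallest group size) as the measure, to show how the risk climbs as quasi-identifiers are combined.

```python
from collections import Counter

# Hypothetical records: (sex, year_of_birth, postal_prefix).
records = [
    ("F", 1980, "K1A"), ("F", 1980, "K2B"),
    ("F", 1985, "K1A"), ("F", 1985, "K1A"), ("F", 1985, "K1A"),
    ("M", 1975, "K1A"), ("M", 1975, "K1A"), ("M", 1975, "K2B"),
]

def max_risk(rows, cols):
    """Maximum re-identification risk: 1 / smallest group size on `cols`."""
    groups = Counter(tuple(r[c] for c in cols) for r in rows)
    return 1 / min(groups.values())

# Risk never decreases as more quasi-identifiers are combined.
for cols in [(0,), (0, 1), (0, 1, 2)]:
    print(cols, max_risk(records, cols))
```

On these toy records the maximum risk rises from 1/3 (sex alone) to 1/2 (sex and birth year) to 1 (all three variables), at which point at least one record is unique.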

Sensitive variables. These may be variables that characterize, say, the financial or health status of an individual. It is information that is sensitive and that individuals would normally consider private. If there is no sensitive information in a data set then it is arguable whether there is anything to protect.

It is important to recognize that sensitivity is relative. For example, is “flu” sensitive? If an intruder re-identified a record in a disclosed database and found out that they have the flu, does that matter? This is a complicated question that has to do with harm, but a simple consideration is that an individual patient may feel violated by the fact that their health information could be re-identified and this may have an impact on their behaviour and level of trust in the custodian, even if the sensitivity of the information that was re-identified is low.

It should also be noted that the distinctions above will vary depending on the data set and the context. In one instance a variable may be a quasi-identifier and in another case it may be a sensitive variable. For example, the diagnosis code can be a quasi-identifier if there is a plausible way that an intruder can get background information about an individual’s diagnosis code and then it can be used for re-identification; otherwise it would be a sensitive variable.

In some contexts a variable may be a directly identifying variable, and in others it may be a quasi-identifier. For example, a health insurance card number is directly identifying if the intruder has access to a database of patients and their health insurance card numbers; if the intruder does not have access to such a database, then it is not a direct identifier. In practice, health insurance card numbers are useful for fraudulent purposes and therefore should not be disclosed anyway, but the primary driver for withholding them may not necessarily be the risk of re-identification.

Who cares about my medical records?

One question that is sometimes posed is “why would anyone want to re-identify my records?” The argument goes that if the medical records have no value to someone else, then why would anyone bother getting access to and re-identifying them?

Some medical records have financial information in them (e.g., information used for billing purposes) or information that is useful for financial fraud, for example, date of birth, address, and mother’s maiden name. In some cases in the US the medical record may contain SSNs (which are often used as a unique identifier). All of this information is useful for committing financial crimes. In general, your own identity information is not worth that much in the underground market. This tool can show the true monetary value of your personal data.

Therefore, medical records with such information are only worthwhile for someone to obtain and use in large quantities. This means that intruders would be interested in databases of records rather than going after an individual’s records.

If an intruder gets a poorly de-identified database with many records and it is plausible to correctly re-identify many patients in it, then the financial incentive may result in the intruder performing this re-identification.

Even if medical records do not contain information suitable for financial fraud, a record with information about your health insurance can be very valuable. Medical identity theft entails someone getting health care in your name. This is most likely to happen when a person has no insurance, either because they cannot afford it or because they cannot get it (illegal aliens, or individuals running from the law). A good example of that happening in Canada was described by Joe Pendleton in his presentation here.

The basic scenario is that Americans who cannot afford certain procedures, or are unable to get insurance, buy Canadian identities with health coverage and come to Canada to have these procedures done. Canadian health insurance numbers are also useful for illegal aliens who cannot obtain coverage legitimately under their own identity.

If you ever become of interest to the media and they want to do a story on you or your family, then reporters may be interested in re-identifying records about you. An example of this happening (as documented in court records) is the CBC re-identifying a patient who died while taking an acne drug, by matching Health Canada’s adverse drug reaction data with obituaries. In this case the CBC wanted to do a story about the drug, and the 26-year-old woman’s death was central to the message that the drug was harmful, so they needed to contact the family. They found multiple matches in the obituaries and contacted them all until the right family was found and they had their story. Moreover, in discussions with various members of the media, my understanding is that when they place a radio or newspaper ad asking people to tell them about certain events, many members of the public reveal themselves to the media or tell them about their neighbors and family. Therefore, from a re-identification perspective, the media can easily gather background information about you by getting people who know you to provide it.

Medical records are a good source of revenue if you are in the extortion business. One example is Express Scripts, which lost a large database of customer data. Here is the initial news story. Basically, the company was using production data for software testing, and there was a breach on the testing side of the business. Unfortunately, using production data for testing is common. The initial extortion attempt was based on a breach of 75 records; it later turned out that 700,000 individuals may have been affected. In this case it is not clear how much the extortionists requested.

In another recent extortion attempt involving medical records, the extortionist requested $10 million.

Even if there is no financial impact, some people feel violated when the privacy of their medical information is breached, and they respond by adopting privacy-protective behaviours. These include:

  • not seeking care,
  • lying to their doctor so as not to reveal embarrassing or sensitive information,
  • seeing multiple doctors so that no one physician has a complete record,
  • paying out of pocket so that insurers have no record of a particular encounter, procedure, or prescription,
  • self-treatment, where individuals treat or medicate themselves rather than seeking care, and
  • asking the doctor not to record certain pieces of information, or to record different information (and many physicians admit to doing this).

When your physician is asked (with your consent) to provide your medical records to an insurance company, most of the time the physician will send everything (i.e., they will not have time to remove information or to select only the pieces the insurance company requires). There is also evidence that the most vulnerable people adopt these kinds of behaviours: teens, battered women, people with or at risk of HIV, and people with genetic conditions. As a sad demonstration of what people do to protect their privacy, here is an article from the New York Times.

There are a number of attempts to make health information publicly (or at least very widely) available. This is particularly true for research data. In the introduction to this article, we provide a review of the various initiatives. One argument made in support of these efforts is that data collected using public funds should be made available to maximize the return from the initial investment, and making data widely available means many more people can analyze it and discover new things from it. To the extent that this becomes the case, your health information may be more widely available if you participate in research initiatives. If that data is not properly de-identified then the chances of re-identifying your records would increase.

There is increasing interest by data custodians to package data, de-identify it in some way, and sell it. Here are a few examples:

  • Some vendors are providing Electronic Medical Records (EMRs) for free to practitioners, but then selling the data to generate revenue: read more here and here.
  • Some providers have seriously considered selling, are planning to sell, or are already selling data about their patients. This is done directly or by creating subsidiary companies responsible for the commercialization of the data. Examples include the Cleveland Clinic and the Geisinger Health System.

The problem is that it is not clear whether this de-identification is sufficiently robust, or whether these organizations have used de-identification best practices. In the examples cited above the organizations have not been forthcoming with details about how they de-identified their data, which amplifies patient concerns about how their health information is being used. This lack of transparency is compounded by the fact that many patients would not know that their health information is being sold.

If you have enemies they may be interested in re-identifying your records and finding something sensitive about you.

For all of these reasons you should care about the secondary use of health information. As more data is collected electronically, the number of medical data breaches and the number of records affected will, unfortunately, only get larger.

Another version of the above reasoning can be found in the Cutter Consortium report entitled “Managing Privacy Risks through Data Anonymization”, targeted specifically at a CIO audience.

We have also written an article highlighting the risks to personal health information.

See also: 2009 – Public Accountability and Transparency – The Missing Piece

Possible Identifiers

Are Canadians identifiable by their age, gender, and residence?

For many studies, the combination of age, gender, and residence Forward Sortation Area (FSA) is collected. (The residence FSA in Canada is the first three alphanumeric characters of the postal code.) These three variables are also included in many data sets that are disclosed. Does that represent a privacy risk?

In one of our studies we analyzed the 2001 Canadian census, and this was one of the questions we attempted to answer.

Our conclusion was that only a small proportion of Canadians are unique on these three variables (we use uniqueness as a measure of re-identification risk). There is variation across the country, with the largest percentage unique in New Brunswick. Here is a table showing the percentage of the population unique on these three variables:

Province  Percentage of Population Unique
 Alberta  16%
 British Columbia  13%
 Manitoba  12%
 New Brunswick  49%
 Newfoundland  17%
 Nova Scotia  18%
 Ontario  9%
 PEI  10%
 Quebec  16%
 Saskatchewan  7%

An important question is whether these numbers are too high. Note also, as a caveat, that these are estimates of uniqueness.
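As a rough sketch of how such a uniqueness estimate can be computed, the following counts the proportion of records whose combination of quasi-identifier values occurs exactly once in the data. The field names and toy records are illustrative, not the census methodology used in our study:

```python
from collections import Counter

def percent_unique(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination appears exactly once."""
    key = lambda r: tuple(r[q] for q in quasi_identifiers)
    counts = Counter(key(r) for r in records)
    return 100.0 * sum(1 for r in records if counts[key(r)] == 1) / len(records)

# Toy data standing in for census microdata.
records = [
    {"age": 30, "gender": "F", "fsa": "K1A"},
    {"age": 30, "gender": "F", "fsa": "K1A"},  # duplicate combination -> not unique
    {"age": 45, "gender": "M", "fsa": "E1A"},  # unique combination
]
risk = percent_unique(records, ["age", "gender", "fsa"])  # 1 of 3 records unique
```

Applied to real microdata, the same counting logic yields province-level percentages like those in the table above.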

By most standards these numbers would be considered high. One solution is to generalize age into two-, five-, or ten-year intervals, for example, while keeping the FSA intact. The percentage of the population unique under each of these generalizations is as follows:

Province  % Unique (2-Year Age Interval)  % Unique (5-Year Age Interval)  % Unique (10-Year Age Interval)
 Alberta  8%  4%  2%
 British Columbia  7%  1%  1%
 Manitoba  8%  5%  2%
 New Brunswick  41%  30%  25%
 Newfoundland  9%  5%  2%
 Nova Scotia  14%  7%  5%
 Ontario  4%  2%  1%
 PEI  3%  3%  3%
 Quebec  9%  4%  1%
 Saskatchewan  3%  3%  2%

Based on these results, we can say with some confidence that uniqueness is quite low with 10-year age intervals when gender and FSA are also collected/disclosed. The exception is New Brunswick, where uniqueness remains quite high even at a 10-year age interval. Where the custodian is comfortable with the percentage of uniques at the 5-year or even the 2-year age interval (for example, because other actions are taken to manage re-identification risk), data may be disclosed on that basis. This recommendation is based on the best available evidence today, and the percent uniques presented above should be seen as ceiling values on risk (i.e., they are conservative values). Using them means that you are being extra cautious.
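The effect of generalizing age can be illustrated with a minimal sketch (the data here is invented, not the census figures): with exact ages every toy record is unique, while 10-year intervals collapse some records into the same equivalence class:

```python
from collections import Counter

def generalize_age(age, interval):
    """Map an exact age to the lower bound of its interval, e.g. 37 -> 30 for 10-year intervals."""
    return (age // interval) * interval

def percent_unique(rows):
    """Share of rows whose exact combination of values appears only once."""
    counts = Counter(rows)
    return 100.0 * sum(1 for r in rows if counts[r] == 1) / len(rows)

# Toy (age, gender, FSA) records.
people = [(34, "F", "K1A"), (36, "F", "K1A"), (52, "M", "K1A"), (53, "M", "K2P")]

exact = percent_unique(people)  # all four records are unique on exact age
coarse = percent_unique([(generalize_age(a, 10), g, f) for a, g, f in people])
# The first two records both become (30, "F", "K1A"), so only the last two remain unique.
```

Wider intervals collapse more records together, which is why the percentage of uniques drops as the table above moves from 2-year to 10-year intervals.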

Also note that there is considerable active research on this issue. Therefore, it is plausible that we will provide more updated and precise guidance on the disclosure of these demographics in the future.

Can a person be re-identified from their diagnosis code?

In many discussions about re-identification risk and de-identification the focus is on demographic variables. But many data sets also include diagnosis codes (for example, ICD-10 codes). We will answer the question on whether these can be used for re-identification by going through a number of scenarios. In all of these scenarios (below) we assume that the disclosed data set has demographics as well as at least one diagnosis code.

In the US, hospital discharge abstract data is publicly available. Also, many states make their voter lists available for free or for a modest fee. By linking the demographic information in the discharge abstracts with that in the voter lists, it is possible to construct an identification database containing names, addresses, dates of birth, gender, and diagnosis codes for patients. With that kind of background information, it is then possible to match the diagnosis codes and demographics with the same information in any data set that is disclosed and that has diagnosis codes. In this case, the diagnosis code effectively becomes yet another quasi-identifier. For example, if a hospital makes a data set of patients available for research, an intruder can create the abovementioned identification database and match against the research data set. This is a classic example of a journalist-type re-identification attack.
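The linking step in this journalist-type attack can be sketched as a simple join on the shared quasi-identifiers. All names, records, and field names below are invented for illustration:

```python
# Invented records for illustration only.
voter_list = [
    {"name": "A. Smith", "dob": "1961-04-02", "gender": "F", "zip": "62901"},
    {"name": "B. Jones", "dob": "1974-09-15", "gender": "M", "zip": "62901"},
]
discharges = [  # "de-identified" discharge abstracts: no names, but quasi-identifiers remain
    {"dob": "1961-04-02", "gender": "F", "zip": "62901", "diagnosis": "E11"},
]

def link(discharges, voters, qi=("dob", "gender", "zip")):
    """Return (name, diagnosis) pairs where a discharge record matches exactly one voter."""
    out = []
    for d in discharges:
        matches = [v for v in voters if all(v[k] == d[k] for k in qi)]
        if len(matches) == 1:  # a unique match is a candidate re-identification
            out.append((matches[0]["name"], d["diagnosis"]))
    return out
```

Once a unique match is found, any diagnosis codes attached to the discharge record become background knowledge for matching against other disclosed data sets.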

The inclusion of diagnosis codes makes the probability of correct re-identification much higher because each person will have more than one (in fact, many) diagnosis codes included in their discharge abstracts. A set of diagnosis codes can make an individual unique.

The underlying assumption with the above scenario is that the disclosed data set is for a particular institution and the intruder has created an identification database using public information for that same institution. The intruder then does the matching on institution-specific data sets. It is not clear how high the risk would be if there was no institutional information.

The above scenario would not happen in Canada because discharge abstract data is not easily available, and the organization which releases this information on a national basis, the Canadian Institute for Health Information, does implement disclosure control on that data and limits who gets access to it. Furthermore, voter lists are not readily available in Canada. It does not mean that journalist risk for discharge abstracts does not exist; only that it is very low and probably not the highest priority risk to focus on.

Another scenario is when some of the records in the disclosed data set have diagnosis codes for rare and visible diseases/conditions. If the data set has location information as well, such as the postal code or town where the patient lives, then having a diagnosis code for a rare and visible disease/condition means that it would be relatively easy to find the patient. A reporter or investigator can ask local people in that geography if they know a person who has the visible characteristics of the disease/disorder. Since it is rare, the individuals who have the disease/condition will stand out. This type of re-identification makes sense if the disclosed data set has many diagnosis codes and/or other sensitive information, otherwise an intruder would not discover something new by re-identifying the record.

The third scenario we will consider is when a diagnosis code can be associated with a genetic marker. For example, if a patient is diagnosed with Huntington’s disease, there is a clear genetic marker for that. An intruder who gets the DNA of a patient and determines that they have Huntington’s would then be able to use that as a diagnosis quasi-identifier to re-identify the individual in the disclosed data set. This scenario, however, assumes that the intruder has a specific patient in mind and has that patient’s DNA. The disclosed data set must also contain additional sensitive information beyond the fact that the patient has Huntington’s otherwise the intruder has not learned anything new.

Therefore, the answer to the question of whether diagnosis codes can be used to re-identify a patient depends on the circumstances described above.

Can a voter list be used for re-identification?

A lot of literature makes the point that voter lists can be used for re-identification. However, the accuracy of this statement will depend on your jurisdiction.

In the US many states make their voter lists available for free or for a small fee. Often there are few restrictions on what that information can be used for. The voter list would contain the name and address of voters, as well as their gender and date of birth. Some will contain additional information such as political affiliation.

The situation is quite different in Canada. It is not easy to get voter lists in Canada, and legally they can only be used for elections and election-related activities. However, there have been cases of volunteers in election campaigns, who use voter lists for canvassing, keeping the lists afterwards and making them available to third parties. This is exemplified by the following articles on how a charity allegedly supporting a terrorist organization obtained voter lists from volunteers:

  • Bell S. Alleged LTTE front had voter lists. National Post. July 22, 2006.
  • Bell S. Privacy chief probes how group got voter lists. National Post. July 25, 2006.
  • Freeze C, Clark C. Voters lists ‘most disturbing’ items seized in Tamil raids, documents say. Globe and Mail, May 7, 2008.

Often volunteers do not have to sign confidentiality agreements, although organizations like Elections Canada are trying to tighten this process up.

Another way to get access to a voter list in Canada, instead of volunteering or chasing down volunteers, is to become a candidate yourself. This is quite easy to do. For example, becoming a provincial election candidate costs $500 in Alberta, $100 in BC, $100 plus nominations by 25 electors in New Brunswick, $100 in Ontario, and $0 in Quebec (though it requires nominations by 100 electors). Legally, however, once a candidate obtains the voter list it cannot be used for re-identification purposes.

Another important point is that Canadian voter lists do not contain the date of birth, which makes them of limited value for re-identification by themselves. However, when combined with other sources of public information they can still be very useful for re-identification even without the date of birth.


Can individuals be re-identified from disease maps?

Increasingly, public health units, the media, and researchers are publishing or posting maps on the web showing locations of individuals with particular diseases. Do these maps represent a high re-identification risk?

There have been studies showing that published maps which contain point locations of individuals or households with a particular disease can be reverse engineered to determine the original locations, even if the published map is low resolution and certain landmark and geographical features are removed. Therefore, as a starting point, the risk of re-identification would be high if individual points are published. One can perturb the points before publication rather than publish the originals, but we’ll leave perturbation techniques for another article.

In some cases prevalence rates for a particular area are published. Good examples are the disease maps published by The Toronto Star for various sexually transmitted diseases, with the rates reported per FSA.

Do these maps risk identifying any of the individuals? There are three questions that need to be answered to determine the risk:

  • Is the disease visible?
  • Is the disease rare in the geography?
  • If I re-identify an individual, will I learn something new about them?

If the disease is not visible then there is really little risk, because there are no plausible scenarios for going from a prevalence rate for an FSA to an individual. Consider infectious syphilis: the first sign usually appears 2 to 10 weeks after exposure, when a red, oval sore called a chancre develops at the site where the bacteria entered the body. Chancres can appear on the mouth, hands, and most commonly the genitals. If they appear on the mouth or hands, an argument can be made that the disease is visible. For HIV, facial muscle wasting would be quite visible as well (see the pictures here).

Rareness of a disease can be defined as a prevalence of less than 10 in 10,000. Therefore, for any FSA in those maps where the prevalence is greater than 1 in 1,000, we would consider the disease not rare (e.g., the FSA “M4Y” has a prevalence rate higher than 1 in 1,000 for infectious syphilis). In general, however, we can see that within the FSAs most of the diseases are quite rare by this definition.

This definition of rareness is different from the one used by statistical agencies. For example, if the prevalence is less than 0.5% then statistical agencies often consider that rare and suppress those records or apply some other disclosure control action. This is why high age values are usually top-coded at 90 years old: in Canada about 0.5% of the population is older than 90.

For our purposes, we will use the 10 in 10,000 definition, which is more conservative.

Therefore, if we take, say, HIV, it would be rare and visible in the FSA “M2L”. Now let us consider a re-identification scenario. An intruder goes to “M2L” with a picture of facial wasting and asks the people living there if they know or have seen someone who matches these characteristics. A neighbor might then say that Bob looks like that. In that case the neighbor would learn that Bob has HIV, and the reporter would have found a person with HIV.

The local newspaper can publish an article on facial wasting or the neighbor may read an article on facial wasting in a book and realize that Bob probably has HIV. One can then argue that the neighbor learned something new from generally available information, and that helped him recognize a visible condition that Bob has. In that case the published map information has no bearing on the neighbour’s recognition that Bob has facial wasting.

The reporter can also walk down the street until he finds Bob (someone with facial wasting) and have no neighbor involved. In this case, the existence of the prevalence map indicated that there is at least one person in that FSA. Therefore, if the map was binary (zero and greater than zero), that would be all the prevalence information needed to encourage the identification of Bob. The reason is that the reporter would not bother going into “M2L” if the prevalence was known to be zero. However, the map created an incentive but did not provide a link to Bob’s identity.

On the other hand, if the reporter has another database which has, say, financial information on people with HIV living in “M2L”, then figuring out which record belongs to Bob means that the reporter will learn something new about Bob. The geographic specificity, the rareness and the visibility of the disease make it easier to correctly link Bob to his record in that database.

Therefore, whether something new can be learned depends on whether there exists a database that has geographic specificity on individuals with that rare and visible disease, and this database contains additional sensitive information about those individuals. If these conditions are met, then a stronger case can be made that the disease maps can reveal something new about the affected individuals.

Another scenario is when the prevalence rate is very high. Say 900 in 1,000 people in a particular area have a condition. This would mean that there is a very high probability that everyone who lives in that area has the condition. Anyone looking at the map would learn something new and personal about the people living in that area. The people living there would not be re-identified per se, but sensitive information about them would be disclosed through the map.

To summarize then, here are the two scenarios where a disclosure risk is plausible and potentially high:

  • The disease/condition is rare (low prevalence), it is visible, and there is a database with other sensitive information in it.
  • The disease/condition is so common (very high prevalence) that almost everyone who lives in that area has it.
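These two scenarios can be restated as a crude decision rule. The thresholds below (10 in 10,000 for "rare", 90% for "almost everyone") are taken from or suggested by the discussion above, but the function itself is only an illustrative sketch, not a formal risk assessment:

```python
def map_disclosure_risk(prevalence, visible, auxiliary_db):
    """Assess the disclosure risk of publishing an area-level prevalence rate.

    prevalence   -- fraction of the area's population with the condition
    visible      -- whether the condition is outwardly recognizable
    auxiliary_db -- whether a database with additional sensitive information
                    on affected individuals in that geography exists
    """
    if prevalence >= 0.9:
        return "high"  # nearly everyone in the area has the condition
    if prevalence < 10 / 10_000 and visible and auxiliary_db:
        return "high"  # rare + visible + linkable auxiliary data
    return "lower"
```

A real assessment would also weigh the size of the geography and any other quasi-identifiers released alongside the map.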
Can postal codes re-identify individuals?

Postal codes are the smallest geographic unit used by Canada Post to deliver mail. In a health care context they are the most common geographic unit because the postal code is what patients know and are able to provide, and therefore it is often collected.

The re-identification scenario that is relevant here is where there is a data set which is being disclosed, and this data set contains the postal codes of individuals and some sensitive information, say an indicator of whether a person has a sexually transmitted disease. The postal code is the only demographic information that is being disclosed in this data set. Does this represent a high re-identification risk?

The median number of people who live in a postal code is quite small, as shown in the table below. This uses 2006 census data. The first observation is the wide variation in the postal code sizes within provinces and across provinces.

For example, if 20 people live in a postal code, does that represent a re-identification risk? If that is the only demographic information available, then an intruder would have to guess, and the probability of correctly guessing that a record belongs to a particular person would be 1 in 20. By most standards used for managing re-identification risk, this would be considered a small number.

Province/Territory # Postal codes Min 25th Percentile Median 75th Percentile Max
 AB  77,348  1  5  24  50 7,084
 BC  113,222  1  6  19  40  13,537
 MB  24,015  1  6  25  49  6,298
 NB  57,389  1  3  8  17  1,971
 NL  10,376  2  7  18  39  5,506
 NS  25,332  1  5  13  29  8,983
 NU/NWT  535  2  14  33  82  5,794
 ON  270,277  1  7  21  47  17,165
 PE  3,165  2  5  12  26  8,327
 QC  203,637  1  5  17  39  12,635
 SK  21,563  1  6  22  36  6,939
 YT  935  2  2  12  33  2,107

But if we look at that table again, we see that 25% of the postal codes in New Brunswick have a population of 3 or less. Guessing that a person matches a record with a success probability of 1 in 3 is quite high and would be considered a high re-identification risk. And this high risk applying to 25% of the postal codes would be a problem. Similarly, 25% of the postal codes in Alberta have a population of 5 or less. The smallest postal codes in all provinces and territories have very few people living there. Any information about the postal code would pertain to a very small number of individuals.

Therefore, whether a postal code by itself represents a high re-identification risk will depend on where in the country one is located. Some postal codes have very few people living in them; in those cases, knowing the postal code narrows the options down to so few individuals that there is a good chance a guess will be correct.
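Under this simple model, the probability of a correct guess is just one over the number of residents, so small postal codes can be flagged directly. The postal codes and counts below are invented; the five-or-fewer cut-off is one common choice:

```python
def reid_risk(residents):
    """Probability of correctly guessing which resident a record belongs to
    when the postal code is the only quasi-identifier disclosed."""
    return 1.0 / residents

def small_postal_codes(populations, max_size=5):
    """Flag postal codes small enough to be considered high risk
    (five or fewer residents)."""
    return sorted(code for code, n in populations.items() if n <= max_size)

# Invented counts for illustration.
populations = {"E1A 1A1": 3, "K1A 0B1": 24, "T5K 2J5": 5}
```

A custodian applying this check to the census counts in the table above would find that a substantial fraction of postal codes in some provinces falls below the cut-off.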

At the other extreme, if we have a data set where everyone in a postal code (or a large proportion, say 90% of the people living in that postal code) have a sexually transmitted disease, then it is not necessary to know which record in the data set pertains to a particular individual because any individual living in that postal code will very likely have the disease.

Therefore, to summarize, disclosing a data set with only the postal code and some sensitive information can have a high re-identification risk if:

  • The postal code has very few people living in it. Few here would typically be defined as five or less. There are postal codes with only a handful of people. In some provinces and territories the percentage of such small postal codes is quite high.
  • If the disclosed data set pertains to a postal code with many people living in it and the disclosed data set indicates that the majority of these people have the same condition or disease. In that case we can still draw sensitive conclusions about the individuals in that postal code without actually re-identifying their record in the disclosed data set.
How can an individual re-identify their own data?

One question that sometimes comes up is whether a data set can be considered identifiable if a person can find their own record(s) in there.  This definition can be analyzed from a number of different perspectives.

A person may not know whether they are in a data set if the data set is a sample. For example, if a data set is based on chart reviews of a random subset of patients at a clinic, then a given patient at that clinic will not necessarily know whether they are in the data set created from the chart review. This uncertainty means that the above definition of identifiability may not be appropriate: if a patient finds a record that matches their own characteristics, they will not know whether that record really belongs to them or to someone else.

As a caveat, if the clinic is small then the chances of another patient having exactly the same characteristics would also be small. Also, if the number of variables extracted from the charts is large, then it is less likely that there would be another patient at the clinic who is similar on all of the variables.
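The caveat can be made concrete with a toy model: if each of the other patients at the clinic matches your recorded characteristics independently with some small probability, the chance that nobody else matches grows as the clinic shrinks or as more variables (and hence a smaller match probability) are recorded. The independence assumption is our illustration, not a claim from the article:

```python
def prob_no_other_match(n_patients, p_match):
    """Chance that none of the other n_patients - 1 patients shares your exact
    combination of recorded values, assuming each matches independently with
    probability p_match (an illustrative simplification)."""
    return (1.0 - p_match) ** (n_patients - 1)

# Few recorded variables (1-in-100 match chance) vs. many variables (1-in-10,000):
few_vars = prob_no_other_match(50, 0.01)
many_vars = prob_no_other_match(50, 0.0001)
```

With many recorded variables the match probability per patient is tiny, so a matching record is very likely to actually be yours.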

Another scenario is if a patient does know that s/he is in a data set. There are a couple of ways that a patient can know that their record is in that data set:

If the patient knows that they are unique in the population and they find a match in the chart review sample, then they can be confident that their own record has been discovered. But this assumes that the patient knows they are unique in the population. There are circumstances where that knowledge is reasonable. For example, Canadians living in urban areas are unique on their date of birth and residence postal code. A patient who finds a match on these two variables in the chart review sample can therefore be almost certain that it is their own record.

If the data set is not a sample but a whole population, as in a population registry (for example, a provincial cancer registry), then the patient will know for sure that they are in the data set. If the patient finds a single matching record, then they will know it is their own.

Let us assume that one of the above two conditions is true. In that case, is the fact that the patient can find their own record a workable definition of an identifiable data set?

If we accept the above definition then we are setting a high standard. A common way we model an intruder is to consider the kind of background knowledge the intruder would have about the target person (or persons) being re-identified. The more background information the intruder has the greater the re-identification risk. A person will have the maximum possible background information about themselves (i.e., if the intruder is also the target person being re-identified); much more than any other intruder would know. It is true that many people tell their friends and family many things, but they do not tell them absolutely everything. Therefore, the background knowledge of a person about themselves represents the maximum possible background information and therefore the maximum possible risk. If one wants to be conservative, then this is a good approach. But in many cases assuming that an intruder will know absolutely everything does not seem very plausible and sets quite a high standard. In fact, the standard would be so high that we would not be able to share any information at all unless:

  • the data set disclosed is a random sample so that an individual would not know if their record is within the data set (i.e., no population registry could be considered de-identified almost by definition),
  • the sample data set does not include many variables so that there would be other individuals with the same characteristics in the population (e.g., the clinic example mentioned above), and
  • the underlying population is large enough that the chances of an individual being unique are quite small.

A counterargument that can be made is that people are now voluntarily (and involuntarily, through their friends and colleagues) revealing more and more about themselves on their blogs, Facebook pages, and tweets. This is certainly the case, and more is being revealed every day. Whether this type of self-exposure amounts to individuals revealing everything about themselves, such that an intruder has the same background knowledge as the person themselves, remains an empirical question. It is easy to argue, though, that we have not quite reached that point yet.

Another scenario to consider is when the following conditions are met:

  • the data set has some quasi-identifiers and some sensitive information (an example of the quasi-identifiers would be the demographics),
  • there are only two individuals in the data set that have exactly the same values on the quasi-identifiers,
  • one of those individuals, say Bob, gets the data set, and
  • Bob knows the second person who has the same characteristics, Joe.

Under these conditions, Bob would discover the sensitive information about Joe with certainty. Therefore, re-identifying one’s own record resulted in the disclosure of sensitive information about another individual.

The approach we have taken is to define plausible intruders (or archetypes of intruders) and assess what type of background knowledge they would have. The three we consider are: a neighbor, an ex-spouse, and a reporter. The reason we selected these three is that all of the re-identifications that have actually happened and been publicly acknowledged were performed by researchers, by reporters, or in court cases. All three types of intruders can use publicly available information (e.g., public registries), and all can talk to neighbors or ex-spouses to get additional information. Therefore, by focusing on these three types of intruders we are addressing plausible risks that we know have materialized.

Furthermore, we ensure that there are always more than two records with the same values on the quasi-identifiers. That way the re-identification of one’s own record does not facilitate the discovery of new information about someone else.
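The "more than two records" rule can be enforced with a simple check on the smallest equivalence class, in the spirit of k-anonymity. The field names and toy data below are illustrative:

```python
from collections import Counter

def min_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())

def safe_to_release(records, quasi_identifiers, k=3):
    """Require every quasi-identifier combination to cover at least k records,
    so that finding your own record reveals nothing certain about anyone else."""
    return min_group_size(records, quasi_identifiers) >= k

records = [  # toy data: age and sex are the quasi-identifiers, dx is sensitive
    {"age": 30, "sex": "F", "dx": "J45"},
    {"age": 30, "sex": "F", "dx": "E11"},
    {"age": 30, "sex": "F", "dx": "I10"},
    {"age": 45, "sex": "M", "dx": "K21"},
]
```

Here the (45, "M") combination appears only once, so the data set would fail the check until that record is generalized or suppressed.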

How easily can someone be identified from health data?

The concept of identifiability is critical to managing the privacy risks when collecting, using, and disclosing personal information. In this article we therefore present a framework to reason about this concept, and then at the end some important implications are discussed.

First, it is important to distinguish between directly identifying variables, quasi-identifiers, and sensitive variables.

If the disclosed database contains directly identifying variables then it is clearly personal information. However, a database containing only quasi-identifiers can still be personal information. This is a very important point when managing the risks of holding sensitive health information. The real examples below show how individuals’ identities could be determined using only the quasi-identifiers. In none of these examples was any directly identifying information included in the database, yet it was still possible to determine the identity of at least one individual (and in some cases most of the individuals in the database).

  • AOL search data: AOL put anonymized Internet search data (including health-related searches) on its web site. New York Times reporters were able to re-identify an individual from her search records within a few days.
  • Chicago homicide database: Students were able to re-identify a significant percentage of individuals in the Chicago homicide database by linking it with the social security death index.
  • Netflix movie recommendations: Individuals in an anonymized, publicly available database of customer movie ratings from Netflix were re-identified by linking their ratings with ratings on a publicly available Internet movie rating web site.
  • Re-identification of the medical record of the governor of Massachusetts: Data from the Group Insurance Commission, which purchases health insurance for state employees, was matched against the voter list for Cambridge, re-identifying the governor’s health insurance records.
  • Southern Illinoisan vs. The Department of Public Health: An expert witness was able to re-identify with certainty 18 of 20 individuals in a neuroblastoma data set from the Illinois cancer registry, and was able to suggest one of two alternative names for the remaining two individuals.
  • Canadian Adverse Drug Event Database: A national broadcaster aired a report on the death of a 26-year-old student taking a particular drug, who was re-identified from the adverse drug reaction database released by Health Canada. The broadcaster matched the information in the report to publicly available obituaries for that area of Ontario.
  • Prescription and diagnosis records of a patient re-identified: A neighbour re-identified the records of a hospital patient by knowing her approximate age, gender, approximate admission date, and postal code.

The implication from this observation is that data may still be personal information even after the removal or obfuscation of the directly identifying variables.

Five Level Model

In a recent article, we elaborated on a five-level model of data identifiability. Here we provide a brief summary of that model – please see the full paper for the details.

We can use this model to understand different types of data and the risks they imply.

Level 1 pertains to information that is clearly identifiable. For example, a database containing names, SSNs and financial transaction information about individuals would be clearly Level 1 on our identifiability scale. At this level no real effort is needed to re-identify an individual. If we have someone’s name and address, we know who they are.

Level 5 pertains to information that is clearly not identifiable. Aggregate information consists of counts. For example, a table showing that 25 people have died from H1N1 pandemic influenza in Canada in November would be considered aggregate data.

Masked data (Level 2) has had some manipulations done to the identifying variables. However, with masked data nothing has been done to obfuscate the quasi-identifiers. Here is a more detailed discussion of masking methods. Because the quasi-identifiers are not touched in Level 2 data, this is effectively still personal information.

The difference between Level 2 and Level 3 is that in the latter the organization attempts to obfuscate the quasi-identifiers as well as the identifying variables. Level 3 data is very common. This level exists because many organizations do not use sound means to de-identify the quasi-identifiers, and therefore do a poor job of it. For example, in a Canadian context, the date of birth and full postal code together uniquely identify many Canadians living in urban areas, making that combination of quasi-identifiers very high risk for re-identification. Reducing the precision of the postal code to five characters does little to reduce that risk, yet it is quite a common practice. We termed this level "Exposed" because the organization may believe that it has de-identified the data when in fact the risk of re-identification is still high. The data custodian therefore has a high risk exposure and may not know it (i.e., the custodian will not have put in place any controls to mitigate those risks).

It should also be noted that re-identifying "Exposed" data demonstrates nothing about the effectiveness of sound de-identification methods. In fact, one would expect "Exposed" data to be quite easy to re-identify.

With level 4 data an objective assessment of the re-identification risk is performed and a data custodian can substantiate claims that the data is properly de-identified. Level 4 data can be microdata or in tabular form; the same point about objective risk assessment applies for tabular data. It is only at level 4 that data moves from being personal information to not being personal information. Level 4 data is called “Managed” because the risk of re-identification is managed by the data custodian. The risk of re-identification may vary across organizations that disclose Level 4 data, but in all cases it is managed. This means that the custodian knows what the risk of re-identification is objectively (i.e., the custodian has measured it), and has taken reasonable actions to manage that risk.

Data at level 4 and level 5 present the least risk to the organization because a strong case can be made that this is not personal information any more.

Re-identification, Effort, and Skill

As data moves up the scale, more effort is needed to re-identify it. Therefore, even though level 2 and level 3 data are still considered personal information and the risk of re-identification is quite high, the amount of effort needed to re-identify an individual also increases as data moves up.


For example, consider a data set with full name, address, and date of birth. This would be a level 1 data set. In the level 2 version, the identifying variables have been masked and all that remains is a postal code and date of birth. With some effort those two demographic variables can be linked back to a person (e.g., to a full name and address). This re-identification takes more effort than for the level 1 version, where we already had the name and address. Similarly, level 3 data requires more effort to re-identify than level 2 data.

As a corollary, higher level data also takes more expertise and skill to re-identify than lower level data. A lay person can re-identify level 1 data, but it is quite likely that an intruder skilled at re-identification would be needed for data at levels 2 and 3. With sufficient effort and skill, however, the probability of successful re-identification remains high.

For data at level 4 and level 5, even an intruder with significant effort and skill will have a lower probability of re-identifying individuals in the disclosed data set. This is represented in the graph above as a disproportionate increase in the resources and skills an intruder needs to re-identify that kind of data. It is not impossible, but the prerequisites have increased dramatically.

Furthermore, level 4 and 5 data have a built-in disincentive for an intruder: it is not worth investing skills and time in re-identification if the probability of a successful match is low. The value of the re-identified data, whether economic, political, reputational, or based on some other criterion, would have to be quite high for an intruder to invest the necessary resources to attack a level 4 or level 5 data set. One consequence is that few people would be able and willing to re-identify such a data set. If a level 4 or level 5 data set is lost or stolen, it is not an automatic consequence that an attempt will be made to re-identify the data.

Another important consideration when we talk about re-identification effort is whether we are concerned about the re-identification of a single individual or many individuals in the disclosed data. Of course, the re-identification of many individuals will require more effort than a single individual. Therefore, when we talk about re-identification effort, we mean a normalized effort (i.e., effort to re-identify a single individual) so that the five level model applies irrespective of the data set size.


As a concrete example, let us consider a hypothetical clinical data set with the following variables: first name, last name, health insurance number, street address, six character postal code, date of birth, date of doctor’s visit, and whether the individual has a sexually transmitted disease. In the five level framework, the data sets would be:

Level 1. The full data set as is.

Level 2. The names are replaced by fake names, the health insurance number is replaced with a fake number, and the street address field is removed altogether.

Level 3. The data set at Level 2 also has the postal code generalized from six characters to five characters. The risk at Level 3 is the same as Level 2, but the organization believes it has de-identified the data and discloses it. Therefore, the organization is exposed.

Level 4. The data set at Level 3 is further modified by replacing the five character postal code with a single character, the date of birth with age, and the date of visit with the month of the visit. A re-identification risk assessment is then performed on this data set, and the risk is found to be below a pre-specified threshold.

Level 5. The number of individuals with a sexually transmitted disease.

In this example, the data in the last two levels would be considered de-identified data, but the first three would still be personal information.
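The level transformations in this example can be sketched in Python. The record below is hypothetical, and the age calculation is simplified to year arithmetic:

```python
# A hypothetical record from the clinical data set described above.
record = {
    "first_name": "Alice", "last_name": "Smith",
    "health_insurance_number": "1234-567-890",
    "street_address": "123 Main St",
    "postal_code": "K1H 8L1",          # six characters
    "date_of_birth": "1980-01-01",
    "visit_date": "2009-11-15",
    "std": True,
}

def to_level2(r):
    """Mask direct identifiers: fake names and number, drop the address."""
    out = dict(r)
    out["first_name"], out["last_name"] = "FAKE", "FAKE"
    out["health_insurance_number"] = "0000-000-000"
    del out["street_address"]
    return out

def to_level3(r):
    """Truncate the postal code to five characters; the quasi-identifiers
    are barely touched, so the risk is essentially unchanged."""
    out = to_level2(r)
    out["postal_code"] = out["postal_code"][:5]
    return out

def to_level4(r):
    """Generalize quasi-identifiers: one-character postal region, age
    instead of date of birth, visit month instead of visit date."""
    out = to_level3(r)
    out["postal_code"] = out["postal_code"][0]
    out["age"] = 2009 - int(out.pop("date_of_birth")[:4])   # simplified
    out["visit_month"] = out.pop("visit_date")[:7]
    return out
```

At Level 4 a risk assessment would still be run on the result; the transformations alone do not make the data "Managed".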

Data Protection

Arguably, it would require less investment and resources to protect levels 4-5 data compared to levels 1-3 data. This is one of the key advantages, from an organizational perspective, of de-identification.

Acknowledgements: the original idea for this model came out of discussions with Craig Earle of the Ontario Institute for Cancer Research and this is an adaptation & extension of a model he presented at the PHIPA conference in Toronto in the fall of 2009.

What is the re-identification risk from small simple counts of disease cases?

A custodian has been asked to release counts of people with a particular disease. For example, in 2008 four people had that particular disease in Ontario. Since the count is less than five, is there a re-identification risk in disclosing this information? To make the example concrete, let's say it is a rare disease, like botulism. Note that the custodian has not broken the numbers down by age group, gender, or any other demographic; these are simple counts.

When analyzing a situation like this we have to: (a) determine the plausible intruder scenarios, (b) decide whether the intruder could identify an individual, and (c) determine whether such identification would allow the intruder to discover something new about the patient. Based on such an analysis one can determine whether a plausible re-identification risk exists.

To make the intruder scenario concrete, let’s assume that the intruder is Bob and he is trying to re-identify Alice. The only way that Bob would know that Alice is in the database is if he already knew that she had botulism. Therefore, Bob would not gain any additional information by knowing that she is one of these four cases.

In such a situation, disclosing counts, even if they are for rare diseases and if the counts are below 5, does not necessarily represent a plausible disclosure risk.

Now extend the example: the geography is not the whole of Ontario but a small town, and everyone in town knows that a certain four people fell ill after eating at a restaurant because their names were posted in the local newspaper, but no one knows what was wrong with them. If a public health agency then reports that four people in that town had botulism, that would be problematic. The reason is that an intruder only needs to know that Alice was one of the people affected by the food-borne illness at that restaurant, and the disclosed number lets the intruder learn something new about Alice: that she has botulism.

Therefore, the disclosure of small counts (i.e., less than 5) does not necessarily indicate a disclosure risk. But, as the above example illustrates, the specifics of the case may change that general conclusion.


Are there any de-identification standards?

One question that often comes up is whether de-identification guidelines are already available today. This is important because existing statutes and regulations do not describe very precisely what needs to be done to de-identify data, especially health data. We have therefore compiled the following list of standards that one can use:

How can I de-identify longitudinal records?

At the outset, it is important to make a distinction between three types of longitudinal records that occur often in practice.

The first type consists of specific variables that are collected from all patients at specific points in time. For example, if function and quality of life data is collected every year as part of a cancer survivor study, then the same variables are collected from patients every year. For this type of data existing de-identification algorithms will work well, such as k-anonymity algorithms which use generalization and suppression.

We have two kinds of quasi-identifiers: the basic ones and the yearly ones. The basic quasi-identifiers will likely consist of demographics that do not change, such as date of birth and gender. The yearly ones would include things like where the patient lives at that point in time and perhaps some socio-economic variables. When using one of the existing de-identification algorithms, the yearly variables should be linked or correlated so that they are de-identified the same way. This linking makes the data easier to analyze, since having the same variable at different levels of generalization, for example, is not very useful from an analysis perspective.
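The idea of linked generalization, i.e., de-identifying the same yearly quasi-identifier to the same level across all years, can be sketched as follows (the postal codes and the truncation rule are hypothetical):

```python
def generalize_postal(code, level):
    """Generalize a postal code by truncation; level = characters kept."""
    return code[:level]

def generalize_linked(patient_years, level):
    """Apply the SAME generalization level to a yearly quasi-identifier
    across all years, so analysts never face mixed precisions of one
    variable for one patient."""
    return {year: generalize_postal(code, level)
            for year, code in patient_years.items()}

# Hypothetical yearly postal codes for one study participant.
yearly = {2007: "K1H 8L1", 2008: "K1H 8L1", 2009: "K2P 1L4"}
assert generalize_linked(yearly, 3) == {2007: "K1H", 2008: "K1H", 2009: "K2P"}
```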

The second type of longitudinal data consists of visits for each patient but there is an anchor visit. The difference between the above (first type) data and visit or encounter data is that each patient can have a different number of visits. This makes it difficult to analyze this kind of data with current de-identification algorithms. Having an anchor visit, however, can solve that problem. For instance, for cancer patients an anchor visit would be when diagnosis occurred. With an anchor we can then compute all other visits as relative dates. For example, visit 1 after diagnosis would be +15 days. The actual date of diagnosis is not disclosed and only intervals are disclosed.

If the only demographic information that needs to be analyzed for each visit is its date, and there is an anchor visit, then converting all dates to relative dates makes it unnecessary to de-identify the visit data. The reasoning is that an intruder would be very unlikely to know something as specific as intervals between visits to launch a re-identification attack, and intervals are less likely to be unique than actual dates.
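Converting to relative dates is straightforward; a minimal sketch in Python, with hypothetical visit dates:

```python
from datetime import date

def to_relative_days(anchor, visit_dates):
    """Express each visit as days since the anchor visit (e.g. diagnosis);
    the anchor date itself is never disclosed, only the intervals."""
    return [(v - anchor).days for v in visit_dates]

diagnosis = date(2009, 3, 1)                      # hypothetical anchor visit
visits = [date(2009, 3, 16), date(2009, 5, 1)]
assert to_relative_days(diagnosis, visits) == [15, 61]
```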

The third type of longitudinal data is the same as the one above except that there is no anchor visit and/or it is necessary to disclose additional demographics for each visit. A good example of this is EMR data. Here there is no real anchor event to use across all patients, and often other demographics are useful to disclose about each visit, for example, where the patient lives, how many children they have, whether they are married – all of which can change from visit to visit.

For the third type of data there are a number of new algorithms that are being developed that will specifically deal with this kind of data. But this is still at the research stage.

How can I safely release data to multiple researchers?

Let’s break this into scenarios.

First Scenario:

We have a data custodian who wants to disclose data to researcher A and researcher B. Each researcher will get a different set of variables, but the two data sets pertain to the same individuals/patients. This is a rather simple scenario because there are no overlapping variables between the two researchers' data sets.

We then assume that there will be collusion between A and B in that they will bring their two data sets together and try to match the records. If they are successful then the variables in the two data sets will be known for all of the individuals in the data. This may not have been the intention of the data custodian (i.e., researcher A may not have had permission to see the variables given to researcher B).

Of course, in the above scenario you can replace researcher with any other type of data recipient, such as government department, journalist, or a combination of data recipient types. The same principles apply.

The way to handle this scenario is to shuffle the records that are disclosed to each researcher. That way, if the researchers try to match the two data sets, they cannot use positional information to do so.

If the data sets given to both researchers pertain to exactly the same individuals and they are both of size N, then if the two researchers try to match the records randomly the proportion of records that would be correctly matched is 1/N, on average.

If researcher A had N records on N people and researcher B had n records on n people where n<N then the proportion of B’s records that would be correctly matched is still 1/N, on average, and the proportion of A’s records that would be matched correctly is n/(N^2), on average.

Therefore, as the size of the disclosed data sets increases, the proportion of correct matches decreases. Furthermore, it should be noted that under this scenario the intruder would not know which records were correctly matched, only that a certain proportion of them were.

If the proportion of records that would be matched successfully after shuffling is small, and the researchers will not know which records were matched successfully, then this acts as a strong disincentive to matching the two data sets. Therefore, always shuffle your data before disclosure.
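The 1/N result can be checked by simulation; a minimal sketch in Python (the number of records and trials are arbitrary):

```python
import random

def expected_random_match_rate(N, trials=2000, rng=random.Random(0)):
    """Simulate two researchers randomly aligning their shuffled records on
    the same N patients; return the average proportion of correct matches."""
    ids = list(range(N))
    total = 0.0
    for _ in range(trials):
        guess = ids[:]
        rng.shuffle(guess)               # positional information destroyed
        total += sum(g == i for i, g in enumerate(guess)) / N
    return total / trials

# For two shuffled data sets of size N, the expected proportion of correct
# random matches is 1/N.
assert abs(expected_random_match_rate(50) - 1 / 50) < 0.005
```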

Scenario Two:

This time a data custodian is providing data to two researchers, A and B. The two data sets have some overlapping variables, for example, they may both have the patients’ date of birth and postal codes. Also, the two data sets have no directly identifying variables in them. We will call these variables the quasi-identifiers. The custodian is concerned about the two researchers colluding and trying to link the two data sets together. The custodian does not want them to link the two data sets together because of privacy or other legal concerns. How can the custodian assess that risk?

In a recent paper (to appear in PAIS 2010 and also attached to this article) we have developed some metrics that can be used to evaluate the proportion of records that can be re-identified if the two researchers try to link the data sets on the quasi-identifiers.

If the two data sets have N records and they have the same patients in them, then the proportion of records that can be correctly matched if the two researchers try to link their data sets is J/N. Here, J is the number of different values (called equivalence classes) on the quasi-identifiers. For example, {1/1/1980, K1H 8L1} would be one of the equivalence classes.

If one of the data sets is a sample of the other, with the smaller data set having n records, with n<N, then the equation is a bit more complicated and described more fully in the paper (equation 1). This will give you the proportion of the records in the smaller data set that will be correctly matched if the two researchers try to link their data sets.

Note that the two researchers will not know which records were successfully matched. For example, if the proportion produced by either of these two metrics is, say, 0.4, then it is not known which 40% of the records were correctly matched. Therefore, for the matching exercise to be worthwhile, the researchers need confidence that the match rate is high. It can be argued that any expected match rate below 80% makes the exercise too unreliable to be worth the researchers' while (i.e., fewer than 80% of the records would be correctly matched).
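The J/N metric can be computed directly from a disclosed data set; a minimal sketch in Python, with hypothetical quasi-identifier values:

```python
def expected_match_proportion(records, quasi_identifiers):
    """J/N: the number of distinct quasi-identifier combinations
    (equivalence classes) J over the N records. This is the expected
    proportion correctly matched when two releases covering the same N
    patients are linked on those variables."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    return len(set(keys)) / len(keys)

# Hypothetical release: four records in two equivalence classes, so J/N = 2/4.
release = [
    {"dob": "1/1/1980", "postal": "K1H 8L1"},
    {"dob": "1/1/1980", "postal": "K1H 8L1"},
    {"dob": "5/6/1975", "postal": "K2P 1L4"},
    {"dob": "5/6/1975", "postal": "K2P 1L4"},
]
assert expected_match_proportion(release, ["dob", "postal"]) == 0.5
```

A custodian could run this over a planned release as one of the cross-checks described below.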

A data custodian can then run cross-checks on their data releases to see if the risk arising from such potential collusion is acceptable.

The above analysis also applies if two different data sets have been disclosed to the same researcher at two different instances of time.

Read More: 2010 – A Method for Evaluating Marketer Re-identification Risk

How should de-identification be incorporated into a research ethics review process?

In this article we present a recommended way to integrate de-identification into an REB review process. It is based on our experience: we have tried many alternative processes and found that this approach works well in practice. The context for this article is a research protocol that involves the secondary use of some existing data. The process can be adapted for interventional or prospective research protocols, but the example used here is secondary use.

A basic work flow diagram illustrating this process is shown below.


The key players are the researcher herself, a scientific review committee, a data access committee (DAC), the REB, and a database administrator. The scientific review committee may be a committee formed by a funding agency, or peers at a particular institution. We do not really care how the scientific review is done, but it is included here to illustrate a few key points. The DAC consists of de-identification and privacy experts who would perform a re-identification risk assessment on the protocol.

The DAC (or rather, members of the DAC) needs to have access to tools that can perform re-identification risk assessment. These tools would also need to be able to analyze the original data being requested in order to perform that risk assessment. The database administrator is responsible for the data with PHI and has an appropriate de-identification tool in place to de-identify the data.

The researcher submits the protocol to the scientific review committee and the DAC at the same time. The reason is that in theory there may be some iteration between the scientific review process and the DAC process.

The DAC would perform a re-identification risk assessment and decide how to adequately de-identify the data set requested by the researcher. This activity will require some access to the data in order to perform the risk assessment. The re-identification risk assessment process may result in changing the precision of the data that is being requested. For example, the original protocol may request admission and discharge dates, but the risk assessment may recommend replacing that with length of stay in days. Such changes in data may require changes in the protocol as well. If the protocol changes then the scientific review may have to be revisited. Also, during the scientific review methodological or theoretical issues may be raised, which may affect the requested data elements. If the requested data elements change, then the re-identification risk assessment may have to be revisited. Therefore, at least conceptually, there is potentially some interaction and possibly iteration between the scientific review process and the re-identification risk assessment process performed by the DAC.

In practice the interaction between scientific review and DAC review is often not possible because of the way peer review is often structured (e.g., with the research funding agencies). Therefore, since there is likely not to be any interaction or iteration between scientific review and data access review, we can save time by doing these activities in parallel, or sequence them and hope for the best!

If either the scientific review committee or the DAC does not approve the protocol, it goes back to the researcher for revision. If the scientific review committee approves the protocol, it provides some kind of approval documentation, such as a letter.

The researcher provides the DAC with the protocol as well as a variable checklist. This checklist is quite important in that it clarifies the exact fields being requested. It also highlights to the researcher which fields in the requested database are quasi-identifiers and may therefore undergo some kind of generalization and suppression. The checklist allows the researcher to indicate the level of data granularity they will accept. For example, a researcher may be willing to get the year of birth instead of the full date of birth. Specifying this explicitly up-front in the checklist can significantly reduce the number of iterations between the researcher and the data access committee. The checklist also contains weights for the quasi-identifiers to indicate their importance. For example, a weight of 1 would mean that a particular variable is especially important and should be minimally affected by the de-identification, while a low weight indicates that a quasi-identifier is, relatively speaking, less important to the eventual analysis. The more trade-offs the researcher makes up-front, the quicker the re-identification risk analysis.

An example of such a checklist (or part of it) is attached to this article. This is an example from the Ontario birth registry.
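A checklist of this kind can be represented as simple structured data. The sketch below is purely hypothetical and only illustrates the kinds of fields (requested variable, quasi-identifier flag, acceptable granularity, weight) such a checklist would carry:

```python
# Hypothetical variable checklist entries for a data request.
# "acceptable_granularity" is the least precise form the researcher will
# accept; a weight near 1 marks a quasi-identifier that the de-identification
# should disturb as little as possible.
checklist = [
    {"variable": "date_of_birth", "quasi_identifier": True,
     "acceptable_granularity": "year", "weight": 1.0},
    {"variable": "postal_code", "quasi_identifier": True,
     "acceptable_granularity": "first character", "weight": 0.3},
    {"variable": "lab_result", "quasi_identifier": False,
     "acceptable_granularity": "exact", "weight": None},
]

# The DAC's risk assessment only needs to weigh the quasi-identifiers.
quasi = [e for e in checklist if e["quasi_identifier"]]
assert len(quasi) == 2
```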

The DAC determines how to appropriately de-identify the data given the risks, and negotiates that with the researcher. This negotiation may take a number of iterations, which would be rapid because a single individual from the DAC is assigned to negotiate with the researcher. The objective is not to create another layer of bureaucracy, but to have a negotiation and provide data to facilitate making trade-offs. The output from this process would consist of two things:

Risk Assessment Results. These would consist of a report indicating the de-identification that will be applied as well as the residual risk in the data set that will be disclosed.

Data Sharing Agreement. Because the amount of de-identification would normally be contingent on the security and privacy practices that the researcher has in place, the researcher must commit to implementing these practices in a data sharing agreement. Such an agreement would not always be needed. For example, if a researcher is an employee of a hospital and the data comes from the hospital, then the researcher would be bound by her employment contract which should cover the handling of sensitive patient data. However, if the researcher is external to the hospital or at a university, then a data sharing agreement would most certainly be recommended. Note that a different data sharing agreement would be needed for every project because the specific terms may vary depending on the data (sub-) set required.

Once the REB receives these two items, it will have sufficient evidence that the residual risk of re-identification is acceptably low and will have the terms of the data sharing agreement that the researcher will be signing for this particular data release. Many Canadian REBs will waive the requirement to obtain patient consent if they are convinced that the requested data set is de-identified. And now the REB can perform a regular ethics review knowing that the privacy issues have been addressed.

If the REB approves the protocol, then this information is conveyed to the database administrator who would then create a data set according to the risk assessment report. The database administrator would then provide the data to the researcher in some secure format.

If the REB does not approve the protocol for reasons not related to re-identification risk, then the researcher would have to resubmit the protocol at some later point. If the protocol is not approved because of an issue related to re-identification risk, then the researcher would have to go through the process again with the DAC to perform another risk assessment.

Is sampling sufficient to de-identify a data set?

Sampling means drawing a subset of the rows from the data set and disclosing those instead of the complete data set. Sampling is sometimes used because it thwarts a prosecutor-type attack by making it difficult for an intruder to know whether a specific target individual is in the disclosed data set. For example, if we want to disclose a breast cancer data set and the intruder knows that Alice has breast cancer, then the intruder will know that there is a record belonging to Alice in a population registry of breast cancer patients. However, if only a sample is disclosed, the intruder will not know whether Alice's record is in the disclosed sample. Therefore, even if a record in the sample matches Alice's particulars (say, postal code and date of birth), it will not be known whether it is truly Alice's record. Such uncertainty would, in principle, deter an intruder from even attempting to re-identify Alice's record.

Sampling is only effective if the sampling fraction is relatively small. For example, if the sampling fraction is 99% then the disclosed data set is almost as good as the population registry. However, say, a 25% sample would create considerable uncertainty as to whether Alice is in the sample or not.

There are three problems with sampling, however. The first is that many data users, for example, researchers, would be very upset if a large data set existed and they were only given a small subset of it for their analysis. The smaller sample means a reduction in statistical power, and hence, the ability to detect statistically significant relationships or effects is reduced. If the power is too low then it may not be worth doing an analysis at all.

The second issue is that if an individual is unique in the population on, say, postal code and date of birth, and these two variables are included in the sampled data set, then the intruder may still look for Alice's record in the disclosed sample. If the intruder finds a match, he will know with certainty that the record belongs to Alice. If he does not find a match, he will know that Alice was not included in the sample. Therefore, sampling provides no protection when a sample unique is known to be a population unique.
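This failure mode can be illustrated with a minimal Python sketch (the registry, the quasi-identifier values, and the sampling fraction are all hypothetical):

```python
# Hypothetical population registry of quasi-identifier combos
# (postal code, date of birth). Alice's combo is population-unique;
# everyone else shares one common combo.
population = [("K1H 8L1", "1980-01-01")] + [("K2P 1L4", "1975-06-05")] * 99

def match_is_certain(sample, population, combo):
    """Sampling protects only while the intruder is unsure that a matching
    record is the target's. If the combo is population-unique, any record
    in the sample with that combo is certainly the target's."""
    return population.count(combo) == 1 and combo in sample

alice = ("K1H 8L1", "1980-01-01")
sample = population[:25]        # a 25% sample that happens to include Alice
assert match_is_certain(sample, population, alice)
```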

The third issue is that sampling does not protect against journalist risk. An intruder may attempt to match the disclosed sample with another public database. Depending on the variables that are included in the disclosed sample, the intruder may find a correct match and re-identify individuals. In fact, probabilistically, the sampling fraction in that case has little impact on the probability of a correct match.

Therefore, sampling should be considered as only one of a number of strategies that can be used to manage re-identification risk. It is not sufficient to rely on sampling only as a means to protect disclosed data from re-identification.


Should de-identified data go through a research ethics review?

This question pertains to research protocols that require the analysis of existing data for secondary purposes or protocols where new data is being collected in de-identified form. An example of the former is when a protocol requires the analysis of data from a disease registry. An example of the latter is when the study involves the observation of an event or procedure and all data collected during that observation is considered to be de-identified.

In the current version of the Canadian TCPS (Tri-Council Policy Statement), the answer to the above question is no. The TCPS is the document used by Research Ethics Boards (REBs) to guide their conduct. It is only a guidance document, as REBs in Canada are currently not regulated; in practice, however, REBs do try their best to follow the TCPS.

There are at least two ways that REBs implement the guidance of not having to review de-identified data. Both of these are in use today by Canadian REBs.

In the first approach, the REB form has a checkbox asking the investigator whether the data is de-identified. If the investigator checks that box, the REB does not review the protocol and it is automatically approved. The reasoning is that since the data is de-identified, there is no requirement to review the protocol. The problem with this approach is that the decision as to whether the requested data is de-identified is made by the investigator, who has a vested interest in the claim that the data is de-identified. The investigator is therefore not the best person to make a subjective self-declaration about identifiability. Moreover, deciding whether a particular data set is de-identified is not trivial, and most investigators do not have expertise in identifiability. Even if their motives are absolutely pure and there is not a conflicted bone in their being, an investigator would still be challenged to make such a declaration without a careful, and perhaps sophisticated, analysis of the data set. Such a self-declaration would therefore be considered suspect.

This first approach is often taken because of resourcing constraints. It provides a way for the REB to reduce its workload by not having to address protocols that are deemed lower risk than other types of protocols (e.g., those that involve a medical intervention or treatment). However, it can carry considerable risks.

In the second approach, if the investigator makes the self-declaration that the data set is de-identified then the protocol goes through an expedited review by the REB (sometimes referred to as a delegated review, because a sub-committee reviews it on behalf of the full board). If during this expedited review questions arise about whether the data collected or disclosed is truly de-identified, then the protocol may go through a full board review. This second approach is much more sensible and addresses the concerns raised with the first approach because some oversight remains.

A more general issue raised by not requiring REB review of protocols dealing with de-identified data is that it provides a mechanism for protocols not to receive proper oversight. Even if a protocol will only use truly de-identified data, there are group harms that are not addressed by the identifiability question. For example, a protocol may deal with a sensitive health issue in an Aboriginal group, where there is a real risk that the results will stigmatize that group. Or consider a study of pollution in a particular neighborhood, which may have a negative impact on insurance and housing prices for people living in that neighborhood. In both of these cases the protocol may collect or analyze only truly de-identified data, but it should still go through an REB review because of the group harm concerns. An institution that adopts the first approach above would essentially create a loophole for group harms in protocols with de-identified data.

Therefore, the general recommendation that allows REBs to manage risks appropriately is to use expedited reviews for protocols that are self-declared to use de-identified data. The purpose of the expedited review is to provide oversight of these self-declarations and to check for other ethical issues, such as group harms, which may be relevant for the particular protocol even if the data are de-identified.

Should we de-identify if technology is moving so fast?

It is sometimes argued that re-identification technology is advancing all the time, that new databases useful for linking are continually being made available, and that it is therefore futile to de-identify any data set. There are two counterarguments to this view.

First, if we adopted the “advances in technology will happen” argument then there would be no point in using encryption technology either. New ways are constantly being devised to break existing encryption algorithms, through faster computers or cleverer algorithms, and we know this is likely to happen. When it does, material that was encrypted with the old technology may be compromised. We hope that this will happen far enough in the future that the compromised information has little value.

In the case of de-identification, we can do something else. With the exception of data sets that are disclosed in the public domain (e.g., on web sites), we can impose additional restrictions such as data sharing agreements and audits on the data recipients as a way to ensure good behaviour. That way the custodian can still maintain some control even if technology does advance in the future to make it easier to re-identify individuals. Such agreements can have stipulations for data destruction as a way to mitigate the risk of re-identification becoming easier in the future, and have provisions prohibiting re-identification of the data.

What are some real world de-identification examples?

These articles describe in detail how technologies incorporated in PARAT have been used to de-identify actual data sets. They demonstrate the risk analysis that is performed and show how de-identification methods can be used in practice.

  • Canadian Journal of Hospital Pharmacy: prescription and diagnosis data [download here]
  • BMC Medical Informatics and Decision Making: hospital discharge abstract data [download here]

What de-identification software tools are there?

There are five de-identification tools that are generally available. These tools work on structured data. There are other tools that focus specifically on free-form text, but these are not covered here.

Also, it is important to make a distinction between de-identification tools and masking tools. The latter do not really provide adequate protection for personal information. There are many masking tools on the market (about two dozen vendors offer tools with widely varying functionality). A more detailed description of the difference between de-identification and masking is provided in this article.

Beyond the five de-identification tools described below, the tools that exist are internal to organizations and therefore are not generally available, or have been developed for personal use (by researchers) and therefore have not been applied broadly.

The five generally available de-identification tools are:

  • The PARAT tool from Privacy Analytics Inc. implements comprehensive risk management for three types of identity disclosure risk. More information about this product is available from here.
  • mu-Argus, developed by Statistics Netherlands (the Dutch national statistical agency). More information about mu-Argus can be found here and the tool itself can be downloaded from here.
  • The Cornell Anonymization Toolkit (CAT) implements a k-anonymity algorithm. It is an open source tool available here, with documentation available here.
  • The University of Texas at Dallas Anonymization Toolbox, which contains open source Java implementations of some k-anonymity and attribute disclosure control algorithms, with documentation.
  • The sdcMicro package in R provides some basic de-identification functions. You can download it from here.
Tools Assessment

The only tool that is commercially available and actively supported is PARAT from Privacy Analytics. Another useful point of comparison is that the algorithm implemented in PARAT has been shown in a recent article to perform better than the algorithm implemented in CAT. Furthermore, the risk estimator used in PARAT has been shown to produce more accurate de-identification results than the one incorporated in mu-Argus.

The UTD toolbox includes some of the same algorithms as CAT. This toolbox contains a set of capabilities rather than a tool that is ready to use by an end-user (e.g., an analyst), and therefore is targeted more at developers. It is also not actively supported as a product.

We spent some time evaluating the CAT tool. It has a significant number of usability issues. For example, we were unable to find where the value of k for the k-anonymity algorithm is set, it was not possible to view data by equivalence class, and the data views repeated the same record id every 60 records. The tool also cannot import standard data file formats. The lack of documentation and support made using it difficult, and we found it quite buggy. While it may have been good enough to complete a Master’s thesis project, it clearly lacks the functionality and robustness needed for broader use.

The sdcMicro package cannot handle large data sets and crashes often; we have had many problems working with it on our data sets. It is a decent tool for experimenting with de-identification techniques, but it is not suitable for de-identifying real data sets.

Note that de-identification tools are different from masking tools. The first attached document provides an overview of de-identification techniques and explains at some length the differences between these two approaches and when each is more suitable.

The second attached document is a report produced by Canada Health Infoway that contains an overview of de-identification techniques as well as a summary of the tools that are available on the market today.

Further Reading: 2009 – Tools for De-Identification of Personal Health Information

What methodology should we use to approach de-identification?

Our approach to re-identification risk assessment and de-identification is risk-based. The following documents describe this general approach in more detail.

  • Dispelling the myths surrounding de-identification (Office of the Information and Privacy Commissioner of Ontario) [download here]
  • A positive-sum paradigm in action in the health sector (Office of the Information and Privacy Commissioner of Ontario) [download here]
  • Risk-based de-identification of health data (IEEE Security and Privacy) [download here]
  • Methods for the de-identification of electronic health records for genomic research (Genome Medicine) [download here]
  • De-identification: Reduce privacy risks when sharing personally identifiable information (Privacy Analytics whitepaper) [download here]

Which type of threshold should we use for de-identification?

Many types of thresholds have been suggested and used for deciding when a data set is de-identified. Some common ones are:

  • Cell size of 5, 3, or 10
  • Uniqueness
  • Rareness

A question that comes up in practice is “which threshold should we use?”

In fact, all three of these are related. The general rule is:

  X% of the records are in cell sizes >= k (or equivalence classes of size k)

A common instantiation, called 5-anonymity, is:

  100% of the records are in cell sizes >= 5

This means that every combination of values on the quasi-identifiers that appears in the data occurs at least five times.

The uniqueness criterion can be stated as 2-anonymity:

  100% of the records are in cell sizes >= 2

There are, however, cases where 95% or 80% are acceptable values for X.

For example, some cancer registries release their data to researchers if less than 20% of their records are unique, and to the public if less than 5% of their records are unique.
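
The cell-size rules above can be sketched as a simple check: group records by their quasi-identifier values, count the size of each equivalence class, and compute the share of records in classes of size at least k. This is a minimal illustration in plain Python, not the algorithm of any particular tool; the records and quasi-identifiers are hypothetical.

```python
from collections import Counter

def share_in_classes_of_size_k(records, quasi_identifiers, k):
    """Return the fraction of records whose equivalence class
    (records sharing the same quasi-identifier values) has size >= k."""
    # Count the size of each equivalence class.
    class_sizes = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    in_large_classes = sum(
        size for size in class_sizes.values() if size >= k
    )
    return in_large_classes / len(records)

# Hypothetical data set with two quasi-identifiers.
records = [
    {"age": "50-59", "sex": "M"},
    {"age": "50-59", "sex": "M"},
    {"age": "50-59", "sex": "F"},
    {"age": "60-69", "sex": "F"},
]
qi = ["age", "sex"]

# Uniqueness (2-anonymity) check: is every record in a class of size >= 2?
print(share_in_classes_of_size_k(records, qi, k=2))  # 0.5
```

With X = 100% and k = 5 this check implements the 5-anonymity rule; lowering X to 95% or 80% corresponds to the relaxed thresholds mentioned above.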

The third criterion, rareness, means one has to ensure that there are no rare records. The general rule here is:

  all equivalence classes cover more than X% of the population

This rule ensures that there are no equivalence classes that are relatively rare. Note that rareness is defined in terms of the population, not in terms of the records in the data set.

For example, some national statistical agencies will not disclose census information if any equivalence class covers 0.5% of the population or less. This rule is used to justify not releasing individual ages above 89 years: each individual age of 90 or above accounts for less than 0.5% of the population.
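
A rareness check against population counts can be sketched as follows; the population counts and the 0.5% threshold used here are purely illustrative.

```python
def rare_classes(population_counts, threshold=0.005):
    """Return the equivalence classes whose share of the population
    is at or below the threshold (e.g., 0.5%)."""
    total = sum(population_counts.values())
    return [
        cls for cls, count in population_counts.items()
        if count / total <= threshold
    ]

# Hypothetical population counts by age range.
population = {"0-89": 995_000, "90+": 5_000}
print(rare_classes(population))  # ['90+']
```

Any class returned by this check would need to be generalized (e.g., top-coded into a broader range) before disclosure under the rareness criterion.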

The question, then, is which of the above rules should be used, and with what values? There are no hard rules on this, but a reasonable approach is to rely on precedent.

The argument for using precedent is that it signifies acceptability. If a particular rule has a lot of precedent then it suggests that society has accepted the level of risk implied by the rule. For example, there is a lot of precedent spanning multiple decades for the cell size of five rule, so it is safe to assume that this is a generally accepted level of risk.

Precedent may be specific to a certain type of data or registry. For example, some precedents may be more acceptable for the disclosure of cancer registry data, but may not be acceptable for sexually transmitted disease or mental health data. Also, of course, it will depend on who the data is being disclosed to.


Why can’t we just add noise to the data to de-identify it?

A method that is sometimes used to de-identify data sets is to add noise to the values of the variables. For example, a random number of days is added to a date of birth to create a perturbed date of birth. Noise can also be added to location data by moving a postal code to a randomly selected adjacent postal code.
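
Date noise of this kind can be sketched as below; the ±30-day range is an arbitrary choice for illustration, not a recommended parameter.

```python
import random
from datetime import date, timedelta

def perturb_date(d, max_shift_days=30, rng=random):
    """Shift a date by a random number of days in [-max_shift, +max_shift]."""
    shift = rng.randint(-max_shift_days, max_shift_days)
    return d + timedelta(days=shift)

dob = date(1959, 1, 1)
perturbed = perturb_date(dob)
# The perturbed date always lies within 30 days of the original.
assert abs((perturbed - dob).days) <= 30
```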

In practice we have found that data recipients do not like this approach to de-identification because they can no longer trust the data. For example, if we have a 50 year old male with cancer, it would not be known whether he is really 50, 45, or 55 years old. The shift in age may make a difference in the analysis and in the conclusions drawn. Data recipients are concerned about drawing incorrect conclusions from the data because of the perturbation.

The same mistrust issues come up with another technique called “microaggregation”. The basic idea here is to identify a cluster of similar records in the data set and then replace the actual values with the average (or median) of that cluster. For example, the age would be replaced with the average age of the cluster. This is similar to the approach called “hot-deck imputation” that is used to deal with missing data. Again, the data recipients’ reaction has been that they cannot trust the values. If a cluster inadvertently contains an outlier or an influential observation then the average may be distorted excessively, and potentially incorrect or inaccurate conclusions drawn.
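
A minimal sketch of microaggregation on a single numeric variable: sort the values, partition them into clusters of (up to) a fixed size, and replace each value with its cluster mean. Real implementations cluster on multiple variables and guarantee a minimum cluster size; this simplified version may leave a smaller final cluster.

```python
def microaggregate(values, cluster_size=3):
    """Replace each value with the mean of its cluster of similar values."""
    # Sort indices by value so that clusters contain similar records.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for start in range(0, len(order), cluster_size):
        # Note: the final cluster may be smaller in this simplified sketch.
        cluster = order[start:start + cluster_size]
        mean = sum(values[i] for i in cluster) / len(cluster)
        for i in cluster:
            result[i] = mean
    return result

ages = [23, 25, 27, 50, 52, 54]
print(microaggregate(ages, 3))  # [25.0, 25.0, 25.0, 52.0, 52.0, 52.0]
```

The example also shows the mistrust problem described above: an outlier in a cluster would pull the replacement mean away from every other member's true value.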

Along the same lines, this is the reason we have found data recipients and analysts reluctant to use synthetic data. In principle, synthetic data is not real data and therefore there are no identity disclosure risks in releasing it. Also, in principle, the basic (bivariate) correlational structure of the data is maintained in the synthetic data. But if an analysis is complex, the distributions are non-standard, and the multivariate correlation structure is not captured in the synthetic version, then some relationships may not be detected, or may be incorrectly detected, in the synthetic version.

The approach that is used more often, at least in the context of health data sets, is to generalize the variables. For example, the date of birth of a cancer patient born on 1 January 1959 may be generalized to just January 1959, or even to just 1959. The generalized value is still true; it simply has less precision. Therefore, the data can be trusted and the risk of drawing incorrect conclusions is reduced.
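
Generalizing a date of birth amounts to truncating its precision, as in this sketch; the level names ("month", "year") are hypothetical labels chosen for illustration.

```python
from datetime import date

def generalize_dob(d, level="year"):
    """Reduce a date of birth's precision rather than distorting its value."""
    if level == "month":
        return f"{d.year:04d}-{d.month:02d}"  # e.g., "1959-01"
    if level == "year":
        return f"{d.year:04d}"                # e.g., "1959"
    return d.isoformat()                      # full precision

dob = date(1959, 1, 1)
print(generalize_dob(dob, "month"))  # 1959-01
print(generalize_dob(dob, "year"))   # 1959
```

Unlike the noise and microaggregation examples, every generalized value here remains a true (if coarser) statement about the patient.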

Data Synthesis

What is synthetic data?

Synthetic data is generated from real data, but is not real data. It is “fake”, or synthesized, data that mimics the statistical properties of the real data. For some use cases (e.g., data science and software testing) it can act as a proxy for real data.
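
As a toy illustration of the idea (not any particular data synthesis method), one can resample each variable independently from its empirical distribution in the real data. This preserves the marginal distributions but, as noted earlier in this FAQ, not the correlations between variables; real synthesis methods model the joint structure.

```python
import random

def synthesize(records, n, rng=random):
    """Generate n fake records by sampling each column independently
    from its empirical distribution in the real data.
    Marginals are mimicked; correlations between columns are not."""
    columns = {key: [r[key] for r in records] for key in records[0]}
    return [
        {key: rng.choice(vals) for key, vals in columns.items()}
        for _ in range(n)
    ]

# Hypothetical real data.
real = [{"age": 50, "sex": "M"}, {"age": 45, "sex": "F"}, {"age": 62, "sex": "F"}]
fake = synthesize(real, 5)
# Every synthetic value comes from values observed in the real data,
# but the (age, sex) combinations need not occur in any real record.
```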