It is important to be able to categorize variables in a data set according to their role in re-identification because it helps us reason about risk. Below is one categorization that we have found useful.
The specific scenario we are looking at is that of a data custodian disclosing health information for a secondary purpose. The data custodian needs to be able to assess the identifiability of the data. We assume that the data in question has variables and that each record pertains to an individual. This individual can be a customer, an employee, or a patient. The variables can be divided into four groups:
Directly Identifying Variables. One or more of these variables can be used to uniquely identify an individual directly, either by themselves or in combination with other readily available information.
For example, a person's name is a directly identifying variable. A person's name does not necessarily identify an individual uniquely, however. There are more than 280 men named "John Smith" in Ontario, so in that case the full name by itself cannot identify the individual. Combining the name with another directly identifying variable, such as a phone number, allows the individual to be identified uniquely. There will be exceptions, of course, because some names are highly distinctive.
As another example, consider an email address. An email address usually identifies a person uniquely because it is almost always the case that a person is assigned a unique email address. An email address such as "john.smith@myco.com" will pertain to a single individual working at MyCo. Again, there will be exceptions: an email address such as "banana55@hotmail.com" does not tell you much about the identity of an individual.
Some directly identifying variables can only be used for re-identification in conjunction with other information. For a variable to be directly identifying, this other information has to be readily available, for example, through the telephone white pages, a Google search, or access to a company database. For example, an SSN is directly identifying if an intruder has a database containing SSNs. If an organization uses SSNs as unique identifiers for its clients and the intruder has access to that database, then the SSN is a directly identifying variable.
Other examples of directly identifying variables include telephone number, health insurance card number, credit card number, and social insurance number.
Whether a variable is a directly identifying variable therefore depends on the context. It depends on what other information an intruder would plausibly have ready access to, and it will vary by record. Since we need to make statements about the data as a whole, we will say that if a variable is directly identifying for some of the records, then we will treat it as a directly identifying variable for all of the records.
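The rule of treating a variable as directly identifying for all records whenever it is directly identifying for some of them can be sketched in a few lines of Python. This is only an illustration: the function name and the per-record flags are hypothetical, and in practice each flag would reflect a judgment about what an intruder could plausibly look up for that record.

```python
def treat_as_directly_identifying(per_record_flags):
    """Apply the data-set-level rule to per-record judgments.

    per_record_flags: iterable of booleans (hypothetical input), one per
    record, where True means the value in that record could identify the
    individual directly given readily available information.
    """
    # If the variable is directly identifying for any record,
    # it is treated as directly identifying for every record.
    return any(per_record_flags)


# Hypothetical email column: the first and third addresses reveal identity
# (e.g. john.smith@myco.com), the second does not (e.g. banana55@hotmail.com).
email_flags = [True, False, True]
email_is_direct = treat_as_directly_identifying(email_flags)
```

Here `email_is_direct` is true, so the whole email column would be handled as a directly identifying variable even though some addresses reveal little.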
Quasi-identifiers. One or more quasi-identifiers can be used to probabilistically identify an individual, either by themselves or in combination with other available information. This definition sounds quite similar to the one above, but there are two differences. Quasi-identifiers do not necessarily make an individual unique and the auxiliary information that is needed for re-identification may not be readily available.
For example, it is more difficult for an intruder to know an individual's exact date of birth or exact date of admission to a hospital compared to knowing an individual's name. Put another way, people are more willing to divulge their name and email address, say, than they are willing to divulge their date of birth and postal code. Therefore, it takes more effort, time, money, and skill to develop the background information to re-identify individuals using quasi-identifiers compared to direct identifiers.
A key condition for a variable to be a quasi-identifier is that it has to be plausible for an intruder to be able to get background information on that variable about the individuals in the disclosed data set. Additional information about getting background knowledge is provided in this knowledgebase article.
Examples of quasi-identifiers include sex, date of birth or age, geocodes (such as postal codes, census geography, information about proximity to known or unique landmarks), language spoken at home, ethnic origin, aboriginal identity, total years of schooling, marital status, criminal history, total income, visible minority status, activity difficulties/reductions, profession, event dates (such as admission, discharge, procedure, death, specimen collection, visit/encounter), codes (such as diagnosis codes, procedure codes, and adverse event codes), country of birth, birth weight, and birth plurality.
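To make the idea of probabilistic identification concrete, the following Python sketch groups records by their quasi-identifier values and reports how many records share each combination. It rests on assumptions that are not part of the discussion above: the field names and sample records are invented, and using one divided by the group size as the chance of a correct match is a common simplification rather than a complete risk measure.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    keys = [tuple(rec[q] for q in quasi_identifiers) for rec in records]
    return Counter(keys)


def match_probability(record, records, quasi_identifiers):
    """Simplified chance of picking the right record among those that look alike.

    Assumption: the intruder knows the target's quasi-identifier values and
    picks uniformly among the matching records.
    """
    sizes = group_sizes(records, quasi_identifiers)
    key = tuple(record[q] for q in quasi_identifiers)
    return 1.0 / sizes[key]


# Invented sample data; the field names are hypothetical.
records = [
    {"sex": "F", "year_of_birth": 1975, "postal_prefix": "K1H"},
    {"sex": "F", "year_of_birth": 1975, "postal_prefix": "K1H"},
    {"sex": "M", "year_of_birth": 1982, "postal_prefix": "K2P"},
]
qi = ["sex", "year_of_birth", "postal_prefix"]

p_shared = match_probability(records[0], records, qi)  # two records look alike
p_unique = match_probability(records[2], records, qi)  # only one such record
```

In this sample the first two records are indistinguishable on the chosen quasi-identifiers, while the third is unique on them, which illustrates why quasi-identifiers do not necessarily single out an individual.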
Sensitive variables. These may be variables that characterize, say, the financial or health status of an individual. It is information that is sensitive and that individuals would normally consider private. If there is no sensitive information in a data set then it is arguable whether there is anything to protect.
It is important to recognize that sensitivity is relative. For example, is "flu" sensitive? If an intruder re-identified a record in a disclosed database and found out that the individual has the flu, does that matter? This is a complicated question that has to do with harm, but a simple consideration is that individual patients may feel violated by the fact that their health information could be re-identified, and this may affect their behavior and their level of trust in the custodian, even if the sensitivity of the information that was re-identified is low (also see the discussion in this knowledgebase article).
It should also be noted that the distinctions above will vary depending on the data set and the context. In one instance a variable may be a quasi-identifier and in another it may be a sensitive variable. For example, the diagnosis code can be a quasi-identifier if there is a plausible way for an intruder to get background information about an individual's diagnosis code, which can then be used for re-identification; otherwise it would be a sensitive variable. Similarly, in one instance a variable may be a directly identifying variable and in another it may be a quasi-identifier. For example, a health insurance card number is directly identifying if the intruder has access to a database of patients and their health insurance card numbers. If the intruder does not have access to such a database then it would not be a direct identifier. In practice, however, health insurance card numbers are useful for fraudulent purposes and therefore should not be disclosed anyway, although the primary driver for not disclosing that information may not necessarily be the risk of re-identification of individuals.
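The context dependence described above can be written as a small decision function. This Python sketch is an assumption-laden simplification: the three boolean inputs, the order in which they are checked, and the "unclassified" fallback are all hypothetical, and the real judgment about what an intruder can plausibly obtain is not something code can make.

```python
def classify_variable(lookup_source_available, background_plausible, is_sensitive):
    """Assign a variable to a category for one disclosure context.

    All three inputs are hypothetical judgments supplied by the analyst:
    - lookup_source_available: the intruder has ready access to a source
      (for example a database) that maps values directly to individuals.
    - background_plausible: the intruder could plausibly obtain background
      information on this variable about individuals in the data set.
    - is_sensitive: individuals would normally consider the value private.
    """
    if lookup_source_available:
        return "directly identifying"
    if background_plausible:
        return "quasi-identifier"
    if is_sensitive:
        return "sensitive"
    return "unclassified"  # assumption: placeholder for anything else


# Diagnosis code: quasi-identifier only if background knowledge is plausible.
dx_with_background = classify_variable(False, True, True)
dx_without_background = classify_variable(False, False, True)

# Health insurance card number: direct identifier only if a lookup database exists.
card_with_database = classify_variable(True, False, False)
```

The same variable lands in different categories as the inputs change, which is the point of the examples above: the classification belongs to the variable in a given context, not to the variable alone.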
The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.