The primary type of disclosure risk to focus on is identity disclosure. This type of risk assumes that there is an intruder who has two pieces of information: (a) the actual data set that has been disclosed and (b) some background information about one or more people in this data set. The background information is described by a set of variables. These variables are the quasi-identifiers.
Examples of common quasi-identifiers in the context of health information are:
- dates (such as birth, death, admission, discharge, visit, and specimen collection),
- locations (such as postal codes, hospital names, and regions),
- race and ethnicity,
- languages spoken,
- aboriginal status,
- gender.
This set of variables may expand as more data are collected about the public and more public registries are made available. For example, in many jurisdictions in the US, voter lists are publicly available. This means that the basic demographics included in the voter list (such as full ZIP code, date of birth, and gender) are automatically considered quasi-identifiers.
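The voter-list scenario can be illustrated with a minimal sketch of exact-match linkage. It assumes the intruder holds a registry with names plus the same quasi-identifiers that appear in the disclosed data. The field names (`zip`, `date_of_birth`, `gender`, `name`), the function `link_records`, and the sample records are all hypothetical and made up for illustration; real linkage attacks may also use approximate matching.

```python
from collections import defaultdict

# Hypothetical field names; real voter lists and disclosed data sets will differ.
QUASI_IDENTIFIERS = ("zip", "date_of_birth", "gender")


def link_records(disclosed, registry, keys=QUASI_IDENTIFIERS):
    """Map a disclosed record's index to a name when exactly one registry entry matches."""
    # Index the registry by its quasi-identifier values.
    index = defaultdict(list)
    for person in registry:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = {}
    for i, record in enumerate(disclosed):
        candidates = index.get(tuple(record[k] for k in keys), [])
        # A single candidate means the record is re-identified on these keys.
        if len(candidates) == 1:
            matches[i] = candidates[0]
    return matches


# Made-up illustrative data, not real people.
registry = [
    {"name": "Person A", "zip": "12345", "date_of_birth": "1970-02-01", "gender": "F"},
    {"name": "Person B", "zip": "12345", "date_of_birth": "1981-07-19", "gender": "M"},
    {"name": "Person C", "zip": "12345", "date_of_birth": "1981-07-19", "gender": "M"},
]
disclosed = [
    {"zip": "12345", "date_of_birth": "1970-02-01", "gender": "F", "diagnosis": "X"},
    {"zip": "12345", "date_of_birth": "1981-07-19", "gender": "M", "diagnosis": "Y"},
]

matches = link_records(disclosed, registry)
print(matches)
```

In this toy data the first disclosed record matches exactly one registry entry and is linked to a name, while the second matches two entries and so cannot be linked with certainty on these variables alone.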
Individuals may also post personal information about themselves on web sites or announce that information to their friends. For example, many new parents announce the exact birth weight of their new child. When we examined birth registries, we found that weight, hospital of birth, and age of mother make most births unique. Therefore, birth weight is quite a powerful quasi-identifier.
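A uniqueness check of this kind can be sketched as follows. This is only an illustration under stated assumptions, not the analysis that was run on the birth registries: the function name `fraction_unique`, the field names, and the four sample records are hypothetical.

```python
from collections import Counter


def fraction_unique(records, quasi_identifiers):
    """Fraction of records whose combination of quasi-identifier values occurs exactly once."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / len(records)


# Made-up illustrative records; field names are assumptions.
births = [
    {"weight_g": 3200, "hospital": "H1", "mother_age": 29},
    {"weight_g": 3200, "hospital": "H1", "mother_age": 29},
    {"weight_g": 3410, "hospital": "H1", "mother_age": 29},
    {"weight_g": 2950, "hospital": "H2", "mother_age": 34},
]

by_hospital = fraction_unique(births, ["hospital"])
by_all_three = fraction_unique(births, ["weight_g", "hospital", "mother_age"])
print(by_hospital, by_all_three)
```

Adding variables to the combination can only keep or increase the number of distinct value combinations, which is why a precise value such as birth weight tends to push records toward uniqueness.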
It is simply wrong to make a general statement that all variables in a disclosed data set are quasi-identifiers. The reason is that it is often not plausible for an intruder to gain background information about all variables in a data set. For example, using legitimate means in Canada, it would be very difficult for an intruder to obtain all of the diagnosis codes for a patient and use these for re-identification. Similarly, it is very difficult for an intruder to get a complete set of lab test result values and use these for re-identification.
One of the skills required in managing re-identification risk is deciding which quasi-identifiers are plausible and need to be considered. This analysis looks at the jurisdiction and the types of information that are publicly available.
To summarize, a quasi-identifier is a piece of information that an intruder can obtain about a specific target individual, or about a large number of people, through the following means:
- Personal knowledge of the specific target person (e.g., a neighbor, co-worker, ex-spouse).
- Publicly available information about a specific target person who is famous.
- Publicly available registries (e.g., voter lists and court records) or the media (e.g., obituaries published in newspapers or online).
- Information that individuals post about themselves on the Internet (e.g., information they post on social networking sites).
- Information that individuals often disclose to a large number of people (e.g., their baby’s birth weight or birth date).
It is also important to remember that it is possible to predict a quasi-identifier from another variable. In this case, both of the variables must be considered quasi-identifiers. There is no point in protecting variable A but not variable B if the intruder can easily predict A from B. Therefore, it is important to search for correlated variables in a data set. Examples of correlated variables are:
- Date of birth of a baby and date of discharge from a hospital.
- Date of death and date of an autopsy.
- Weight at birth and weight of baby at discharge from a hospital.
- Age and date of graduation.
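One simple way to search for such pairs is to compute a correlation coefficient for every pair of variables and flag the pairs above a threshold. The sketch below uses the Pearson coefficient, which is a choice made here for illustration and only detects linear relationships; the function names, the 0.9 threshold, the field names, and the sample records are all assumptions. Dates are converted to day counts so that they can be compared numerically.

```python
from datetime import date
from itertools import combinations
from math import sqrt


def to_number(value):
    """Convert dates to ordinal day counts; leave other values as floats."""
    return value.toordinal() if isinstance(value, date) else float(value)


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(records, variables, threshold=0.9):
    """Return variable pairs whose absolute correlation meets the (assumed) threshold."""
    columns = {v: [to_number(r[v]) for r in records] for v in variables}
    return [
        (a, b)
        for a, b in combinations(variables, 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]


# Made-up illustrative records; field names are assumptions.
records = [
    {"birth_date": date(2020, 1, 3), "discharge_date": date(2020, 1, 5), "lab_value": 7.0},
    {"birth_date": date(2020, 2, 10), "discharge_date": date(2020, 2, 12), "lab_value": 3.0},
    {"birth_date": date(2020, 3, 15), "discharge_date": date(2020, 3, 18), "lab_value": 8.0},
    {"birth_date": date(2020, 4, 20), "discharge_date": date(2020, 4, 22), "lab_value": 2.0},
    {"birth_date": date(2020, 5, 25), "discharge_date": date(2020, 5, 27), "lab_value": 6.0},
]

flagged = correlated_pairs(records, ["birth_date", "discharge_date", "lab_value"])
print(flagged)
```

Any pair flagged this way should be reviewed together: if one variable in the pair is treated as a quasi-identifier, the other should be as well. Non-linear or categorical dependencies would need a different measure than the one assumed here.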