Disclosure risk can be characterized as prosecutor risk or journalist risk (see http://www.jamia.org/cgi/content/abstract/15/5/627). These are just colorful names for two common types of risk. Both pertain to the risk of an intruder re-identifying a single individual in the data set that is being disclosed, but they make different assumptions about the nature of the attack. More definitional information and examples are provided in the attached document.
With prosecutor risk we assume that the intruder is trying to re-identify the record belonging to a specific person. This specific person is known to the intruder. For example, this specific target person may be the intruder's neighbor or a famous person. The intruder has some background information about that target, and then uses this background information to search for a matching record in the disclosed database.
If any of the following three conditions is true, then prosecutor risk is a threat:
- The disclosed data set represents the whole population (e.g., a population registry) or has a large sampling fraction. If the whole population is being disclosed, then the intruder can be certain that the target is in the disclosed data set. Similarly, a large sampling fraction means that the target is very likely to be in the disclosed data set.
- The data set is a sample from a population rather than the whole population, but it can be easily determined who is in the disclosed data set. For example, the sample may come from an interview survey conducted in a company where it is generally known who participated because the participants missed half a day of work. In such a case it is known within the company, and therefore to an internal intruder, who is in the disclosed data set.
- The individuals in the disclosed data set self-reveal that they are part of the sample. For example, subjects in clinical trials generally inform their family, friends, and even acquaintances that they are participating in a trial. One of these acquaintances may attempt to re-identify one of the self-revealing subjects. Individuals may also disclose information about themselves on their blogs and social networking pages, which may reveal that they are part of a study or a registry. However, individuals do not always know that their data is in a data set. For example, in studies where consent has been waived, or where patients provide broad authorization for their data or tissue samples to be used in research, the patients may not know that their data is in a specific data set and so have no opportunity to self-reveal their inclusion.
In the above conditions the term population is used loosely. It does not mean the population of a geographic area, but the group of people who have a specific known characteristic. For example, a data set of all patients with renal cancer in a province would be a renal cancer population registry, since everyone with renal cancer in that province would be in the registry. A data set of all patients with a particular disease within a geographic boundary, or of all those who share a particular demographic characteristic (e.g., ethnicity, language spoken at home, age group), would be considered a population, and the data set would therefore meet the first criterion above.
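The three conditions amount to a simple any-of check. The following is a minimal sketch of that check, not a prescribed procedure. The `DatasetProfile` record, its field names, and the `applicable_risk` function are hypothetical, and the 0.5 cut-off for a "large" sampling fraction is purely illustrative, since this article does not define one.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Hypothetical description of a data set about to be disclosed."""
    sampling_fraction: float        # share of the population in the data set, 0..1
    membership_discoverable: bool   # can an intruder easily learn who is in it?
    subjects_self_reveal: bool      # do individuals tend to disclose their inclusion?


def applicable_risk(profile: DatasetProfile, large_fraction: float = 0.5) -> str:
    """Return "prosecutor" if any of the three conditions holds, else "journalist".

    large_fraction is an assumed, illustrative threshold; choose it to suit
    the data being assessed.
    """
    if not 0.0 <= profile.sampling_fraction <= 1.0:
        raise ValueError("sampling_fraction must be between 0 and 1")
    meets_any = (
        profile.sampling_fraction >= large_fraction  # whole population or large sample
        or profile.membership_discoverable           # membership easily determined
        or profile.subjects_self_reveal              # subjects self-reveal
    )
    return "prosecutor" if meets_any else "journalist"


registry = DatasetProfile(1.0, False, False)        # e.g., a disease registry
waived_consent = DatasetProfile(0.02, False, False) # small sample, subjects unaware
clinical_trial = DatasetProfile(0.02, False, True)  # small sample, subjects tell others

print(applicable_risk(registry))        # prosecutor
print(applicable_risk(waived_consent))  # journalist
print(applicable_risk(clinical_trial))  # prosecutor
```

A data set that meets none of the conditions falls through to journalist risk, so the function always returns exactly one of the two labels.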
If a data set does not meet any of the above criteria, then you should be concerned about journalist risk rather than prosecutor risk (in practice, one or the other applies, not both). The distinction between the two types of risk is important because they are measured or estimated differently, and the risk assessment results can differ substantially depending on which type applies.
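To illustrate how the two measurements can diverge, here is a minimal sketch. This article does not give the formulas, so the sketch assumes a formulation that is common in the re-identification literature: prosecutor risk is taken as 1/f, where f is the number of records in the disclosed sample that share the target's quasi-identifier values, and journalist risk as 1/F, where F is the size of the matching group in the population. The function names and the toy records are hypothetical.

```python
from collections import Counter


def class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def prosecutor_risk(sample, quasi_identifiers):
    """Worst case when the target is known to be in the sample: max of 1/f."""
    f = class_sizes(sample, quasi_identifiers)
    return max(1 / size for size in f.values())


def journalist_risk(sample, population, quasi_identifiers):
    """Worst case when membership is unknown: max of 1/F over classes in the sample."""
    f = class_sizes(sample, quasi_identifiers)
    F = class_sizes(population, quasi_identifiers)
    missing = [key for key in f if F[key] == 0]
    if missing:
        raise ValueError(f"sample classes absent from population: {missing}")
    return max(1 / F[key] for key in f)


qi = ("age_group", "sex")
population = (
    [{"age_group": "30-39", "sex": "F"}] * 3
    + [{"age_group": "40-49", "sex": "M"}] * 2
    + [{"age_group": "50-59", "sex": "F"}]
)
sample = [
    {"age_group": "30-39", "sex": "F"},
    {"age_group": "40-49", "sex": "M"},
    {"age_group": "40-49", "sex": "M"},
]

print(prosecutor_risk(sample, qi))              # 1.0
print(journalist_risk(sample, population, qi))  # 0.5
```

The same disclosed records yield a risk of 1.0 under the prosecutor assumption and 0.5 under the journalist assumption. In practice the full population table is rarely available, so the population group sizes usually have to be estimated rather than counted as they are here.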
Custodians often hold one particular type of data. Once they have decided whether their data falls under prosecutor or journalist risk, they can therefore apply that type of risk assessment going forward.
The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material.