With journalist risk, the intruder is not looking for a specific person in the disclosed data set; re-identifying any person will achieve the goal. A classic example is a reporter who goes through a leaked medical database to find someone with a sensitive disease or condition. Once the reporter finds such a record, he or she will try to re-identify the person it describes. The reporter did not have a specific person in mind to start with.
We assume that the reporter has access to another database that can be used for matching. For example, in the US this database may be a voter list, which contains basic demographics together with each voter's name and address. The intruder matches the records in this database against the disclosed data set, and each correct match re-identifies an individual. The database used for matching is called an identification database.
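The matching step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not a description of any particular intruder's method: the field names (`date_of_birth`, `gender`, `postal_code`), the function `match_records`, and all of the records are hypothetical and invented for the example. It performs a simple exact match on the shared fields and keeps only disclosed records that correspond to exactly one entry in the identification database.

```python
from collections import defaultdict

# Hypothetical fields present in both data sets (an assumption for illustration).
MATCH_FIELDS = ("date_of_birth", "gender", "postal_code")


def match_records(disclosed, identification_db, fields=MATCH_FIELDS):
    """Map each disclosed record index to a name when exactly one
    identification-database entry shares all of the match fields."""
    # Index the identification database by the combination of match fields.
    index = defaultdict(list)
    for person in identification_db:
        key = tuple(person[f] for f in fields)
        index[key].append(person["name"])

    matches = {}
    for i, record in enumerate(disclosed):
        candidates = index.get(tuple(record[f] for f in fields), [])
        # Only a unique candidate is treated as a match; ties stay ambiguous.
        if len(candidates) == 1:
            matches[i] = candidates[0]
    return matches


# Invented example data: a disclosed data set without names ...
disclosed = [
    {"date_of_birth": "1970-03-14", "gender": "F", "postal_code": "K1A",
     "diagnosis": "condition A"},
    {"date_of_birth": "1985-11-02", "gender": "M", "postal_code": "M5V",
     "diagnosis": "condition B"},
]

# ... and an identification database that carries names.
identification_db = [
    {"name": "Person One", "date_of_birth": "1970-03-14", "gender": "F",
     "postal_code": "K1A"},
    {"name": "Person Two", "date_of_birth": "1985-11-02", "gender": "M",
     "postal_code": "M5V"},
    {"name": "Person Three", "date_of_birth": "1985-11-02", "gender": "M",
     "postal_code": "M5V"},
]

print(match_records(disclosed, identification_db))  # {0: 'Person One'}
```

In this invented data, the second disclosed record matches two people and is therefore left unresolved. Note also that a unique match is only a candidate: it is a correct re-identification only if the person in the disclosed data set really is the one listed in the identification database.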
The quasi-identifiers under journalist risk are those that can be found in the identification database. There are three different types of identification databases:
- A public database that is available freely and without conditions.
- A semi-public database that may require a fee to access or that has conditions on its uses.
- A private database, which is in the possession of the intruder or can be bought from a data broker for a fee.
It is difficult to predict all of the private databases that may exist; therefore, we often focus on the public and semi-public ones. Our research (see http://www.jmir.org/2006/4/e28) has documented the quasi-identifiers that are available in Canada through public sources and that can be used for re-identification. These variables include:
- Date of birth and date of death.
- Profession.
- Home address and telephone number.
- Type of dwelling. This information can be obtained in aggregate from Statistics Canada or by looking at Google Maps.
- Gender. If an identification database does not include gender, genderizing software can predict it from the first name.
- Ethnicity. If an identification database does not include ethnicity, name-analysis software can predict it from the first and last names.
- Incomes for highly paid civil servants.
In Canada, the voter lists are not publicly available. Therefore, only specific groups of people are really at risk of re-identification through a journalist-type attack using public sources of information: (a) homeowners, (b) members of a profession that publishes its membership lists, and (c) civil servants. Examples of specific public sources (many of these are linked from our main web site: http://www.ehealthinformation.ca/ap0/datasources.asp) are:
- Obituaries. These are available from newspapers, funeral homes, specialized tombstone sites, and obituary aggregation sites.
- The Personal Property Security Registration (PPSR) database. This is available from provincial governments, either directly on-line or through a local agent.
- Land Registry. This contains information on house ownership.
- Professional membership lists, for example, for doctors and lawyers.
- Salary disclosure reports from governments.
- White pages. These do not include cell phone numbers (yet).
- On-line CVs. Job sites provide a lot of very detailed information. Individuals also post CVs on their personal sites and pages.
- Donations. These include donations to political parties, which are disclosed by Elections Canada on its web site.
Some of the above data sources become useful as identification databases only when they are linked together rather than used on their own.
It should also be kept in mind that many public data sources charge a fee, so creating a meaningful identification database can be quite expensive. This deters an intruder from attempting journalist-type re-identification unless the re-identification will bring even higher returns. As an example, searching the PPSR in Ontario costs $8 each time. Therefore, creating an identification database for active Ontario physicians (~23,000) would cost $184,000. This makes it important to consider the plausibility of creating certain types of identification databases when evaluating journalist risk. For many intruders, this kind of expense would not be worth it.
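The cost arithmetic above can be reproduced with a short calculation. This is a minimal sketch that assumes one paid search per person; the function names `identification_db_cost` and `attack_is_worthwhile` are hypothetical, and the expected return passed in the example is an invented figure used only to show the comparison.

```python
def identification_db_cost(num_people, fee_per_search):
    """Total cost of building an identification database,
    assuming one paid search per person."""
    return num_people * fee_per_search


def attack_is_worthwhile(cost, expected_return):
    """An attack is only plausible if the expected return exceeds its cost."""
    return expected_return > cost


# Figures from the text: about 23,000 active Ontario physicians at $8 per search.
cost = identification_db_cost(23_000, 8)
print(f"${cost:,}")  # $184,000

# Invented expected return, purely to illustrate the comparison.
print(attack_is_worthwhile(cost, expected_return=10_000))  # False
```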
The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.