Sampling means drawing a subset of the rows from the data set and disclosing that subset instead of the complete data set. Sampling is sometimes used because it thwarts a prosecutor-type attack by making it difficult for an intruder to know whether a specific target individual is in the disclosed data set. For example, suppose we want to disclose a breast cancer data set and the intruder knows that Alice has breast cancer. If the full population registry of breast cancer patients were disclosed, the intruder would know that it contains a record belonging to Alice. However, if only a sample is disclosed, the intruder would not know whether Alice’s record is in the disclosed sample. Therefore, if a record in the sample matches Alice’s particulars (say postal code and date of birth), the intruder cannot be sure that it is truly Alice’s record. Such uncertainty would in principle deter an intruder from even attempting to re-identify Alice’s record.
Sampling is only effective if the sampling fraction is relatively small. For example, if the sampling fraction is 99% then the disclosed data set is almost as good as the population registry. A 25% sample, by contrast, would create considerable uncertainty as to whether Alice is in the sample or not.
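The effect of the sampling fraction can be illustrated with a small simulation. This is only a sketch under stated assumptions: it assumes simple random sampling without replacement, in which case the chance that any given record is included equals the sampling fraction. The function name, the population size of 1000, and the use of index 0 to stand in for Alice's record are all hypothetical choices made for illustration.

```python
import random

def inclusion_rate(population_size, sampling_fraction, trials=2000, seed=0):
    """Estimate how often one fixed target record lands in a simple random sample."""
    rng = random.Random(seed)
    sample_size = round(population_size * sampling_fraction)
    target = 0  # hypothetical index standing in for Alice's record
    hits = 0
    for _ in range(trials):
        # Draw a sample without replacement and check whether the target is in it.
        sample = rng.sample(range(population_size), sample_size)
        if target in sample:
            hits += 1
    return hits / trials

rate_25 = inclusion_rate(1000, 0.25)  # 25% sample
rate_99 = inclusion_rate(1000, 0.99)  # 99% sample
print(f"25% sample: target included in about {rate_25:.2f} of draws")
print(f"99% sample: target included in about {rate_99:.2f} of draws")
```

With a 99% sample the target is almost always present, so the intruder loses very little by assuming Alice is in the data. With a 25% sample the target is absent in roughly three draws out of four.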
There are three problems with sampling, however. The first is that many data users, for example, researchers, would be very upset if a large data set existed and they were only given a small subset of it for their analysis. The smaller sample means a reduction in statistical power, and hence, the ability to detect statistically significant relationships or effects is reduced. If the power is too low then it may not be worth doing an analysis at all.
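The loss of power can be made concrete with a rough calculation. The sketch below uses the normal approximation for a two-sided, two-group comparison of means; that test, the standardized effect size of 0.2, and the group sizes are assumptions chosen for illustration, not figures from any particular study. The function name approx_power is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test (normal approximation)."""
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    shift = effect_size * sqrt(n_per_group / 2)  # expected test statistic under the alternative
    # Probability that the test statistic falls in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

full_power = approx_power(400, 0.2)     # hypothetical full data set: 400 per group
quarter_power = approx_power(100, 0.2)  # 25% sample: 100 per group
print(f"full data set: power is about {full_power:.2f}")
print(f"25% sample:    power is about {quarter_power:.2f}")
```

Under these assumptions, power falls from roughly 0.81 with the full data set to roughly 0.29 with the 25% sample, which is the kind of reduction that can make an analysis not worth doing.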
The second issue arises when individuals are known to be unique in the population on the variables included in the sampled data set. For example, if Canadians are unique on their postal code and date of birth, and these two variables are included in the sampled data set, then the intruder may still look for Alice’s record in the disclosed sample. If the intruder finds a match then the intruder will know with certainty that this record belongs to Alice. If the intruder does not find a match then the intruder will know that Alice was not included in the sample. Therefore, sampling does not provide protection when a sample unique is known to be a population unique.
The third issue is that sampling does not protect against journalist risk. Here the intruder is not targeting a specific individual but may attempt to match the disclosed sample with another public database. Depending on the variables that are included in the disclosed sample, the intruder may find a correct match and re-identify individuals. In that case the sampling fraction has little impact on the probability of a correct match.
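A short simulation can illustrate why the sampling fraction matters so little here. This is a sketch under simplifying assumptions, not a definitive risk model: it assumes the public database covers the whole population, that records are matched on a single combined quasi-identifier value, and that an intruder who finds F public records sharing that value picks one at random, so the match is correct with probability 1/F. The function name and the synthetic population are hypothetical.

```python
import random
from collections import Counter

def mean_match_probability(population_qis, sampling_fraction, seed=0):
    """Average chance of a correct match for records in a simple random sample."""
    rng = random.Random(seed)
    # Number of people in the public database sharing each quasi-identifier value.
    class_sizes = Counter(population_qis)
    sample_size = round(len(population_qis) * sampling_fraction)
    sample = rng.sample(population_qis, sample_size)
    # Assumed model: a sampled record is matched correctly with probability 1/F.
    return sum(1 / class_sizes[qi] for qi in sample) / sample_size

# Synthetic population: 10,000 people spread over 2,000 quasi-identifier values.
pop_rng = random.Random(42)
population = [pop_rng.randrange(2000) for _ in range(10000)]

for fraction in (0.1, 0.25, 0.5, 0.9):
    probability = mean_match_probability(population, fraction)
    print(f"sampling fraction {fraction:.2f}: mean match probability {probability:.3f}")
```

Under this model the class sizes are determined by the public database rather than by the sample, so the average match probability stays close to the same value whichever fraction is used.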
Therefore, sampling should be considered as only one of a number of strategies that can be used to manage re-identification risk. It is not sufficient to rely on sampling alone to protect disclosed data from re-identification.
The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.