Postal codes are the smallest geographic unit that Canada Post uses to deliver mail. In a health care context they are the most common geographic unit because they are what patients know and are able to provide. They are therefore collected often.
The relevant re-identification scenario is one in which a data set is being disclosed that contains individuals' postal codes and some sensitive information, say an indicator of whether a person has a sexually transmitted disease. The postal code is the only demographic information disclosed in this data set. Does this represent a high re-identification risk?
The median number of people living in a postal code is quite small, as shown in the table below, which is based on 2006 census data. The first thing to note is the wide variation in postal code sizes, both within and across provinces.
For example, if 20 people live in a postal code, does that represent a re-identification risk? If the postal code is the only demographic information available, an intruder would have to guess, and the probability of correctly guessing that a record belongs to a particular person would be 1 in 20. By most standards used for managing re-identification risk, this would be considered a low probability.
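The arithmetic behind this guess can be written out as a minimal sketch. The function name `guess_probability` is hypothetical, and the sketch assumes that every resident of the postal code is equally likely to be the person behind a record.

```python
from fractions import Fraction


def guess_probability(population: int) -> Fraction:
    """Chance that a random guess links a record to the right person.

    Assumes each resident of the postal code is an equally likely match
    (an illustrative assumption, not a full risk model).
    """
    if population < 1:
        raise ValueError("population must be at least 1")
    return Fraction(1, population)


print(guess_probability(20))         # 1/20
print(float(guess_probability(20)))  # 0.05
```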
|Province/Territory||# Postal codes||Min||25th Percentile||Median||75th Percentile||Max|
But if we look at that table again, we see that 25% of the postal codes in New Brunswick have a population of 3 or fewer. In those postal codes, an intruder guessing that a person matches a record succeeds with a probability of at least 1 in 3, which would be considered a high re-identification risk. That this high risk applies to 25% of the postal codes is a problem. Similarly, 25% of the postal codes in Alberta have a population of 5 or fewer. The smallest postal codes in all provinces and territories have very few people living in them, so any information about such a postal code pertains to a very small number of individuals.
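The kind of check described above can be sketched as follows. The population counts are made-up illustrative values, not census figures, and the helper `share_at_or_below` is a hypothetical name introduced for this example.

```python
import statistics


def share_at_or_below(populations, threshold):
    """Fraction of postal codes whose population is <= threshold."""
    if not populations:
        raise ValueError("populations must not be empty")
    return sum(1 for p in populations if p <= threshold) / len(populations)


# Hypothetical populations for eight postal codes (illustration only).
populations = [2, 3, 3, 8, 12, 15, 20, 40]

# First quartile of the population counts (the 25th percentile).
first_quartile = statistics.quantiles(populations, n=4)[0]

# Share of postal codes where a guess succeeds with probability >= 1 in 3.
share_small = share_at_or_below(populations, 3)

print(first_quartile, share_small)
```

With real data, the list would hold one population count per postal code in a province, and the threshold would follow whatever risk standard applies.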
Therefore, whether a postal code by itself represents a high re-identification risk depends on where in the country one is located. Some postal codes have very few people living in them. In that case, knowing the postal code narrows the options down to so few individuals that there is a good chance a guess will be correct.
At the other extreme, if we have a data set where everyone in a postal code (or a large proportion, say 90% of the people living in that postal code) has a sexually transmitted disease, then it is not necessary to know which record in the data set pertains to a particular individual, because any individual living in that postal code will very likely have the disease.
To summarize, disclosing a data set with only the postal code and some sensitive information can carry a high re-identification risk if:
- The postal code has very few people living in it. "Few" here would typically be defined as five or fewer. There are postal codes with only a handful of people, and in some provinces and territories the percentage of such small postal codes is quite high.
- The postal code has many people living in it, but the disclosed data set indicates that a large proportion of them (say 90%) have the same condition or disease. In that case we can still draw sensitive conclusions about the individuals in that postal code without actually re-identifying their records in the disclosed data set.
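The two conditions above can be combined into a small screening function. This is only a sketch: the name `risk_reasons` is hypothetical, the default thresholds (five people, 90%) are the illustrative figures used in the text rather than fixed standards, and the proportion is assumed to be measured against the full population of the postal code.

```python
def risk_reasons(population, residents_with_condition,
                 small_threshold=5, proportion_threshold=0.9):
    """Return the reasons a postal code may be high risk (possibly none).

    population: number of people living in the postal code.
    residents_with_condition: how many of them the data set shows as
        having the sensitive condition.
    Thresholds are illustrative defaults taken from the text.
    """
    if population < 1:
        raise ValueError("population must be at least 1")
    if not 0 <= residents_with_condition <= population:
        raise ValueError("residents_with_condition must be in [0, population]")

    reasons = []
    # Condition 1: so few residents that a guess is likely to be right.
    if population <= small_threshold:
        reasons.append("small population")
    # Condition 2: almost everyone shares the sensitive value, so no
    # record-level re-identification is needed to infer it.
    if residents_with_condition / population >= proportion_threshold:
        reasons.append("homogeneous sensitive value")
    return reasons


print(risk_reasons(3, 1))      # ['small population']
print(risk_reasons(200, 190))  # ['homogeneous sensitive value']
print(risk_reasons(200, 10))   # []
```

Returning a list of reasons rather than a single flag keeps the two kinds of risk distinct, since a postal code can meet either condition or both.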