Discussions of re-identification risk and de-identification often focus on demographic variables. But many data sets also include diagnosis codes (for example, ICD-10 codes). We answer the question of whether these can be used for re-identification by going through a number of scenarios. In all of the scenarios below, we assume that the disclosed data set contains demographics as well as at least one diagnosis code.
In the US, hospital discharge abstract data is publicly available. Also, many states make their voter lists available for free or for a modest fee. By linking the demographic information in the discharge abstracts with that in the voter lists, it is possible to construct an identification database containing names, addresses, dates of birth, gender, and diagnosis codes for patients. With that kind of background information, it is then possible to match the diagnosis codes and demographics against the same fields in any disclosed data set that contains diagnosis codes. In this case, the diagnosis code effectively becomes yet another quasi-identifier. For example, if a hospital makes a data set of patients available for research, an intruder can build the identification database described above and match it against the research data set. This is a classic example of a journalist-type re-identification attack.
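The two-step linkage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`dob`, `gender`, `zip`, `dx_codes`, `record_id`), the helper functions, and all records are hypothetical, and exact matching on the full set of diagnosis codes is a simplification of what a real intruder would have to do with messier data.

```python
from collections import defaultdict

# Hypothetical demographic quasi-identifiers shared by all three sources.
DEMOGRAPHICS = ("dob", "gender", "zip")


def demo_key(record):
    """Return the demographic quasi-identifier tuple for a record."""
    return tuple(record[field] for field in DEMOGRAPHICS)


def build_identification_db(discharge_abstracts, voter_list):
    """Attach names to abstracts whose demographics match exactly one voter."""
    voters_by_key = defaultdict(list)
    for voter in voter_list:
        voters_by_key[demo_key(voter)].append(voter)

    identified = []
    for abstract in discharge_abstracts:
        candidates = voters_by_key.get(demo_key(abstract), [])
        if len(candidates) == 1:  # ambiguous or missing matches are skipped
            identified.append({
                "name": candidates[0]["name"],
                "demographics": demo_key(abstract),
                "dx_codes": frozenset(abstract["dx_codes"]),
            })
    return identified


def match_disclosed(disclosed, identification_db):
    """Match disclosed records on demographics plus the set of diagnosis codes."""
    index = defaultdict(list)
    for entry in identification_db:
        index[(entry["demographics"], entry["dx_codes"])].append(entry["name"])

    matches = {}
    for record in disclosed:
        key = (demo_key(record), frozenset(record["dx_codes"]))
        names = index.get(key, [])
        if len(names) == 1:  # only unambiguous matches count as re-identified
            matches[record["record_id"]] = names[0]
    return matches


# Entirely synthetic example records.
voters = [
    {"name": "A. Example", "dob": "1970-01-01", "gender": "F", "zip": "12345"},
    {"name": "B. Example", "dob": "1982-05-17", "gender": "M", "zip": "12345"},
]
abstracts = [
    {"dob": "1970-01-01", "gender": "F", "zip": "12345", "dx_codes": ["E11.9", "I10"]},
    {"dob": "1982-05-17", "gender": "M", "zip": "12345", "dx_codes": ["J45.909"]},
]
disclosed = [
    {"record_id": 1, "dob": "1970-01-01", "gender": "F", "zip": "12345", "dx_codes": ["I10", "E11.9"]},
    {"record_id": 2, "dob": "1990-09-09", "gender": "F", "zip": "54321", "dx_codes": ["I10"]},
]

id_db = build_identification_db(abstracts, voters)
matches = match_disclosed(disclosed, id_db)
print(matches)
```

In this sketch the diagnosis codes are treated exactly like the demographic fields: they are part of the matching key, which is what it means for them to act as a quasi-identifier.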
The inclusion of diagnosis codes makes the probability of correct re-identification much higher because a person will typically have more than one (in fact, often many) diagnosis codes in their discharge abstracts. A set of diagnosis codes can make an individual unique.
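One way to see this effect is to compare the proportion of unique records with and without the diagnosis codes in the quasi-identifier set. The sketch below assumes hypothetical field names (`yob`, `gender`, `dx_codes`) and uses synthetic records with illustrative ICD-10-style codes; it is not a full risk measurement method.

```python
from collections import Counter


def _freeze(value):
    """Make list- or set-valued fields hashable and order-insensitive."""
    return frozenset(value) if isinstance(value, (list, set)) else value


def proportion_unique(records, fields):
    """Fraction of records whose combination of values on `fields` occurs once."""
    keys = [tuple(_freeze(record[f]) for f in fields) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Synthetic records: identical demographics, differing sets of diagnosis codes.
records = [
    {"yob": 1960, "gender": "F", "dx_codes": ["I10"]},
    {"yob": 1960, "gender": "F", "dx_codes": ["I10", "E11.9"]},
    {"yob": 1960, "gender": "F", "dx_codes": ["I10", "E11.9", "N18.3"]},
    {"yob": 1960, "gender": "F", "dx_codes": ["I10"]},
]

demographics_only = proportion_unique(records, ["yob", "gender"])
with_diagnoses = proportion_unique(records, ["yob", "gender", "dx_codes"])
print(demographics_only, with_diagnoses)
```

Here no record is unique on demographics alone, but half of the records become unique once the set of diagnosis codes is added to the key.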
The assumption underlying this scenario is that the disclosed data set comes from a particular institution and that the intruder has created an identification database from public information for that same institution. The intruder then does the matching on institution-specific data sets. It is not clear how high the risk would be if there were no institutional information.
The above scenario is unlikely in Canada because discharge abstract data is not easily available, and the organization that releases this information on a national basis, the Canadian Institute for Health Information, implements disclosure control on that data and limits who gets access to it. Furthermore, voter lists are not readily available in Canada (but see the discussion here). This does not mean that journalist risk for discharge abstracts does not exist, only that it is very low and probably not the highest-priority risk to focus on.
Another scenario is when some of the records in the disclosed data set have diagnosis codes for rare and visible diseases/conditions. If the data set has location information as well, such as the postal code or town where the patient lives, then a diagnosis code for a rare and visible disease/condition makes it relatively easy to find the patient. A reporter or investigator can ask local people in that area whether they know a person who has the visible characteristics of the disease/condition. Since it is rare, the individuals who have the disease/condition will stand out. This type of re-identification makes sense only if the disclosed data set has many diagnosis codes and/or other sensitive information; otherwise an intruder would not discover anything new by re-identifying the record. You can find additional discussions on rare and visible diseases and conditions in this article.
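A data custodian could screen for this situation by counting how many records share each combination of location and diagnosis code and flagging the small cells. The sketch below is a rough illustration only: the field names (`postal_code`, `dx_codes`), the threshold of 5, and the records are all assumptions, and rarity within a data set is only a proxy for rarity in the population. Whether a condition is visible still has to be judged separately.

```python
from collections import Counter


def small_cells(records, threshold=5):
    """Return (postal_code, dx_code) cells with fewer than `threshold` records."""
    counts = Counter()
    for record in records:
        for code in set(record["dx_codes"]):  # count each code once per record
            counts[(record["postal_code"], code)] += 1
    return {cell: n for cell, n in counts.items() if n < threshold}


# Synthetic records: one common code, and one code that appears only once.
records = [{"postal_code": "K1A", "dx_codes": ["I10"]} for _ in range(5)]
records.append({"postal_code": "K1A", "dx_codes": ["I10", "Q87.4"]})

flagged = small_cells(records)
print(flagged)
```

The flagged cells are candidates for closer review, for example generalizing the location or suppressing the code.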
The third scenario we will consider is when a diagnosis code can be associated with a genetic marker. For example, if a patient is diagnosed with Huntington's disease, there is a clear genetic marker for it. An intruder who obtains the DNA of a patient and determines that they have Huntington's would then be able to use that as a diagnosis quasi-identifier to re-identify the individual in the disclosed data set. This scenario, however, assumes that the intruder has a specific patient in mind and has that patient's DNA. The disclosed data set must also contain additional sensitive information beyond the fact that the patient has Huntington's; otherwise the intruder has not learned anything new.
Therefore, the answer to the question of whether diagnosis codes can be used to re-identify patients depends on the circumstances described above.
The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.