How can I de-identify longitudinal records?
Created: 24-10-2009 19:00. Last Updated: 23-09-2011 15:16.
At the outset, it is important to make a distinction between three types of longitudinal records that occur often in practice.
The first type consists of the same set of variables collected from all patients at fixed points in time. For example, if function and quality of life data are collected every year as part of a cancer survivor study, then the same variables are collected from every patient each year. For this type of data, existing de-identification algorithms work well, for example k-anonymity algorithms that use generalization and suppression.
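As a rough illustration of generalization and suppression, the following minimal Python sketch coarsens two quasi-identifiers and then drops records whose equivalence class is smaller than k. The field names (`age`, `postal`, `qol_score`), the generalization rules, and the sample records are all assumptions made for this example; production k-anonymity algorithms search for a generalization that minimizes information loss rather than applying one fixed rule.

```python
from collections import Counter


def generalize(record):
    """Coarsen the quasi-identifiers (hypothetical rules for this sketch):
    age becomes a 10-year band, postal code keeps its first 3 characters."""
    low = (record["age"] // 10) * 10
    return (f"{low}-{low + 9}", record["postal"][:3])


def k_anonymize(records, k):
    """Generalize every record, then suppress records whose
    equivalence class contains fewer than k records."""
    keys = [generalize(r) for r in records]
    class_sizes = Counter(keys)
    released = []
    for record, key in zip(records, keys):
        if class_sizes[key] >= k:
            released.append({
                "age_band": key[0],
                "postal_prefix": key[1],
                "qol_score": record["qol_score"],
            })
    return released


# Made-up sample data, not taken from any real study.
records = [
    {"age": 43, "postal": "K1H8L1", "qol_score": 71},
    {"age": 47, "postal": "K1H7W9", "qol_score": 64},
    {"age": 45, "postal": "K1H5B2", "qol_score": 80},
    {"age": 62, "postal": "M5V2T6", "qol_score": 55},
]
released = k_anonymize(records, k=3)
```

With k = 3, the three records that fall in the same age band and postal prefix are released in generalized form, and the fourth record, which is alone in its class, is suppressed.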
In this type of data there are two kinds of quasi-identifiers: the basic ones and the yearly ones. The basic quasi-identifiers will likely consist of demographics that do not change, such as date of birth and gender. The yearly ones would include things like where the patient lives at that point in time and possibly some socio-economic variables. When using one of the existing de-identification algorithms (for identity disclosure control; see the discussion here), the yearly variables should be linked so that each one is de-identified the same way in every year. This linking makes the data easier to analyze, because having the same variable at different levels of generalization in different years, for example, is not very useful from an analysis perspective.
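One way to picture the linking is to choose a single generalization level and apply it to the yearly variable in every year at once, instead of letting each year be generalized independently. The sketch below is only an illustration under assumed names: `postal_by_year`, `birth_year`, `gender`, the prefix-length levels, and the sample patients are hypothetical, and the search simply returns the least generalized level at which every patient shares their full key with at least k - 1 others.

```python
from collections import Counter


def generalize_postal(code, keep):
    """Keep the first `keep` characters and mask the rest."""
    return code[:keep] + "*" * (len(code) - keep)


def linked_key(patient, keep):
    """Combine the basic quasi-identifiers with all yearly postal codes,
    every year generalized at the same level."""
    yearly = tuple(generalize_postal(c, keep) for c in patient["postal_by_year"])
    return (patient["birth_year"], patient["gender"]) + yearly


def choose_linked_level(patients, k, levels=(6, 3, 1, 0)):
    """Return the first (least generalized) prefix length for which every
    equivalence class has at least k patients, or None if none qualifies."""
    for keep in levels:
        sizes = Counter(linked_key(p, keep) for p in patients)
        if min(sizes.values(), default=0) >= k:
            return keep
    return None


# Made-up sample data: two yearly collections per patient.
patients = [
    {"birth_year": 1960, "gender": "F", "postal_by_year": ["K1H8L1", "K1H8L1"]},
    {"birth_year": 1960, "gender": "F", "postal_by_year": ["K1H7W9", "K2P1L4"]},
]
level = choose_linked_level(patients, k=2)
```

Because the level is shared across years, an analyst never has to reconcile, say, a full postal code in year one with a one-character prefix in year two.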
The second type of longitudinal data consists of a series of visits for each patient, one of which is an anchor visit. The difference between the first type of data and visit or encounter data is that each patient can have a different number of visits. This makes it difficult to handle this kind of data with current de-identification algorithms. Having an anchor visit, however, can solve that problem. For instance, for cancer patients the anchor visit would be the visit at which diagnosis occurred. With an anchor we can express all other visits as dates relative to it. For example, the first visit after diagnosis would be recorded as +15 days. The actual date of diagnosis is not disclosed; only the intervals are.
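The conversion to relative dates can be sketched in a few lines of Python. The function name `to_relative_days` and the sample dates are made up for illustration; the only assumption about the data is that each patient has one known anchor date.

```python
from datetime import date


def to_relative_days(anchor, visit_dates):
    """Express each visit as a signed number of days from the anchor visit.
    Visits before the anchor come out negative."""
    return sorted((d - anchor).days for d in visit_dates)


# Hypothetical patient: the diagnosis date is the anchor and is never released.
diagnosis = date(2009, 3, 2)
visits = [date(2009, 3, 17), date(2009, 6, 1), date(2009, 2, 20)]
offsets = to_relative_days(diagnosis, visits)
print(offsets)  # [-10, 15, 91]
```

Only `offsets` would be disclosed; the anchor date and the original visit dates stay with the data custodian.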
If the only demographic information that needs to be analyzed for each visit is its date, and there is an anchor visit, then converting all dates to relative dates makes it unnecessary to de-identify the visit data. The reasoning is that an intruder would be very unlikely to know something as specific as intervals between visits to launch a re-identification attack, and intervals are less likely to be unique than actual dates.
The third type of longitudinal data is the same as the one above except that there is no anchor visit and/or it is necessary to disclose additional demographics for each visit. A good example of this is EMR data. Here there is no real anchor event to use across all patients, and often other demographics are useful to disclose about each visit, for example, where the patient lives, how many children they have, whether they are married - all of which can change from visit to visit.
For the third type of data, a number of new algorithms are being developed specifically to handle it, but this work is still at the research stage.
The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.