Why can't we just add noise to the data to de-identify it?

Created: 30-10-2009 19:00 Last Updated: 23-09-2011 15:52

A method that is sometimes used to de-identify data sets is to add noise to the values of the variables. For example, a random number of days is added to a date of birth to create a perturbed date of birth. Noise can also be added to location data by moving a postal code to a randomly selected adjacent postal code.
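As a rough illustration of date perturbation, the sketch below shifts a date of birth by a random whole number of days. The function name `perturb_date`, the `max_shift_days` parameter, and the 30-day window are assumptions made for this example; they are not taken from any particular de-identification tool.

```python
import random
from datetime import date, timedelta

def perturb_date(d, max_shift_days=30, rng=random):
    """Shift a date by a random whole number of days.

    The shift is drawn uniformly from [-max_shift_days, +max_shift_days].
    The window size is an illustrative assumption, not a recommendation.
    """
    shift = rng.randint(-max_shift_days, max_shift_days)
    return d + timedelta(days=shift)

# A seeded generator keeps the example reproducible.
rng = random.Random(0)
dob = date(1959, 1, 1)
perturbed = perturb_date(dob, max_shift_days=30, rng=rng)

# The recipient sees only `perturbed` and cannot tell how far it is from `dob`.
print(dob, "->", perturbed)
```

The perturbed value looks like an ordinary date, so nothing in the released data signals how large the error on any single record is.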

In practice, we have found that data recipients do not like this approach to de-identification because they can no longer trust the data. For example, if the data show a 50-year-old male with cancer, it would not be known whether he was really 50, 55, or 45 years old. The shift in age may make a difference in the analysis and in the conclusions drawn. Data recipients are concerned about drawing incorrect conclusions from the data because of the perturbation.

The same mistrust issues come up with another technique called "microaggregation". The basic idea here is to identify a cluster of similar records in the data set and then replace the actual values with the average (or median) of that cluster. For example, the age would be replaced with the average age of the cluster. This is similar to the approach called "hot-deck imputation" that is used to deal with missing data. Again, the data recipients' reaction has been that they cannot trust the values. If a cluster inadvertently contains an outlier or an influential observation, then the average may be distorted excessively, and incorrect or inaccurate conclusions may be drawn.
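The outlier problem can be seen in a minimal sketch of microaggregation on a single variable. The clustering rule used here (sort the values and cut them into groups of `k` neighbours) is a simplifying assumption for illustration; real microaggregation methods choose clusters more carefully. The function name and the ages are hypothetical.

```python
from statistics import mean

def microaggregate(values, k=3):
    """Replace each value with the mean of its cluster.

    Clusters are formed by sorting the values and grouping k neighbours
    at a time (an illustrative assumption, not a specific published method).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [None] * len(values)
    for start in range(0, len(order), k):
        cluster = order[start:start + k]
        # Merge a short trailing cluster into the previous one so that
        # every cluster holds at least k records.
        if len(cluster) < k and start > 0:
            cluster = order[start - k:]
        avg = mean(values[i] for i in cluster)
        for i in cluster:
            result[i] = avg
    return result

# Hypothetical ages; 94 is an outlier that lands in the second cluster.
ages = [44, 45, 46, 50, 51, 94]
released = microaggregate(ages, k=3)
print(released)
```

With these values the second cluster averages (50 + 51 + 94) / 3 = 65, so the 50-year-old and the 51-year-old are both released as 65.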

For the same reason, we have found data recipients and analysts reluctant to use synthetic data. In principle, synthetic data is not real data and therefore there are no identity disclosure risks with releasing it. Also, in principle, the basic (bivariate) correlational structure of the data is maintained in the synthetic data. But if an analysis is complex, the distributions are non-standard, and the multivariate correlation structure is not captured in the synthetic version, then some relationships may go undetected or be detected incorrectly in the synthetic version.
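A toy example may help show how a relationship can be lost. The sketch below builds hypothetical "real" data in which `y` depends on `x` through a U-shaped curve, then generates synthetic records from a deliberately naive synthesizer that matches only the means, standard deviations, and Pearson correlation. Both the data and the synthesizer are assumptions made for illustration; practical synthetic data generators are more sophisticated.

```python
import random
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical "real" data: y is fully determined by x, but not linearly.
xs = [i / 10 for i in range(-50, 51)]
ys = [x * x for x in xs]
real_r = pearson(xs, ys)  # close to zero despite the strong dependence

def synthesize(xs, ys, n, seed=0):
    """Naive synthesizer (assumption): bivariate normal draws that match
    the means, standard deviations and correlation of the real data."""
    rng = random.Random(seed)
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    r = pearson(xs, ys)
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (r * z1 + (1 - r * r) ** 0.5 * z2)
        records.append((x, y))
    return records

synthetic = synthesize(xs, ys, n=2000)
syn_r = pearson([x for x, _ in synthetic], [y for _, y in synthetic])

def share_high_y(records):
    """Among records with |x| > 4, the share that have y > 16."""
    extreme = [y for x, y in records if abs(x) > 4]
    return sum(y > 16 for y in extreme) / len(extreme)

real_share = share_high_y(list(zip(xs, ys)))
syn_share = share_high_y(synthetic)
print(real_share, syn_share)
```

In the real data every record with |x| > 4 has y > 16, whereas in the synthetic data only a minority do, even though the bivariate correlation is reproduced. An analyst working only from the synthetic version would not detect the U-shaped relationship.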

The approach that is used more often, at least in the context of health data sets, is to generalize the variables. For example, the date of birth of a cancer patient born on 1st January 1959 may be generalized to just January 1959, or even just 1959. The generalized value is still true, but it has less precision. Therefore, the data can be trusted and the risk of drawing incorrect conclusions is reduced.
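Generalization of a date can be sketched in a few lines. The function name `generalize_date` and the `level` values are hypothetical, chosen only for this example.

```python
from datetime import date

def generalize_date(d, level="month"):
    """Return a coarser but still truthful representation of a date.

    level="month" keeps year and month; level="year" keeps only the year.
    """
    if level == "month":
        return f"{d.year:04d}-{d.month:02d}"
    if level == "year":
        return f"{d.year:04d}"
    raise ValueError(f"unknown level: {level!r}")

dob = date(1959, 1, 1)
print(generalize_date(dob, "month"))  # 1959-01
print(generalize_date(dob, "year"))   # 1959
```

Unlike a perturbed date, each generalized value contains the true date of birth, so an analysis at month or year resolution is not working with altered values.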



The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.