How can I safely release data to multiple researchers - scenario I?

Created: 05-02-2010 19:00 Last Updated: 23-09-2011 15:19

First, let's consider the scenario. We have a data custodian who wants to disclose data to researcher A and researcher B. Each researcher will get a different set of variables, but the two data sets pertain to the same individuals/patients. This is a rather simple scenario because the data sets given to the two researchers have no variables in common.

We then assume that A and B will collude: they will bring their two data sets together and try to match the records. If they are successful, then the variables in both data sets will be known for all of the individuals in the data. This may not have been the intention of the data custodian (i.e., researcher A may not have had permission to see the variables given to researcher B).

Of course, in the above scenario you can replace researcher with any other type of data recipient, such as government department, journalist, or a combination of data recipient types. The same principles apply.

The way to handle this particular scenario is to shuffle the records that are disclosed to each researcher. This way, if the researchers try to match the records in the two data sets, they cannot use positional information (the order of the records in each file) for that purpose.
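As a minimal sketch of the shuffling step, assuming the records are held in memory as a list of dictionaries: the function name `split_and_shuffle` and the variable names in the usage note are hypothetical, chosen only for illustration. Note that any record identifier or row number would also have to be left out of both releases, since it would allow matching regardless of order.

```python
import random


def split_and_shuffle(records, vars_a, vars_b, rng=None):
    """Build one release per researcher and shuffle each independently.

    records: list of dicts, one per individual (assumed layout).
    vars_a, vars_b: variable names to release to researcher A and B.
    rng: optional random.Random instance; defaults to the operating
         system's randomness source so the order is not reproducible.
    """
    rng = rng or random.SystemRandom()

    # Keep only the variables each researcher is permitted to see.
    release_a = [{v: r[v] for v in vars_a} for r in records]
    release_b = [{v: r[v] for v in vars_b} for r in records]

    # Shuffle each release separately so row position carries no link
    # between the two data sets. The input list is left unchanged.
    rng.shuffle(release_a)
    rng.shuffle(release_b)
    return release_a, release_b
```

For example, `split_and_shuffle(records, ["age", "sex"], ["diagnosis"])` would return two lists covering the same individuals in unrelated orders.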

If the data sets given to both researchers pertain to exactly the same individuals and they are both of size N, then if the two researchers try to match the records randomly the proportion of records that would be correctly matched is 1/N, on average.
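The 1/N figure follows from linearity of expectation. As a short derivation, let X_i (a symbol introduced here, not in the original) indicate that record i is matched correctly when the two data sets are paired by a uniformly random one-to-one matching, so that each record is correct with probability 1/N:

```latex
\[
\mathbb{E}\!\left[\frac{1}{N}\sum_{i=1}^{N} X_i\right]
= \frac{1}{N}\sum_{i=1}^{N} \Pr(X_i = 1)
= \frac{1}{N}\cdot N \cdot \frac{1}{N}
= \frac{1}{N}.
\]
```

Equivalently, the expected number of correct matches is 1, whatever the value of N.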

If researcher A had N records on N people and researcher B had n records on n of those same people, where n < N, then the proportion of B's records that would be correctly matched is still 1/N, on average. The expected number of correct matches is therefore n/N, so the proportion of A's records that would be matched correctly is n/N^2, on average.
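The two averages above can be checked numerically. The following is a rough sketch, assuming B's individuals are a subset of A's and that the colluding researchers pair each of B's records with a distinct, randomly chosen record of A's; the function names are hypothetical.

```python
import random


def expected_match_rates(N, n):
    """Average proportion of B's and of A's records matched correctly."""
    return 1 / N, n / N**2


def simulate_random_matching(N, n, trials, rng):
    """Estimate the same two proportions by repeated random matching."""
    correct = 0
    for _ in range(trials):
        # Individuals 0..N-1 are in A's data; n of them are in B's data.
        b_people = rng.sample(range(N), n)
        # Each of B's records is paired with a distinct random A record.
        guesses = rng.sample(range(N), n)
        correct += sum(p == g for p, g in zip(b_people, guesses))
    mean_correct = correct / trials
    return mean_correct / n, mean_correct / N
```

Calling `simulate_random_matching(20, 10, 20000, random.Random(0))` should give values close to those from `expected_match_rates(20, 10)`, that is, close to 0.05 and 0.025.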

Therefore, as the size of the disclosed data sets increases, the proportion of correct matches decreases. Furthermore, under this scenario the intruder (here, the colluding researchers) would not know which records were correctly matched, only that a certain proportion of them were.

If the proportion of records that will be matched successfully after shuffling is small, and if the researchers will not know which records were matched successfully, then this acts as a strong disincentive to match the two data sets. Therefore, always shuffle your data before disclosure.



The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.