KMP http://www.ehealthinformation.ca/knowledgebase/ en-us KnowlageBase RSS Generator

What de-identification software tools are there? http://www.ehealthinformation.ca/knowledgebase/article/AA-00118 There are five de-identification tools that are generally available. These tools work on structured data. There are other tools that focus specifically on free-form text, but these are not covered here.

It is also important to distinguish between de-identification tools and masking tools. The latter do not provide adequate protection for personal information. There are many masking tools on the market (about two dozen vendors, whose tools vary widely in functionality). The difference between de-identification and masking is described in more detail in this article: http://www.ehealthinformation.ca/documents/parat/riskdeid.pdf

Beyond the five de-identification tools described below, the tools that exist are internal to organizations and therefore are not generally available, or have been developed for personal use (by researchers) and therefore have not been applied broadly.

The five generally available de-identification tools are:

  • mu-Argus, developed by the Netherlands national statistical agency. More information about mu-Argus can be found here:
    http://neon.vb.cbs.nl/casc/Software/MuManual4.2.pdf

    and the tool itself can be downloaded from here: http://neon.vb.cbs.nl/casc/Software/MU420_B1.zip


  • The Cornell Anonymization Toolkit (CAT) implements a k-anonymity algorithm. It is an open source tool available here: http://sourceforge.net/projects/anony-toolkit/
    with documentation available here: http://www.cs.cornell.edu/bigreddata/publications/2009/sigmod2009-p1051-xiao.pdf


  • The University of Texas at Dallas Anonymization Toolbox, which contains open source Java implementations of some k-anonymity and attribute disclosure control algorithms, with documentation: http://cs.utdallas.edu/dspl/cgi-bin/toolbox/index.php


  • The sdcMicro package in R provides some basic de-identification functions. You can download it from here:
    http://cran.r-project.org/web/packages/sdcMicro/


Tools Assessment

PARAT from Privacy Analytics, the fifth tool, is the only one that is commercially available and actively supported. As a point of comparison, the algorithm implemented in PARAT has been shown in a recent article to perform better than the algorithm implemented in CAT (see http://www.jamia.org/cgi/content/short/16/5/670). Furthermore, the risk estimator used in PARAT has been shown to produce more accurate de-identification results than the one incorporated in mu-Argus (see http://www.jamia.org/cgi/content/abstract/15/5/627).

The UTD toolbox includes some of the same algorithms as CAT. This toolbox contains a set of capabilities rather than a tool that is ready to use by an end-user (e.g., an analyst), and therefore is targeted more at developers. It is also not actively supported as a product.

We spent some time evaluating the CAT tool and found a significant number of usability issues. For example, we were unable to find where the value of k for the k-anonymity algorithm is defined, it was not possible to view data by equivalence class, and the data views showed the same record id every 60 records. The tool could not import standard data files, and the lack of documentation and support made it difficult to use. We also found it quite buggy. While this may have been adequate for completing a Master's thesis project, the tool clearly lacks the functionality and robustness needed for broader use.
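
To make concrete what "viewing data by equivalence class" and "the value of k" refer to, here is a minimal Python sketch. It is not taken from CAT or any of the tools above. The field names, the toy records, and the choice of quasi-identifiers are all hypothetical. An equivalence class is taken to be the set of records that share the same values on every quasi-identifier, and a data set is treated as k-anonymous when every such class has at least k records.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same values on all quasi-identifiers."""
    classes = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        classes[key].append(record)
    return dict(classes)


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every equivalence class contains at least k records."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(len(members) >= k for members in classes.values())


# Hypothetical toy data; the columns are illustrative assumptions only.
records = [
    {"age_range": "30-39", "postal_prefix": "K1H", "diagnosis": "asthma"},
    {"age_range": "30-39", "postal_prefix": "K1H", "diagnosis": "diabetes"},
    {"age_range": "40-49", "postal_prefix": "K2P", "diagnosis": "asthma"},
]
QUASI_IDENTIFIERS = ["age_range", "postal_prefix"]

# View the data by equivalence class: one line per class with its size.
for key, members in equivalence_classes(records, QUASI_IDENTIFIERS).items():
    print(key, len(members))

print("2-anonymous:", is_k_anonymous(records, QUASI_IDENTIFIERS, 2))
```

In this toy data the third record sits alone in its class, so the data set fails the check for k = 2. A de-identification tool would typically generalize or suppress values until every class reaches the chosen k.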

The sdcMicro package cannot handle large data sets and crashes often. We have had a lot of problems working with it on our data sets. It is a decent tool for experimenting with de-identification techniques but is not suitable for de-identifying real data sets.

Note that de-identification tools are different from masking tools. The first attached document provides an overview of de-identification techniques and explains at some length the differences between these two approaches and when each is more suitable.

The second attached document is a report produced by Canada Health Infoway that contains an overview of de-identification techniques as well as a summary of the tools that are available on the market today.


 


The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.


Sun, 18 Oct 2009 00:00:00 -0400
Risky Business Newsletter - September 2011 http://www.ehealthinformation.ca/knowledgebase/article/AA-00200 Risky Business is the re-identification risk management newsletter that we produce with Privacy Analytics Inc.

You can download it from here: http://www.privacyanalytics.ca/riskybusiness/september-2011.pdf

Topics in the September 2011 newsletter are:

  • Managing data quality in data warehouses.
  • Lessons from the privacy professor - perspectives from an experienced privacy professional.
  • Case study of de-identifying data for disclosures for research purposes from the BORN Ontario registry.

 


Wed, 28 Sep 2011 07:46:06 -0400
Risky Business Newsletter - August 2011 http://www.ehealthinformation.ca/knowledgebase/article/AA-00198 Risky Business is the re-identification risk management newsletter that we produce with Privacy Analytics Inc.

You can download it from here: http://www.privacyanalytics.ca/riskybusiness/august-2011.pdf

Topics in the August 2011 newsletter are:

  • Legislative uncertainty on de-identification provides opportunities
  • Wizards provide researchers with control of privacy issues
  • Upgrades to re-identification risk assessment and de-identification software
  • Case study: the Canadian Primary Care Sentinel Surveillance Network (CPCSSN)
  • Myth-busting whitepaper 

Wed, 31 Aug 2011 06:05:53 -0400
Risky Business Newsletter - July 2011 http://www.ehealthinformation.ca/knowledgebase/article/AA-00191 Risky Business is the re-identification risk management newsletter that we produce with Privacy Analytics Inc.

You can download it from here: http://www.privacyanalytics.ca/riskybusiness/july-2011.pdf

Topics in the July 2011 newsletter are:

  • the new release of the PARAT de-identification tool,
  • the use of our de-identification tool when disclosing cancer data by the cd-link project in Ontario,
  • the IRB and REB Wizards available on-line, and
  • our work on the $3m Heritage Health Prize.
 

Sat, 30 Jul 2011 06:51:17 -0400