What de-identification software tools are there?

There are five de-identification tools that are generally available. These tools work on structured data. There are other tools that focus specifically on free-form text, but these are not covered here.

It is also important to distinguish de-identification tools from masking tools. Masking tools do not provide adequate protection for personal information. There are many masking tools on the market (about two dozen vendors, whose tools vary widely in functionality). The difference between de-identification and masking is described in more detail in this article.

Beyond the five de-identification tools described below, the tools that exist are either internal to organizations, and therefore not generally available, or were developed by researchers for their own use and have not been applied broadly.

The five generally available de-identification tools are:

  • The PARAT tool from Privacy Analytics Inc. implements comprehensive risk management for three types of identity disclosure risk. More information about this product is available from here.
  • mu-Argus was developed by the Netherlands national statistical agency. More information about mu-Argus can be found here, and the tool itself can be downloaded from here.
  • The Cornell Anonymization Toolkit (CAT) implements a k-anonymity algorithm. It is an open source tool available here, with documentation available here.
  • The University of Texas at Dallas (UTD) Anonymization Toolbox contains open source Java implementations of some k-anonymity and attribute disclosure control algorithms, along with documentation.
  • The sdcMicro package in R provides some basic de-identification functions. You can download it from here.

Tools Assessment

PARAT from Privacy Analytics is the only tool that is commercially available and actively supported. As a further point of comparison, a recent article showed that the algorithm implemented in PARAT performs better than the algorithm implemented in CAT. Furthermore, the risk estimator used in PARAT has been shown to produce more accurate de-identification results than the one incorporated in mu-Argus.

The UTD toolbox includes some of the same algorithms as CAT. It is a set of capabilities rather than a tool that is ready for use by an end user (e.g., an analyst), and is therefore targeted more at developers. It is also not actively supported as a product.

We spent some time evaluating CAT and found a significant number of usability issues. For example, we were unable to find where the value of k for the k-anonymity algorithm is defined, it was not possible to view data by equivalence class, and the data views repeated the same record ID every 60 records. The tool cannot import standard data files, and the lack of documentation and support made it difficult to use. We also found it quite buggy. While it may have been adequate for completing a Master’s thesis project, it clearly lacks the functionality and robustness needed for broader use.
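To make concrete what viewing data by equivalence class and setting a value of k involve, here is a minimal Python sketch. It is not taken from CAT or any of the tools above. The records, the field names (age_range, postal_prefix, diagnosis) and the helper functions are hypothetical and chosen only for illustration. An equivalence class is taken here to be the set of records that share the same values on the quasi-identifiers, and a data set is treated as k-anonymous when every class contains at least k records.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        classes[key].append(record)
    return dict(classes)


def achieved_k(classes):
    """Return the size of the smallest equivalence class."""
    return min(len(members) for members in classes.values())


# Hypothetical, already-generalized records (illustrative only).
records = [
    {"id": 1, "age_range": "30-39", "postal_prefix": "K1A", "diagnosis": "asthma"},
    {"id": 2, "age_range": "30-39", "postal_prefix": "K1A", "diagnosis": "diabetes"},
    {"id": 3, "age_range": "40-49", "postal_prefix": "K2B", "diagnosis": "asthma"},
    {"id": 4, "age_range": "40-49", "postal_prefix": "K2B", "diagnosis": "flu"},
    {"id": 5, "age_range": "40-49", "postal_prefix": "K2B", "diagnosis": "asthma"},
]

classes = equivalence_classes(records, ["age_range", "postal_prefix"])

# View the data by equivalence class: one line per class, with its record ids.
for key, members in classes.items():
    print(key, [m["id"] for m in members])

print("k =", achieved_k(classes))
```

In this sketch the smallest class has two records, so the sample data would satisfy k = 2 but not k = 3. A real tool would also have to generalize or suppress values to reach the chosen k, which this example does not attempt.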

The sdcMicro package cannot handle large data sets and crashes often. We have had a lot of problems working with it on our data sets. It is a decent tool for experimenting with de-identification techniques, but it is not suitable for de-identifying real data sets.

Note that de-identification tools are different from masking tools. The first attached document provides an overview of de-identification techniques and explains at some length the differences between these two approaches and when each is more suitable.

The second attached document is a report produced by Canada Health Infoway that contains an overview of de-identification techniques as well as a summary of the tools that are available on the market today.

Further Reading: 2009 – Tools for De-Identification of Personal Health Information