What de-identification software tools are there? http://ehealthinformation.ca/knowledgebase/article/AA-00118
There are five de-identification tools that are generally available. These tools work on structured data. There are other tools that focus specifically on free-form text, but these are not covered here.

It is also important to distinguish between de-identification tools and masking tools. The latter do not really provide adequate protection for personal information. There are many masking tools on the market (about two dozen vendors, whose tools vary widely in functionality). The difference between de-identification and masking is described in more detail in this article: http://www.ehealthinformation.ca/documents/parat/riskdeid.pdf

Beyond the five de-identification tools described below, the tools that exist are internal to organizations and therefore are not generally available, or have been developed for personal use (by researchers) and therefore have not been applied broadly.

The five generally available de-identification tools are:

  • mu-Argus, developed by the Netherlands national statistical agency. More information about mu-Argus can be found here:
    http://neon.vb.cbs.nl/casc/Software/MuManual4.2.pdf

    and the tool itself can be downloaded from here: http://neon.vb.cbs.nl/casc/Software/MU420_B1.zip


  • The Cornell Anonymization Toolkit (CAT) implements a k-anonymity algorithm. It is an open source tool available here: http://sourceforge.net/projects/anony-toolkit/
    with documentation available here: http://www.cs.cornell.edu/bigreddata/publications/2009/sigmod2009-p1051-xiao.pdf


  • The University of Texas at Dallas Anonymization Toolbox, which contains open source Java implementations of some k-anonymity and attribute disclosure control algorithms, with documentation: http://cs.utdallas.edu/dspl/cgi-bin/toolbox/index.php


  • The sdcMicro package in R provides some basic de-identification functions. You can download it from here:
    http://cran.r-project.org/web/packages/sdcMicro/


Tools Assessment

The only tool that is commercially available and actively supported is PARAT from Privacy Analytics. Another useful point of comparison is that the algorithm implemented in PARAT has been shown in a recent article to perform better than the algorithm implemented in CAT (see http://www.jamia.org/cgi/content/short/16/5/670). Furthermore, the risk estimator used in PARAT has been shown to produce more accurate de-identification results than the one incorporated in mu-Argus (see http://www.jamia.org/cgi/content/abstract/15/5/627).

The UTD toolbox includes some of the same algorithms as CAT. This toolbox contains a set of capabilities rather than a tool that is ready to use by an end-user (e.g., an analyst), and therefore is targeted more at developers. It is also not actively supported as a product.

We spent some time evaluating the CAT tool and found a significant number of usability issues. For example, we were unable to find where the value of k for the k-anonymity algorithm is defined, it was not possible to view data by equivalence class, and the data views gave the same record id every 60 records. The tool could not import standard data files. The lack of documentation and support made using the tool difficult. We also found it quite buggy. While this may have been good enough to complete a Master's thesis project, it clearly lacked important functionality and robustness for broader use.

The sdcMicro package cannot handle large data sets and will crash often. We've had a lot of problems working with it on our data sets. It is a decent tool for experimenting with de-identification techniques but is not suitable if you want to de-identify real data sets.

Note that de-identification tools are different from masking tools. The first attached document provides an overview of de-identification techniques and explains at some length the differences between these two approaches and when each is more suitable.

The second attached document is a report produced by Canada Health Infoway that contains an overview of de-identification techniques as well as a summary of the tools that are available on the market today.


 


The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.


Sun, 18 Oct 2009 00:00:00 -0400
What is a quasi-identifier? http://ehealthinformation.ca/knowledgebase/article/AA-00120
As noted in a different KnowledgeBase article (view here), the primary type of disclosure risk that needs to be focused on is identity disclosure. An underlying assumption for this type of risk is that there is an intruder who has two pieces of information: (a) the actual data set that has been disclosed and (b) some background information about one or more people in this data set. The background information is described by a set of variables. These variables are the quasi-identifiers.

Examples of common quasi-identifiers in the context of health information are: dates (such as birth, death, admission, discharge, visit, and specimen collection), locations (such as postal codes, hospital names, and regions), race, ethnicity, languages spoken, aboriginal status, profession, and gender.

This set of variables may expand if more data are collected about the public and more public registries are made available. For example, in many jurisdictions in the US the voter lists are publicly available. This means that the basic demographics (such as full ZIP code, date of birth, gender, all included in the voter list) are automatically considered quasi-identifiers. Individuals may also post personal information about themselves on web sites or announce that information to their friends. For example, many new parents announce the exact birth weight of their new child. When we examined birth registries, we found that weight, hospital of birth, and age of mother make most births unique. Therefore, weight is quite a powerful identifier.

It is simply wrong to make a general statement that all variables in a disclosed data set are quasi-identifiers. The reason is that it is often not plausible for an intruder to gain background information about all variables in a data set. For example, using legitimate means in Canada it would be very difficult for an intruder to obtain all of the diagnosis codes for a patient and use these for re-identification. It is very difficult for an intruder to get a complete set of lab test result values and use these for re-identification.

One of the skills required in managing re-identification risk is deciding which plausible quasi-identifiers need to be considered. This analysis looks at the jurisdiction and the types of information that are publicly available, and should use the additional guidance that we have provided in other KnowledgeBase articles, available here for prosecutor risk and available here for journalist risk.

Therefore, to summarize, a quasi-identifier is a piece of information that an intruder can get hold of about a specific target individual or about a large number of people through the following means:

  • Personal knowledge of the specific target person (e.g., a neighbor, co-worker, ex-spouse).
  • The specific target person is famous and there is information publicly available about them.
  • Publicly available registries (e.g., voter lists and court records) or the media (e.g., obituaries published in newspapers or on-line).
  • Information that individuals post about themselves on the Internet (e.g., information they post on social networking sites).
  • Information that individuals often disclose to a large number of people (e.g., their baby's birth weight or birth date).


It is also important to remember that it may be possible to predict a quasi-identifier from another variable. In this case, both of the variables must be considered quasi-identifiers. There is no point protecting variable A but not variable B if the intruder can easily predict A from B. Therefore, it is important to search for correlated variables in a data set. Examples of correlated variables are:

  • Date of birth of a baby and date of discharge from a hospital.
  • Date of death and date of an autopsy.
  • Weight at birth and weight of baby at discharge from a hospital.
  • Age and date of graduation.
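
One rough way to screen for such pairs is to measure how often a candidate variable B pins down the value of a quasi-identifier A. The sketch below is only an illustration under stated assumptions: the function name `predictability`, the field names, and the records are all made up, and the score is simply the fraction of records for which the most common value of A within each value of B is the right one.

```python
from collections import Counter, defaultdict

def predictability(records, predictor, target):
    """Fraction of records whose `target` value equals the most common
    `target` value among records sharing the same `predictor` value."""
    groups = defaultdict(Counter)
    for rec in records:
        groups[rec[predictor]][rec[target]] += 1
    # For each predictor value, count the records matching its modal target.
    correct = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return correct / len(records)

# Made-up illustration data (hypothetical field names).
records = [
    {"birth": "2009-03-01", "discharge": "2009-03-03"},
    {"birth": "2009-03-01", "discharge": "2009-03-03"},
    {"birth": "2009-03-02", "discharge": "2009-03-04"},
    {"birth": "2009-03-05", "discharge": "2009-03-04"},
]
score = predictability(records, predictor="discharge", target="birth")
print(score)  # 0.75
```

A score close to 1 suggests that B should be treated as a quasi-identifier alongside A. On small data sets the score is inflated by predictor values that occur only once, so it is a screening aid rather than a test.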



Mon, 19 Oct 2009 00:00:00 -0400
Is there a secondary use market for health information? http://ehealthinformation.ca/knowledgebase/article/AA-00103
An issue that has occasionally come up is whether there is a secondary use market for health information. Of course secondary use has been occurring for many years in the context of research, quality improvement, and public health. But does the data have commercial value? PriceWaterhouseCoopers has just published a report (attached) which describes the market for health information.

Based on a survey, the report notes that "Across the board, the vast majority (over 80 percent) of survey respondents cited privacy, legal implications, and public relations ramifications as concerns" for secondary use of health information, but these issues are seen as solvable problems rather than insurmountable problems.

 


Tue, 13 Oct 2009 00:00:00 -0400
Why can't we just add noise to the data to de-identify it? http://ehealthinformation.ca/knowledgebase/article/AA-00130
A method that is sometimes used to de-identify data sets is to add noise to the values of the variables. For example, a random number of days is added to a date of birth to create a perturbed date of birth. You can also add noise to location data by moving a postal code to a randomly selected adjacent postal code.

In practice we have found that the data recipients do not like this approach to de-identification because they cannot trust the data anymore. For example, if we have a 50 year old male with cancer, it would not be known whether he was really 50 years old or 55 years old or 45 years old. The shift in age may make a difference in the analysis and in the conclusions drawn. Data recipients are concerned about drawing incorrect conclusions from the data because of perturbation.

The same mistrust issues come up with another technique called "microaggregation". The basic idea here is to identify a cluster of similar records in the data set and then replace the actual values with the average (or median) of that cluster. For example, the age would be replaced with the average age of the cluster. This is similar to the approach called "hot-deck imputation" that is used to deal with missing data. Again, the data recipients' reaction has been that they cannot trust the values. If a cluster inadvertently contains an outlier or an influential observation then the average may be distorted excessively, and potentially incorrect or inaccurate conclusions drawn.
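
The outlier problem can be seen in a small sketch. This is a deliberately simplified, single-variable version of microaggregation (sort the values, cut them into fixed-size clusters, replace each value with its cluster mean); the function name `microaggregate` and the ages are assumptions made for illustration, not the method of any particular tool.

```python
from statistics import mean

def microaggregate(values, cluster_size):
    """Replace each value with the mean of its cluster, where clusters are
    consecutive runs of `cluster_size` values in sorted order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [None] * len(values)
    for start in range(0, len(order), cluster_size):
        cluster = order[start:start + cluster_size]
        avg = mean(values[i] for i in cluster)
        for i in cluster:
            result[i] = avg
    return result

# Made-up ages; 95 is an outlier that lands in the second cluster.
ages = [44, 45, 46, 47, 48, 95]
aggregated = microaggregate(ages, cluster_size=3)
print(aggregated)
```

The first cluster (44, 45, 46) becomes 45 for every record, which is close to the truth. The second cluster (47, 48, 95) becomes roughly 63.3 for every record, so two patients in their late forties are now reported as being in their sixties.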

Along the same lines, this is why we have found data recipients and analysts reluctant to use synthetic data. In principle, synthetic data is not real data and therefore there are no identity disclosure risks with releasing it. Also, in principle the basic (bivariate) correlational structure of the data is maintained in the synthetic data. But if an analysis is complex, the distributions are non-standard, and the multivariate correlation structure is not captured in the synthetic version, then some relationships may be missed or incorrectly detected in the synthetic version.

The approach that is used more often, at least in the context of health data sets, is to generalize the variables. For example, the date of birth of a cancer patient born on 1st January 1959 may be generalized to just January 1959, or even just 1959. The generalized value is still true, but has less precision. Therefore, the data can be trusted and the risk of drawing incorrect conclusions is reduced.
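
The contrast between perturbation and generalization can be sketched in a few lines. The helper names `perturb_date` and `generalize_date`, the one-year noise window, and the output formats are assumptions chosen for illustration only.

```python
import random
from datetime import date, timedelta

def perturb_date(d, max_shift_days, rng):
    """Add random noise: the result looks plausible but is usually not true."""
    return d + timedelta(days=rng.randint(-max_shift_days, max_shift_days))

def generalize_date(d, level):
    """Reduce precision: the result is still true, only coarser."""
    if level == "month":
        return f"{d.year:04d}-{d.month:02d}"
    if level == "year":
        return f"{d.year:04d}"
    raise ValueError(f"unknown level: {level}")

dob = date(1959, 1, 1)
rng = random.Random(0)                      # seeded only to make the run repeatable
noisy = perturb_date(dob, 365, rng)         # some date within a year of the truth
by_month = generalize_date(dob, "month")    # "1959-01"
by_year = generalize_date(dob, "year")      # "1959"
print(noisy, by_month, by_year)
```

A recipient of the perturbed date cannot tell how far it is from the real one, whereas a recipient of "1959" knows exactly what has been withheld.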



Sat, 31 Oct 2009 00:00:00 -0400
Which type of threshold should we use for de-identification? http://ehealthinformation.ca/knowledgebase/article/AA-00105

Many types of thresholds have been suggested and used for deciding when a data set is de-identified. Some common ones are:

  • Cell size of 5, 3, or 10
  • Uniqueness
  • Rareness
     

A question that comes up in practice is "which threshold should we use?".

 

In fact, all three of these are related. The general rule is:

 

 X% of the records are in cell sizes >= k (that is, in equivalence classes of size at least k)

A common instantiation, called 5-anonymity, is:

 

 100% of the records are in cell sizes >= 5

 

This means that every combination of values on the quasi-identifiers that appears in the data set occurs at least five times.

 

The uniqueness criterion can be stated as 2-anonymity:

  

  100% of the records are in cell sizes >= 2

 

However, there are cases where 95% or 80% are acceptable values for X.

 

For example, some cancer registries release their data to researchers if less than 20% of their records are unique, and to the public if less than 5% of their records are unique.
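
The general rule can be checked mechanically once the quasi-identifiers are chosen. The following is a minimal sketch, assuming the data set is a list of dictionaries; the function name `share_in_cells_of_at_least`, the field names, and the records are made up for illustration.

```python
from collections import Counter

def share_in_cells_of_at_least(records, quasi_identifiers, k):
    """Fraction of records whose equivalence class (records sharing the same
    values on all quasi-identifiers) has size >= k."""
    def key(rec):
        return tuple(rec[q] for q in quasi_identifiers)
    sizes = Counter(key(rec) for rec in records)
    covered = sum(1 for rec in records if sizes[key(rec)] >= k)
    return covered / len(records)

# Made-up illustration data (hypothetical field names).
records = [
    {"birth_year": 1959, "sex": "M"},
    {"birth_year": 1959, "sex": "M"},
    {"birth_year": 1959, "sex": "M"},
    {"birth_year": 1960, "sex": "F"},
    {"birth_year": 1961, "sex": "F"},
]
qi = ["birth_year", "sex"]
non_unique = share_in_cells_of_at_least(records, qi, k=2)
five_anon = share_in_cells_of_at_least(records, qi, k=5)
print(non_unique, five_anon)  # 0.6 0.0
```

In this toy data set 60% of the records are in cells of size two or more, so 40% of the records are unique. It would fail both the 20% and the 5% uniqueness thresholds mentioned above, and it is far from 5-anonymity.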

 

The third criterion, rareness, means one has to ensure that there are no rare records. The general rule here is:

 

 all equivalence classes have >X% of the records in the population

 

This rule ensures that there are no equivalence classes that are relatively rare. Rareness is often defined in terms of the population not in terms of the records in the data set.

 

For example, some national statistical agencies will not disclose census information if any equivalence classes cover less than or equal to 0.5% of the population. This is the rule used to justify not releasing individual ages above 89 years because very few people live beyond that age (i.e., fewer than 0.5% of the population are in each of the age ranges above 89).
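
A rareness check differs from the cell size check in that it is computed against the population rather than the data set being released. A minimal sketch, assuming population-level records are available as a list of dictionaries (the function name `rare_classes` and the toy population are made up):

```python
from collections import Counter

def rare_classes(population, quasi_identifiers, threshold_pct):
    """Equivalence classes that cover at most `threshold_pct` percent of the
    population, mapped to their sizes."""
    sizes = Counter(tuple(p[q] for q in quasi_identifiers) for p in population)
    total = len(population)
    return {cls: n for cls, n in sizes.items() if 100 * n / total <= threshold_pct}

# Made-up population of 1000 people described only by age.
population = (
    [{"age": 30}] * 600
    + [{"age": 60}] * 396
    + [{"age": 95}] * 4
)
rare = rare_classes(population, ["age"], threshold_pct=0.5)
print(rare)  # {(95,): 4}
```

Here the class of 95 year olds covers 0.4% of the population and would be flagged, which is the kind of result that motivates grouping very high ages into a single category.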

 

The question is, which one of the above rules should be used and what values should be relied on? There are no hard rules on this, but a reasonable approach is to use precedent.

 

The argument for using precedent is that it signifies acceptability. If a particular rule has a lot of precedent then it suggests that society has accepted the level of risk implied by the rule. For example, there is a lot of precedent spanning multiple decades for the cell size of five rule, so it is safe to assume that this is a generally accepted level of risk.

 

Precedent may be specific to a certain type of data or registry. For example, some precedents may be more acceptable for the disclosure of cancer registry data, but may not be acceptable for sexually transmitted disease or mental health data. Also, of course, it will depend on who the data is being disclosed to.



Tue, 13 Oct 2009 00:00:00 -0400
Definition of identifiable dataset - if a person can find their record(s) in the dataset http://ehealthinformation.ca/knowledgebase/article/AA-00101
One question that sometimes comes up is whether a data set can be considered identifiable if a person can find their own record(s) in there. This definition can be analyzed from a number of different perspectives.

A person may not know if they are in a data set if the data set is a sample. One example is if a data set is based on chart reviews from a random subset of patients at a clinic, then any randomly selected patient in that clinic will not necessarily know that they are in the data set created from the chart review. This uncertainty means that the above definition of identifiability may not be appropriate. One primary reason is that if a patient finds a record that matches their own characteristics they will not know if that record really belongs to them or to someone else.

As a caveat, if the clinic is small then the chances of another patient having exactly the same characteristics would also be small. Also, if the number of variables extracted from the charts is large, then it is less likely that there would be another patient at the clinic who is similar on all of the variables.

Another scenario is if a patient does know that s/he is in a data set. There are a couple of ways that a patient can know that their record is in that data set:

  • If the patient knows that they are unique in the population, and they find a match in the chart review sample, then they would be confident that their own record has been discovered. But this assumes that the patient knows that they are unique in the population. There are some circumstances where that knowledge is reasonable. For example, Canadians living in urban areas are unique on their date of birth and residence postal code. Therefore, if a patient uses these two variables and finds a match in the chart review sample, it is almost certain that the record is his/hers.
  • If the data set is not a sample but a whole population, say as in a population registry, then the patient would know for sure that they are in the data set, for example, if the data set is a provincial cancer registry. If the patient finds a single record that matches then s/he will know for sure that it was his/her record. If the patient is unique, then the patient will know for sure that s/he has discovered his/her own record.

 

Let us assume that one of the above two conditions is true. In that case, is the fact that the patient can find their own record a workable definition of an identifiable data set?

If we accept the above definition then we are setting a high standard. A common way we model an intruder is to consider the kind of background knowledge the intruder would have about the target person (or persons) being re-identified. The more background information the intruder has the greater the re-identification risk. A person will have the maximum possible background information about themselves (i.e., if the intruder is also the target person being re-identified); much more than any other intruder would know. It is true that many people tell their friends and family many things, but they do not tell them absolutely everything. Therefore, the background knowledge of a person about themselves represents the maximum possible background information and therefore the maximum possible risk. If one wants to be conservative, then this is a good approach. But in many cases assuming that an intruder will know absolutely everything does not seem very plausible and sets quite a high standard. In fact, the standard would be so high that we would not be able to share any information at all unless:

  • the data set disclosed is a random sample so that an individual would not know if their record is within the data set (i.e., no population registry could be considered de-identified almost by definition),
  • the sample data set does not include many variables so that there would be other individuals with the same characteristics in the population (e.g., the clinic example mentioned above), and
  • the underlying population is large enough that the chances of an individual being unique are quite small.


A counterargument that can be made is that people are now voluntarily (and involuntarily through their friends and colleagues) revealing more and more about themselves on their blogs, Facebook pages, and Tweets. This is certainly the case and more and more is being revealed every day. Whether this type of self-exposure of personal information amounts to individuals revealing everything about themselves such that an intruder has the same background knowledge as the person themselves remains an empirical question, although it is easy to argue that we have not quite reached that point yet.

Another scenario to consider is when the following conditions are met:

  • the data set has some quasi-identifiers and some sensitive information (an example of the quasi-identifiers would be the demographics),
  • there are only two individuals in the data set that have exactly the same values on the quasi-identifiers,
  • one of those individuals, say Bob, gets the data set, and
  • Bob knows the second person who has the same characteristics, Joe.


Under these conditions, Bob would discover the sensitive information about Joe with certainty. Therefore, re-identifying one's own record resulted in the disclosure of sensitive information about another individual.

The approach we have taken is to define plausible intruders (or archetypes of intruders) and assess what type of background knowledge they would have. The three we consider are: a neighbor, an ex-spouse, and a reporter. We selected these three intruders because all of the re-identifications that have actually happened and that have been publicly acknowledged have been done by researchers, reporters, or in court cases. All three acknowledged types of intruders can use publicly available information (e.g., in public registries). All three acknowledged types of intruders can talk to neighbors or ex-spouses to get additional information. Therefore, by focusing on these three types of intruders we are addressing plausible risks that we know have happened.

Furthermore, we ensure that there are always more than two records with the same values on the quasi-identifiers. That way the re-identification of one's own record does not facilitate the discovery of new information about someone else.
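
The Bob and Joe scenario can be made concrete with a small sketch that lists what a person who recognizes their own record can see about everyone else in the same equivalence class. The function name `others_in_same_class`, the field names, and the records are made up for illustration.

```python
def others_in_same_class(records, quasi_identifiers, own_index):
    """Records, other than one's own, that share the same values on all
    quasi-identifiers."""
    own_key = tuple(records[own_index][q] for q in quasi_identifiers)
    return [
        rec
        for i, rec in enumerate(records)
        if i != own_index
        and tuple(rec[q] for q in quasi_identifiers) == own_key
    ]

# Made-up data: record 0 is Bob's, record 1 is Joe's.
records = [
    {"birth_year": 1970, "sex": "M", "diagnosis": "A"},
    {"birth_year": 1970, "sex": "M", "diagnosis": "B"},
    {"birth_year": 1980, "sex": "F", "diagnosis": "C"},
]
others = others_in_same_class(records, ["birth_year", "sex"], own_index=0)
print(others)
```

With a class of size two there is exactly one other record, so Bob can attribute its diagnosis to Joe with certainty. With more than two records in the class, Bob would see several candidate records and could not tell which one is Joe's, unless they all happen to share the same sensitive value.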



Mon, 12 Oct 2009 00:00:00 -0400
Research Ethics Board Wizard - Re-identification risk assessment without data http://ehealthinformation.ca/knowledgebase/article/AA-00172
Research Ethics Boards (REBs) often have to make decisions about re-identification risk before any data is collected. For many REBs the majority of their protocols are not "secondary use" protocols whereby a database exists and the investigator wishes to analyze that data. Rather, many are prospective studies where new data will be collected. Traditional re-identification risk assessment tools and de-identification tools could not handle that situation because they required the data to already exist - until now.

The REB Wizard tool that is illustrated in this video provides REBs the capability to assess re-identification risk by just describing the fields that will be collected and which part of the country (REB Wizard only exists for Canada at this point) the data will be collected from. Based on extensive analysis of the Canadian census, we have constructed models that would then provide an estimate of the percentage of the population that is at high risk of re-identification. In this case re-identification risk is measured in terms of uniqueness.





Tue, 05 Oct 2010 00:00:00 -0400
What is the re-identification risk from small simple counts of disease cases? http://ehealthinformation.ca/knowledgebase/article/AA-00104
A custodian has been asked to release counts of people with a particular disease. For example, in 2008 four people had that particular disease in Ontario. Since the count is less than five, is there a re-identification risk in disclosing this information? To make the example concrete, let's say it is a rare disease, like botulism. Note that the custodian has not broken down the numbers by age group, gender, or other demographic - these are simple counts.

When analyzing a situation like that we have to: (a) determine the plausible intruder scenarios, (b) decide if the intruder will identify an individual, and (c) determine whether such an identification would allow the intruder to discover something new about the patient. Based on such an analysis one can determine whether a plausible re-identification risk exists.

To make the intruder scenario concrete, let's assume that the intruder is Bob and he is trying to re-identify Alice. The only way that Bob would know that Alice is in the database is if he already knew that she had botulism. Therefore, Bob would not gain any additional information by knowing that she is one of these four cases.

In such a situation, disclosing counts, even if they are for rare diseases and if the counts are below 5, does not necessarily represent a plausible disclosure risk.

Now let us extend the example: the geography is not the whole of Ontario but a small town, and everyone in town knew that a certain four people fell ill after eating at a restaurant because their names were posted in the local newspaper, but no one knew what was wrong with them. If, say, a public health agency then reports that four people had botulism in that town, that would be problematic. The reason is that an intruder only needs to know that Alice was one of the people affected by a food-borne illness at that restaurant, and the disclosed number lets the intruder find out something new about Alice: that she had botulism.

Therefore, the disclosure of small counts (i.e., less than 5) does not necessarily indicate a disclosure risk. But, as the above example illustrates, the specifics of the case may change that general conclusion.



Mon, 12 Oct 2009 00:00:00 -0400
What are the different types of disclosure risk? http://ehealthinformation.ca/knowledgebase/article/AA-00119

There are two general kinds of re-identification risk that are of concern. The first is when an intruder can assign an identity to any record in the disclosed database. For example, the intruder would be able to determine that record number 7 in the disclosed database belongs to patient Alice Smith. This is called identity disclosure. The second is when an intruder learns something new about a patient in the database without knowing which specific record belongs to that patient. For example, if all patients from a particular area in the emergency database had a certain test result, then the intruder does not need to know which record belongs to Alice Smith: if she lives in that area, the intruder will discover sensitive information about her. This is called attribute disclosure.
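
Attribute disclosure of this kind can be screened for by looking for groups of records that share the same quasi-identifier values and also all share the same sensitive value. The sketch below is illustrative only; the function name `homogeneous_classes`, the field names, and the records are assumptions.

```python
from collections import defaultdict

def homogeneous_classes(records, quasi_identifiers, sensitive):
    """Equivalence classes in which every record has the same sensitive
    value, mapped to that value."""
    values = defaultdict(set)
    for rec in records:
        values[tuple(rec[q] for q in quasi_identifiers)].add(rec[sensitive])
    return {cls: next(iter(vals)) for cls, vals in values.items() if len(vals) == 1}

# Made-up emergency department records (hypothetical areas and results).
records = [
    {"area": "K1A", "test": "positive"},
    {"area": "K1A", "test": "positive"},
    {"area": "K1A", "test": "positive"},
    {"area": "K2B", "test": "positive"},
    {"area": "K2B", "test": "negative"},
]
exposed = homogeneous_classes(records, ["area"], "test")
print(exposed)  # {('K1A',): 'positive'}
```

Anyone known to live in area K1A and to be in this data set can be inferred to have a positive result, even though no individual record has been matched to a name. A class containing a single record is flagged as well, since it is trivially homogeneous.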


All known examples of re-identification are identity disclosures. Therefore, a strong case can be made that this is the type of disclosure risk that we should focus on first, and should try to manage.

There are three sub-types of risk under identity-disclosure: (a) prosecutor risk, (b) journalist risk, and (c) marketer risk. Another KnowledgeBase article explains the difference between prosecutor and journalist risk: [view here].

k-Anonymity algorithms are often used to manage the risk of re-identification for prosecutor and journalist risk. 



Mon, 19 Oct 2009 00:00:00 -0400
What are the quasi-identifiers that I should use for managing prosecutor risk? http://ehealthinformation.ca/knowledgebase/article/AA-00111
If you are trying to manage prosecutor risk, then you assume that the intruder has a specific target person in mind and is trying to re-identify that person's records in the disclosed data set. The intruder is also able to get some background information about that target person. This type of background information represents the quasi-identifiers you are interested in. The type of background information we assume depends on the intruder. The types of intruder we usually consider when managing prosecutor risk are: (a) a neighbor, (b) an ex-spouse, (c) an employer or colleague at work, (d) a relative, and (e) a stalker.

In a worst case scenario, a neighbor would know:

  • Address and telephone information about the target individual
  • Household and dwelling information (number of children, value of property, type of property)
  • Key dates (births, deaths, weddings, admissions, discharges)
  • Visible characteristics: gender, race, ethnicity, language spoken at home, weight, height, physical disabilities
  • Profession

 

Of course, not all neighbors are friendly or nosy, and therefore, a particular neighbor may not know all of the above things. But these are plausible things that a neighbor would know by observing and casually interacting with the target individual and their family.

What an ex-spouse would know includes:

  • The same things that a neighbor would know
  • Basic medical history (allergies, chronic diseases)
  • Income, years of schooling

 

An employer or relative would generally know less than the above two.

A stalker could be after a famous person or an estranged spouse or boy/girlfriend. The quasi-identifiers that a stalker would know are the same as those of an ex-spouse in the latter case, and whatever information is publicly available about a famous person in the former case. We generally make an assumption that an ex-spouse would have the most background information.

If any of the above variables exist in the disclosed data set, then you should take them into account in the re-identification risk analysis.

But we also have to be pragmatic. For example, there are no easy ways to de-identify diagnosis codes. Therefore, if they exist in the disclosed data set and represent a high re-identification risk, then that risk may be better mitigated using a data sharing agreement and audits (see our risk assessment methodology) rather than through de-identification.



Wed, 14 Oct 2009 00:00:00 -0400