What de-identification software tools are there? http://www.ehealthinformation.ca/knowledgebase/article/AA-00118 There are five de-identification tools that are generally available. These tools work on structured data. There are other tools that focus specifically on free-form text, but these are not covered here.

Also, it is important to make a distinction between de-identification tools and masking tools. The latter do not really provide adequate protection for personal information. There are many masking tools on the market (about two dozen vendors, whose tools vary widely in functionality). The difference between de-identification and masking is described in more detail in this article: http://www.ehealthinformation.ca/documents/parat/riskdeid.pdf

Beyond the five de-identification tools described below, the tools that exist are internal to organizations and therefore are not generally available, or have been developed for personal use (by researchers) and therefore have not been applied broadly.

The five generally available de-identification tools are:

  • mu-Argus, developed by the Netherlands national statistical agency. More information about mu-Argus can be found here:
    http://neon.vb.cbs.nl/casc/Software/MuManual4.2.pdf

    and the tool itself can be downloaded from here: http://neon.vb.cbs.nl/casc/Software/MU420_B1.zip


  • The Cornell Anonymization Toolkit (CAT) implements a k-anonymity algorithm. It is an open source tool available here: http://sourceforge.net/projects/anony-toolkit/
    with documentation available here: http://www.cs.cornell.edu/bigreddata/publications/2009/sigmod2009-p1051-xiao.pdf


  • The University of Texas at Dallas Anonymization Toolbox, which contains open source Java implementations of some k-anonymity and attribute disclosure control algorithms, with documentation: http://cs.utdallas.edu/dspl/cgi-bin/toolbox/index.php


  • The sdcMicro package in R provides some basic de-identification functions. You can download it from here:
    http://cran.r-project.org/web/packages/sdcMicro/


  • PARAT, a commercial tool from Privacy Analytics, which is discussed in the assessment below.


Tools Assessment

The only tool that is commercially available and actively supported is PARAT from Privacy Analytics. Another useful point of comparison is that the algorithm implemented in PARAT has been shown in a recent article to perform better than the algorithm implemented in CAT (see http://www.jamia.org/cgi/content/short/16/5/670). Furthermore, the risk estimator used in PARAT has been shown to produce more accurate de-identification results than the one incorporated in mu-Argus (see http://www.jamia.org/cgi/content/abstract/15/5/627).

The UTD toolbox includes some of the same algorithms as CAT. This toolbox contains a set of capabilities rather than a tool that is ready to use by an end-user (e.g., an analyst), and therefore is targeted more at developers. It is also not actively supported as a product.

We spent some time evaluating the CAT tool. There are a significant number of usability issues with the tool. For example, we were unable to find where the value of k for the k-anonymity algorithm is defined, it was not possible to view data by equivalence class, and the data views gave the same record id every 60 records. Standard data files cannot be imported. The lack of documentation and support made using the tool difficult. We also found it quite buggy. While this may have been good enough to complete a Master's thesis project, it clearly lacks important functionality and robustness for broader use.

The sdcMicro package cannot handle large data sets and crashes often. We have had a lot of problems working with it on our data sets. It is a decent tool for experimenting with de-identification techniques but is not suitable if you want to de-identify real data sets.

Note that de-identification tools are different from masking tools. The first attached document provides an overview of de-identification techniques and explains at some length the differences between these two approaches and when each is more suitable.

The second attached document is a report produced by Canada Health Infoway that contains an overview of de-identification techniques as well as a summary of the tools that are available on the market today.


 


The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.


Sun, 18 Oct 2009 00:00:00 -0400
What is a quasi-identifier? http://www.ehealthinformation.ca/knowledgebase/article/AA-00120 As noted in a different KnowledgeBase article, the primary type of disclosure risk that needs to be focused on is identity disclosure. An underlying assumption for this type of risk is that there is an intruder who has two pieces of information: (a) the actual data set that has been disclosed and (b) some background information about one or more people in this data set. The background information is described by a set of variables. These variables are the quasi-identifiers.

Examples of common quasi-identifiers in the context of health information are: dates (such as birth, death, admission, discharge, visit, and specimen collection), locations (such as postal codes, hospital names, and regions), race, ethnicity, languages spoken, aboriginal status, profession, and gender.

This set of variables may expand if more data are collected about the public and more public registries are made available. For example, in many jurisdictions in the US the voter lists are publicly available. This means that the basic demographics (such as full ZIP code, date of birth, gender, all included in the voter list) are automatically considered quasi-identifiers. Individuals may also post personal information about themselves on web sites or announce that information to their friends. For example, many new parents announce the exact birth weight of their new child. When we examined birth registries, we found that weight, hospital of birth, and age of mother make most births unique. Therefore, weight is quite a powerful identifier.

It is simply wrong to make a general statement that all variables in a disclosed data set are quasi-identifiers. The reason is that it is often not plausible for an intruder to gain background information about all variables in a data set. For example, using legitimate means in Canada it would be very difficult for an intruder to obtain all of the diagnosis codes for a patient and use these for re-identification. It is very difficult for an intruder to get a complete set of lab test result values and use these for re-identification.

One of the skills required in managing re-identification risk is deciding which plausible quasi-identifiers need to be considered. This analysis looks at the jurisdiction and the types of information that are publicly available, and should use the additional guidance that we have provided in other KnowledgeBase articles on prosecutor risk and on journalist risk.

Therefore, to summarize, a quasi-identifier is a piece of information that an intruder can get hold of about a specific target individual or about a large number of people through the following means:

  • Personal knowledge of the specific target person (e.g., a neighbor, co-worker, ex-spouse).
  • The specific target person is famous and there is information publicly available about them.
  • Publicly available registries (e.g., voter lists and court records) or the media (e.g., obituaries published in newspapers or on-line).
  • Information that individuals post about themselves on the Internet (e.g., information they post on social networking sites).
  • Information that individuals often disclose to a large number of people (e.g., their baby's birth weight or birth date).


It is also important to remember that it is possible to predict a quasi-identifier from another variable. In this case, both of the variables must be considered quasi-identifiers. There is no point protecting variable A but not variable B if the intruder can easily predict A from B. Therefore, it is important to search for correlated variables in a data set. Examples of correlated variables are:

  • Date of birth of a baby and date of discharge from a hospital.
  • Date of death and date of an autopsy.
  • Weight at birth and weight of baby at discharge from a hospital.
  • Age and date of graduation.
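
The search for correlated variables described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure: the column names, the sample values, and the 0.9 cut-off are all hypothetical, and a plain Pearson correlation only detects linear relationships between numeric variables (dates would first have to be converted to day counts).

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column cannot predict anything
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return pairs of column names whose |correlation| is at or above threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        if abs(pearson(a, b)) >= threshold:
            flagged.append((name_a, name_b))
    return flagged


# Hypothetical records: weights in grams, plus an unrelated lab value.
columns = {
    "birth_weight_g": [3200, 2900, 3600, 4100, 2500],
    "discharge_weight_g": [3100, 2850, 3500, 4000, 2450],
    "lab_value": [7.1, 7.1, 6.8, 7.4, 7.0],
}

pairs = correlated_pairs(columns)
print(pairs)
```

Any pair flagged this way would be a candidate for treating both variables as quasi-identifiers if either one is.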



Mon, 19 Oct 2009 00:00:00 -0400
The five levels of identifiability http://www.ehealthinformation.ca/knowledgebase/article/AA-00143 The concept of identifiability is critical to managing the privacy risks when collecting, using, and disclosing personal information. In this article we therefore present a framework to reason about this concept, and then at the end some important implications are discussed.

It is important to distinguish between directly identifying variables, quasi-identifiers, and sensitive variables. A separate knowledgebase article provides some definitions.

If the disclosed database contains directly identifying variables then it is clearly personal information. However, if a database contains quasi-identifiers it can still be personal information. This is a very important point when managing risks from holding sensitive health information. The real examples provided in the table below show how individuals’ identities could be determined by using only the quasi-identifiers. In none of these examples was any directly identifying information included in the database, but it was still possible to determine the identity of at least one individual (and in some cases most of the individuals in the database).


Example: Details

  • AOL search data: AOL put anonymized Internet search data (including health-related searches) on its web site. New York Times reporters were able to re-identify an individual from her search records within a few days.
  • Chicago homicide database: Students were able to re-identify a significant percentage of individuals in the Chicago homicide database by linking with the social security death index.
  • Netflix movie recommendations: Individuals in an anonymized publicly available database of customer movie recommendations from Netflix are re-identified by linking their ratings with ratings in a publicly available Internet movie rating web site.
  • Re-identification of the medical record of the governor of Massachusetts: Data from the Group Insurance Commission, which purchases health insurance for state employees, was matched against the voter list for Cambridge, re-identifying the governor’s health insurance records.
  • Southern Illinoisan vs. The Department of Public Health: An expert witness was able to re-identify with certainty 18 out of 20 individuals in a neuroblastoma data set from the Illinois cancer registry, and was able to suggest one of two alternative names for the remaining two individuals.
  • Canadian Adverse Drug Event Database: A national broadcaster aired a report on the death of a 26 year-old student taking a particular drug who was re-identified from the adverse drug reaction database released by Health Canada. The national broadcaster matched the information from the report to the publicly available obituaries for that area of Ontario.
  • Prescription and diagnosis records of a patient re-identified: A neighbour re-identifies the records of a hospital patient by knowing her approximate age, gender, approximate admission date, and postal code.

The implication from this observation is that data may still be personal information even after the removal or obfuscation of the directly identifying variables.

Five Level Model

In a recent article we elaborated on a five-level model of data identifiability: http://www.computer.org/portal/web/csdl/doi/10.1109/MSP.2010.103. Here we provide a brief summary of that model - please see the full paper for the details.

We can use this model to understand different types of data and the risks they imply.

Level 1 pertains to information that is clearly identifiable. For example, a database containing names, SSNs and financial transaction information about individuals would be clearly Level 1 on our identifiability scale. At this level no real effort is needed to re-identify an individual. If we have someone's name and address, we know who they are.

Level 5 pertains to aggregate information, which is clearly not identifiable. Aggregate information consists of counts. For example, a table showing that 25 people have died from H1N1 pandemic influenza in Canada in November would be considered aggregate data.

Masked data (Level 2) has had some manipulations done to the identifying variables. However, with masked data nothing has been done to obfuscate the quasi-identifiers. A more detailed discussion of masking methods is provided here: http://www.ehealthinformation.ca/documents/DeidTechniques.pdf. Because the quasi-identifiers are not touched in Level 2 data, this is effectively still personal information.

The difference between Level 2 and Level 3 is that in the latter the organization attempts to obfuscate the quasi-identifiers as well as the identifying variables. Level 3 data is very common. This level exists because many organizations do not use sound means to de-identify the quasi-identifiers, and therefore they do a poor job at it. For example, in a Canadian context, the date of birth and full postal code uniquely identify many Canadians living in urban areas, making that combination of quasi-identifiers very high risk for re-identifying individuals. Reducing the precision of the postal code to a five character postal code does not actually reduce the risk of re-identification, but is quite a common practice. We termed this level “Exposed” because the organization may believe that they have de-identified the data, but in fact the risk of re-identification is still high. Therefore, the data custodian has a high risk exposure and may not know it (i.e., the data custodian will not have put in place any controls to mitigate those risks).

It should also be noted that successfully re-identifying "Exposed" data does not indicate anything about the effectiveness of de-identification methods, because such data was never properly de-identified. In fact, one would expect that "Exposed" data would be quite easy to re-identify.

With level 4 data an objective assessment of the re-identification risk is performed and a data custodian can substantiate claims that the data is properly de-identified. Level 4 data can be microdata or in tabular form; the same point about objective risk assessment applies for tabular data. It is only at level 4 that data moves from being personal information to not being personal information. Level 4 data is called "Managed" because the risk of re-identification is managed by the data custodian. The risk of re-identification may vary across organizations that disclose Level 4 data, but in all cases it is managed. This means that the custodian knows what the risk of re-identification is objectively (i.e., the custodian has measured it), and has taken reasonable actions to manage that risk.

Data at level 4 and level 5 present the least risk to the organization because a strong case can be made that this is not personal information any more.

Re-identification, Effort, and Skill

As data moves up the scale, more effort is needed to re-identify it. Therefore, even though level 2 and level 3 data are still considered personal information and the risk of re-identification would be considered quite high, the amount of effort needed to re-identify an individual also increases as data moves up.




For example, consider a data set with full name and address, and the date of birth. This would be a level 1 data set. The level 2 version of this data set has had its identifying variables masked, so that all we have left is a postal code and date of birth. With some effort those two demographic variables can be linked to a person (e.g., a full name and address). This re-identification takes more effort than with the level 1 version of that data, where we already had the name and address. Similarly, level 3 data requires more effort to re-identify than level 2 data.

As a corollary, re-identifying higher level data also takes more expertise and skill than re-identifying lower level data. Therefore, a lay person can re-identify level 1 data, but it is quite likely that an intruder skilled at re-identification would be needed to re-identify data at levels 2 and 3. However, with effort and skill the probability of successful re-identification is high.

For data at level 4 and level 5, even an intruder with significant effort and skill will have a lower probability of re-identifying individuals in the disclosed data set. This is represented in the graph above as a disproportionate increase in resources and skills needed by an intruder to re-identify that kind of data. It is not impossible, but the pre-requisites have increased dramatically.

Furthermore, level 4 and 5 data have a built-in disincentive for an intruder in that it is not worth it for them to invest their skills and time into re-identification if the probability of a successful match is low. The value of the re-identified data, whether economic, political, notoriety, or based on some other criterion, would have to be quite high for an intruder to invest the necessary resources to attack a level 4 or level 5 data set. One consequence of this is that few people would be able and willing to re-identify a level 4 or 5 data set. If a level 4 or level 5 data set is lost or stolen, it is not an automatic consequence that an attempt will be made to re-identify the data.

Another important consideration when we talk about re-identification effort is whether we are concerned about the re-identification of a single individual or many individuals in the disclosed data. Of course, the re-identification of many individuals will require more effort than a single individual. Therefore, when we talk about re-identification effort, we mean a normalized effort (i.e., effort to re-identify a single individual) so that the five level model applies irrespective of the data set size.

Example

As a concrete example, let us consider a hypothetical clinical data set with the following variables: first name, last name, health insurance number, street address, six character postal code, date of birth, date of doctor’s visit, and whether the individual has a sexually transmitted disease. In the five level framework, the data sets would be:


Level 1. The full data set as is.

Level 2. The names are replaced by fake names, the health insurance number is replaced with a fake number, and the street address field is removed altogether.

Level 3. The data set at Level 2 also has the postal code generalized from six characters to five characters. The risk at Level 3 is the same as Level 2, but the organization believes it has de-identified the data and discloses it. Therefore, the organization is exposed.

Level 4. The data set at Level 3 is further modified by replacing the five character postal code with a single character postal code, the date of birth is replaced by age, and the date of visit is replaced by the month of the visit. A re-identification risk assessment is then performed on this data set and the risk is found to be below a pre-specified threshold.

Level 5. The number of individuals with a sexually transmitted disease.


In this example, the data in the last two levels would be considered de-identified data, but the first three would still be personal information.
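
The generalizations applied between Level 2 and Level 4 in this example can be sketched as follows. The field names, the sample record, and the reference date used to compute age are hypothetical, and keeping the year together with the visit month is an assumption. The sketch covers only the transformation step: the data only reaches Level 4 once a re-identification risk assessment has shown the risk to be below the chosen threshold.

```python
from datetime import date


def age_on(birth, reference):
    """Age in completed years on the reference date."""
    had_birthday = (reference.month, reference.day) >= (birth.month, birth.day)
    return reference.year - birth.year - (0 if had_birthday else 1)


def generalize_record(record, reference_date):
    """Apply the example's generalizations to a record whose direct identifiers are masked."""
    return {
        # six character postal code -> first character only
        "postal_code": record["postal_code"][:1],
        # exact date of birth -> age in years
        "age": age_on(record["date_of_birth"], reference_date),
        # exact visit date -> year and month of the visit (assumed format)
        "visit_month": record["visit_date"].strftime("%Y-%m"),
        # sensitive variable is left unchanged
        "has_std": record["has_std"],
    }


# Hypothetical Level 2 record (names and insurance number already replaced or removed).
level2 = {
    "postal_code": "K1H8L1",
    "date_of_birth": date(1959, 1, 1),
    "visit_date": date(2009, 11, 17),
    "has_std": False,
}

candidate = generalize_record(level2, reference_date=date(2009, 12, 29))
print(candidate)
```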

Data Protection

Arguably, it would require less investment and resources to protect levels 4-5 data compared to levels 1-3 data. This is one of the key advantages, from an organizational perspective, of de-identification.


Acknowledgements: the original idea for this model came out of discussions with Craig Earle of the Ontario Institute for Cancer Research, and this is an adaptation & extension of a model he presented at the PHIPA conference in Toronto in the Fall of 2009.


Tue, 29 Dec 2009 23:00:00 -0500
Is there a secondary use market for health information? http://www.ehealthinformation.ca/knowledgebase/article/AA-00103
An issue that has occasionally come up is whether there is a secondary use market for health information. Of course secondary use has been occurring for many years in the context of research, quality improvement, and public health. But does the data have commercial value? PriceWaterhouseCoopers has just published a report (attached) which describes the market for health information.

Based on a survey, the report notes that "Across the board, the vast majority (over 80 percent) of survey respondents cited privacy, legal implications, and public relations ramifications as concerns" for secondary use of health information, but these issues are seen as solvable problems rather than insurmountable problems.

 


Tue, 13 Oct 2009 00:00:00 -0400
Why can't we just add noise to the data to de-identify it? http://www.ehealthinformation.ca/knowledgebase/article/AA-00130 A method that is sometimes used to de-identify data sets is to add noise to the values of the variables. For example, a random number of days is added to a date of birth to create a perturbed date of birth. You can also add noise to location data by moving a postal code to a randomly selected adjacent postal code.
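
As a minimal sketch of the date perturbation just described, the following shifts a date of birth by a random number of days. The function name, the 30-day window, and the fixed seed are assumptions made for illustration only.

```python
import random
from datetime import date, timedelta


def perturb_date(original, max_shift_days, rng):
    """Shift a date by a random non-zero number of days within +/- max_shift_days.

    max_shift_days must be at least 1, otherwise no non-zero shift exists.
    """
    shift = 0
    while shift == 0:
        shift = rng.randint(-max_shift_days, max_shift_days)
    return original + timedelta(days=shift)


rng = random.Random(42)  # seeded only so the illustration is repeatable
true_dob = date(1959, 1, 1)
noisy_dob = perturb_date(true_dob, max_shift_days=30, rng=rng)
print(true_dob, "->", noisy_dob)
```

The recipient of `noisy_dob` has no way of telling how far it is from the true value, which is the source of the trust problem discussed next.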

In practice we have found that the data recipients do not like this approach to de-identification because they cannot trust the data anymore. For example, if we have a 50 year old male with cancer, it would not be known whether he was really 50 years old or 55 years old or 45 years old. The shift in age may make a difference in the analysis and in the conclusions drawn. Data recipients are concerned about drawing incorrect conclusions from the data because of perturbation.

The same mistrust issues come up with another technique called "microaggregation". The basic idea here is to identify a cluster of similar records in the data set and then replace the actual values with the average (or median) of that cluster. For example, the age would be replaced with the average age of the cluster. This is similar to the approach called "hot-deck imputation" that is used to deal with missing data. Again, the data recipients' reaction has been that they cannot trust the values. If a cluster inadvertently contains an outlier or an influential observation then the average may be distorted excessively, and incorrect or inaccurate conclusions may be drawn.
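
The effect of an outlier on microaggregation can be seen in a small sketch. The clustering rule used here (sort the values and cut them into consecutive groups of a fixed size) is a simplifying assumption, and the ages are made up.

```python
from statistics import mean


def microaggregate(values, cluster_size):
    """Replace each value with the mean of its cluster of similar values.

    Clusters are formed by sorting the values and cutting them into
    consecutive groups of cluster_size; a shorter final group is
    averaged on its own.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [None] * len(values)
    for start in range(0, len(order), cluster_size):
        group = order[start:start + cluster_size]
        group_mean = mean(values[i] for i in group)
        for i in group:
            result[i] = group_mean
    return result


ages = [44, 45, 46, 50, 51, 95]  # hypothetical ages; 95 is an outlier
aggregated = microaggregate(ages, cluster_size=3)
print(aggregated)
```

In this made-up example the first cluster is barely changed, while the single outlier pulls the reported age of the 50 and 51 year olds up to about 65.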

For the same reason, we have found data recipients and analysts reluctant to use synthetic data. In principle, synthetic data is not real data and therefore there are no identity disclosure risks with releasing it. Also, in principle the basic (bivariate) correlational structure of the data is maintained in the synthetic data. But if an analysis is complex, the distributions are non-standard, and the multivariate correlation structure is not captured in the synthetic version, then some relationships may go undetected or be incorrectly detected in the synthetic version.

The approach that is used more often, at least in the context of health data sets, is to generalize the variables. For example, the date of birth of a cancer patient born on 1st January 1959 may be generalized to just January 1959, or even just 1959. The generalized value is still true, but has less precision. Therefore, the data can be trusted and the risk of drawing incorrect conclusions is reduced.



Sat, 31 Oct 2009 00:00:00 -0400
Which type of threshold should we use for de-identification? http://www.ehealthinformation.ca/knowledgebase/article/AA-00105

Many types of thresholds have been suggested and used for deciding when a data set is de-identified. Some common ones are:

  • Cell size of 5, 3, or 10
  • Uniqueness
  • Rareness
     

A question that comes up in practice is "which threshold should we use?".

 

In fact, all three of these are related. The general rule is:

 

 X% of the records are in cell sizes >= k (i.e., in equivalence classes of size at least k)

A common instantiation, called 5-anonymity, is:

 

 100% of the records are in cell sizes >= 5

 

This means that every combination of values on the quasi-identifiers that appears in the data set occurs at least five times.

 

The uniqueness criterion can be stated as 2-anonymity:

  

  100% of the records are in cell sizes >= 2

 

However, there are cases where 95% and 80% are acceptable values for X.

 

For example, some cancer registries release their data to researchers if less than 20% of their records are unique, and to the public if less than 5% of their records are unique.
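
The general rule above can be checked directly by counting equivalence classes. The sketch below is an illustration only; the function name, the records, and the choice of quasi-identifiers are hypothetical.

```python
from collections import Counter


def percent_in_cells_of_at_least(records, quasi_identifiers, k):
    """Percentage of records whose equivalence class has at least k records."""

    def key(record):
        # An equivalence class is one combination of quasi-identifier values.
        return tuple(record[q] for q in quasi_identifiers)

    class_sizes = Counter(key(r) for r in records)
    covered = sum(1 for r in records if class_sizes[key(r)] >= k)
    return 100.0 * covered / len(records)


# Hypothetical data set with two quasi-identifiers.
records = [
    {"age": 50, "postal": "K1H"},
    {"age": 50, "postal": "K1H"},
    {"age": 50, "postal": "K1H"},
    {"age": 62, "postal": "K2P"},
    {"age": 62, "postal": "K2P"},
    {"age": 35, "postal": "K1N"},
]
qis = ["age", "postal"]

pct_not_unique = percent_in_cells_of_at_least(records, qis, k=2)
pct_cells_of_3 = percent_in_cells_of_at_least(records, qis, k=3)
print(pct_not_unique, pct_cells_of_3)
```

With these made-up records one of six records is unique, so the data set would not satisfy 2-anonymity with X = 100%, but it would pass a rule that tolerates fewer than 20% unique records.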

 

The third criterion, rareness, means one has to ensure that there are no rare records. The general rule here is:

 

 all equivalence classes have >X% of the records in the population

 

This rule ensures that there are no equivalence classes that are relatively rare. Rareness is often defined in terms of the population not in terms of the records in the data set.

 

For example, some national statistical agencies will not disclose census information if any equivalence class covers less than or equal to 0.5% of the population. This is the rule used to justify not releasing individual ages above 89 years, because very few people live beyond that age (i.e., fewer than 0.5% of the population are at each individual age of 90 or above).
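
A rareness check compares each equivalence class against the population rather than against the data set. The following sketch assumes population counts per class are available; the counts shown are invented for illustration and are not real census figures.

```python
def rare_classes(population_counts, min_percent):
    """Equivalence classes that cover min_percent or less of the population."""
    total = sum(population_counts.values())
    return sorted(
        name
        for name, count in population_counts.items()
        if 100.0 * count / total <= min_percent
    )


# Hypothetical population counts by age value.
population_counts = {
    "under 88": 977000,
    "88": 9000,
    "89": 7000,
    "90": 4000,
    "91": 3000,
}

flagged = rare_classes(population_counts, min_percent=0.5)
print(flagged)
```

Classes flagged this way would be candidates for merging into a broader category (such as a single "90+" group) before release.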

 

The question is, which one of the above rules should be used and what values should be relied on? There are no hard rules on this, but a reasonable approach is to use precedent.

 

The argument for using precedent is that it signifies acceptability. If a particular rule has a lot of precedent then it suggests that society has accepted the level of risk implied by the rule. For example, there is a lot of precedent spanning multiple decades for the cell size of five rule, so it is safe to assume that this is a generally accepted level of risk.

 

Precedent may be specific to a certain type of data or registry. For example, some precedents may be more acceptable for the disclosure of cancer registry data, but may not be acceptable for sexually transmitted disease or mental health data. Also, of course, it will depend on who the data is being disclosed to.



Tue, 13 Oct 2009 00:00:00 -0400
Definition of identifiable dataset - if a person can find their record(s) in the dataset http://www.ehealthinformation.ca/knowledgebase/article/AA-00101 One question that sometimes comes up is whether a data set can be considered identifiable if a person can find their own record(s) in it. This definition can be analyzed from a number of different perspectives.

A person may not know if they are in a data set if the data set is a sample. One example is if a data set is based on chart reviews from a random subset of patients at a clinic, then any randomly selected patient in that clinic will not necessarily know that they are in the data set created from the chart review. This uncertainty means that the above definition of identifiability may not be appropriate. One primary reason is that if a patient finds a record that matches their own characteristics they will not know if that record really belongs to them or to someone else.

As a caveat, if the clinic is small then the chances of another patient having exactly the same characteristics would also be small. Also, if the number of variables extracted from the charts is large, then it is less likely that there would be another patient at the clinic who is similar on all of the variables.

Another scenario is if a patient does know that s/he is in a data set. There are a couple of ways that a patient can know that their record is in that data set:

  • If the patient knows that they are unique in the population, and they find a match in the chart review sample, then they would be confident that their own record has been discovered. But this assumes that the patient knows that they are unique in the population. There are some circumstances where that knowledge is reasonable. For example, Canadians living in urban areas are unique on their date of birth and residence postal code. Therefore, if a patient uses these two variables and finds a match in the chart review sample, it is almost certain that the record is their own.
  • If the data set is not a sample but a whole population, say as in a population registry, then the patient would know for sure that they are in the data set, for example, if the data set is a provincial cancer registry. If the patient finds a single record that matches then s/he will know for sure that it was his/her record. If the patient is unique, then the patient will know for sure that s/he has discovered his/her own record.

 

Let us assume that one of the above two conditions is true. In that case, is the fact that the patient can find their own record a workable definition of an identifiable data set?

If we accept the above definition then we are setting a high standard. A common way we model an intruder is to consider the kind of background knowledge the intruder would have about the target person (or persons) being re-identified. The more background information the intruder has the greater the re-identification risk. A person will have the maximum possible background information about themselves (i.e., if the intruder is also the target person being re-identified); much more than any other intruder would know. It is true that many people tell their friends and family many things, but they do not tell them absolutely everything. Therefore, the background knowledge of a person about themselves represents the maximum possible background information and therefore the maximum possible risk. If one wants to be conservative, then this is a good approach. But in many cases assuming that an intruder will know absolutely everything does not seem very plausible and sets quite a high standard. In fact, the standard would be so high that we would not be able to share any information at all unless:

  • the data set disclosed is a random sample so that an individual would not know if their record is within the data set (i.e., no population registry could be considered de-identified almost by definition),
  • the sample data set does not include many variables so that there would be other individuals with the same characteristics in the population (e.g., the clinic example mentioned above), and
  • the underlying population is large enough that the chances of an individual being unique are quite small.


A counterargument that can be made is that people are now voluntarily (and involuntarily through their friends and colleagues) revealing more and more about themselves on their blogs, Facebook pages, and Tweets. This is certainly the case and more and more is being revealed every day. Whether this type of self-exposure of personal information amounts to individuals revealing everything about themselves, such that an intruder has the same background knowledge as the person themselves, remains an empirical question, although it is easy to argue that we have not quite reached that point yet.

Another scenario to consider is when the following four conditions are met:

  • the data set has some quasi-identifiers and some sensitive information (an example of the quasi-identifiers would be the demographics),
  • there are only two individuals in the data set that have exactly the same values on the quasi-identifiers,
  • one of those individuals, say Bob, gets the data set, and
  • Bob knows the second person who has the same characteristics, Joe.


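Under these conditions, Bob would discover the sensitive information about Joe with certainty: once Bob recognizes his own record, the only other record with the same quasi-identifier values must be Joe's. Therefore, re-identifying one's own record results in the disclosure of sensitive information about another individual.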

The approach we have taken is to define plausible intruders (or archetypes of intruders) and assess what type of background knowledge they would have. The three we consider are a neighbor, an ex-spouse, and a reporter. We selected these three because all of the re-identifications that have actually happened and that have been publicly acknowledged have been done by researchers, by reporters, or in court cases. All three acknowledged types of intruders can use publicly available information (e.g., in public registries), and all three can talk to neighbors or ex-spouses to get additional information. Therefore, by focusing on these three archetypes we are addressing plausible risks that we know have materialized.

Furthermore, we always ensure that there are more than two records with the same values on the quasi-identifiers. That way, the re-identification of one's own record does not facilitate the discovery of new information about someone else.
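
As an illustration of the check described above, here is a minimal Python sketch that flags combinations of quasi-identifier values shared by fewer than three records. The function names, field names, and toy records are hypothetical and chosen only for this example; they are not taken from any of the tools mentioned in this knowledgebase.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def violating_classes(records, quasi_identifiers, min_size=3):
    """Return the combinations shared by fewer than min_size records."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return {combo: n for combo, n in sizes.items() if n < min_size}


# Hypothetical toy data: the first two records play the roles of Bob and Joe.
records = [
    {"age": 34, "gender": "M", "fsa": "K1H", "diagnosis": "A"},
    {"age": 34, "gender": "M", "fsa": "K1H", "diagnosis": "B"},
    {"age": 51, "gender": "F", "fsa": "K2P", "diagnosis": "C"},
    {"age": 51, "gender": "F", "fsa": "K2P", "diagnosis": "A"},
    {"age": 51, "gender": "F", "fsa": "K2P", "diagnosis": "B"},
]
qi = ["age", "gender", "fsa"]

# Only the two-record class (Bob and Joe) is flagged.
small = violating_classes(records, qi)
print(small)
```

With a class of only two records, either member can attribute the other record to the second person; requiring at least three records per class removes that certainty.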



The author(s) retain all copyright to this knowledgebase article. Please include a citation to the web page if you reuse this material. More information is available at our lab web site: http://www.ehealthinformation.ca/.

 

]]>
Mon, 12 Oct 2009 00:00:00 -0400
What are the different types of disclosure risk? http://www.ehealthinformation.ca/knowledgebase/article/AA-00119

There are two general kinds of re-identification risk that are of concern. The first is when an intruder can assign an identity to any record in the disclosed database. For example, the intruder would be able to determine that record number 7 in the disclosed database belongs to patient Alice Smith. This is called identity disclosure. The second is when an intruder learns something new about a patient in the database without knowing which specific record belongs to that patient. For example, if all patients from a particular area in the emergency database had a certain test result, then an intruder does not need to know which record belongs to Alice Smith. If she lives in that particular area, the intruder will discover sensitive information about her. This is called attribute disclosure.
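
To make the attribute disclosure example concrete, the following Python sketch lists the groups in which every record carries the same sensitive value. It is an illustrative assumption about how one might screen for this situation, not a description of any specific tool; the function name, field names, and toy records are hypothetical.

```python
from collections import defaultdict


def homogeneous_groups(records, group_keys, sensitive):
    """Return groups whose records all share one sensitive value.

    Anyone known to belong to such a group has that value disclosed,
    even though no individual record is linked to them. A group that
    contains a single record is trivially homogeneous as well.
    """
    values = defaultdict(set)
    for r in records:
        values[tuple(r[k] for k in group_keys)].add(r[sensitive])
    return {g: next(iter(v)) for g, v in values.items() if len(v) == 1}


# Hypothetical emergency-department extract.
records = [
    {"area": "K1H", "test_result": "positive"},
    {"area": "K1H", "test_result": "positive"},
    {"area": "K1H", "test_result": "positive"},
    {"area": "K2P", "test_result": "positive"},
    {"area": "K2P", "test_result": "negative"},
]

at_risk = homogeneous_groups(records, ["area"], "test_result")
print(at_risk)
```

In this toy data every patient from area K1H has the same result, so knowing that Alice Smith lives there is enough to learn her result without locating her record.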


All known examples of re-identification are identity disclosures. Therefore, a strong case can be made that this is the type of disclosure risk that we should focus on first, and should try to manage.

There are three sub-types of risk under identity disclosure: (a) prosecutor risk, (b) journalist risk, and (c) marketer risk. Another KnowledgeBase article explains the difference between prosecutor and journalist risk: [view here].

k-Anonymity algorithms are often used to manage prosecutor and journalist re-identification risk.




]]>
Mon, 19 Oct 2009 00:00:00 -0400
Research Ethics Board Wizard - Re-identification risk assessment without data http://www.ehealthinformation.ca/knowledgebase/article/AA-00172 Research Ethics Boards (REBs) often have to make decisions about re-identification risk before any data is collected. For many REBs the majority of their protocols are not "secondary use" protocols whereby a database exists and the investigator wishes to analyze that data. Rather, many are prospective studies where new data will be collected. Traditional re-identification risk assessment tools and de-identification tools could not handle that situation because they required the data to already exist - until now.

The REB Wizard tool that is illustrated in this video provides REBs the capability to assess re-identification risk by just describing the fields that will be collected and which part of the country (REB Wizard only exists for Canada at this point) the data will be collected from. Based on extensive analysis of the Canadian census, we have constructed models that would then provide an estimate of the percentage of the population that is at high risk of re-identification. In this case re-identification risk is measured in terms of uniqueness.







]]>
Tue, 05 Oct 2010 00:00:00 -0400
Are Canadians identifiable by their age, gender, and residence forward sortation area? http://www.ehealthinformation.ca/knowledgebase/article/AA-00117 Many studies collect the combination of age, gender, and residence Forward Sortation Area (FSA), and many disclosed datasets include these three variables. Does that represent a privacy risk?

In one of our studies we analyzed the Canadian census (2001), and one of the questions that we attempted to answer was this one. The study is available from here: http://www.jamia.org/cgi/content/abstract/16/2/256

Our conclusion was that only a small proportion of Canadians are unique on these three variables (we use uniqueness as a measure of re-identification risk; for a discussion of this issue see this KnowledgeBase post: [view here]). There is variation across the country, with the largest percentage that is unique in New Brunswick. Here is a table showing the percentage of the population unique on these three variables:

Province            Percentage of the population unique
Alberta             16%
British Columbia    13%
Manitoba            12%
New Brunswick       49%
Newfoundland        17%
Nova Scotia         18%
Ontario             9%
PEI                 10%
Quebec              16%
Saskatchewan        7%

An important question is whether or not these numbers are too high. Also, note as a caveat that these are estimates of uniqueness.
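
For readers who want to see what "percentage unique" means operationally, here is a minimal Python sketch that counts records whose combination of quasi-identifier values appears exactly once. It assumes a complete listing of records is available, which is a simplification: the figures above are model-based estimates from the census, not direct counts produced this way. The function name, field names, and toy records are hypothetical.

```python
from collections import Counter


def percent_unique(records, quasi_identifiers):
    """Percentage of records whose quasi-identifier combination occurs only once."""
    if not records:
        return 0.0
    sizes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    unique = sum(1 for n in sizes.values() if n == 1)
    return 100.0 * unique / len(records)


# Hypothetical toy population of four people.
records = [
    {"age": 34, "gender": "M", "fsa": "K1H"},
    {"age": 34, "gender": "M", "fsa": "K1H"},
    {"age": 51, "gender": "F", "fsa": "K2P"},
    {"age": 29, "gender": "F", "fsa": "K1S"},
]

# Two of the four records are unique on age, gender, and FSA.
pct = percent_unique(records, ["age", "gender", "fsa"])
print(pct)
```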

By most standards these numbers would be considered high. One solution is to generalize age into two-, five-, or ten-year intervals while keeping the FSA intact. The percentage of the population unique under these three modifications is as follows:

Percentage of the population unique, by age interval:

Province            2-year interval    5-year interval    10-year interval
Alberta             8%                 4%                 2%
British Columbia    7%                 1%                 1%
Manitoba            8%                 5%                 2%
New Brunswick       41%                30%                25%
Newfoundland        9%                 5%                 2%
Nova Scotia         14%                7%                 5%
Ontario             4%                 2%                 1%
PEI                 3%                 3%                 3%
Quebec              9%                 4%                 1%
Saskatchewan        3%                 3%                 2%

Based on these results, we can say with some confidence that uniqueness is quite low with 10-year age intervals when gender and FSA are also collected/disclosed. The exception is New Brunswick, where uniqueness remains quite high even at a 10-year age interval. Where the custodian is comfortable with the percentage of uniques for the 5-year or even the 2-year age interval (for example, because other actions are taken to manage re-identification risk), the custodian may disclose data on that basis. This recommendation is based on the best evidence available today, and the percentages of uniques presented above should be seen as ceiling values on risk (i.e., they are conservative values). Using them means that you are being extra cautious.
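
The effect of generalizing age can be sketched in a few lines of Python. This is an illustrative example under stated assumptions: intervals are aligned to multiples of the interval width (e.g., 30-34, 35-39), which the article does not specify, and the function names, field names, and toy records are hypothetical.

```python
from collections import Counter


def generalize_age(age, width):
    """Map an exact age to an interval label of the given width, e.g. 34 -> '30-34'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def percent_unique(rows):
    """Percentage of rows (tuples of quasi-identifier values) that occur only once."""
    if not rows:
        return 0.0
    sizes = Counter(rows)
    return 100.0 * sum(1 for n in sizes.values() if n == 1) / len(rows)


def percent_unique_at_width(records, width):
    """Uniqueness on (age interval, gender, FSA), with the FSA left intact."""
    rows = [(generalize_age(r["age"], width), r["gender"], r["fsa"]) for r in records]
    return percent_unique(rows)


# Hypothetical toy population.
records = [
    {"age": 30, "gender": "M", "fsa": "K1H"},
    {"age": 33, "gender": "M", "fsa": "K1H"},
    {"age": 41, "gender": "F", "fsa": "K2P"},
    {"age": 47, "gender": "F", "fsa": "K2P"},
]

# Uniqueness falls as the age interval widens (width 1 means exact age).
results = {w: percent_unique_at_width(records, w) for w in (1, 5, 10)}
print(results)
```

Wider intervals merge more people into the same combination of values, which is why the percentages in the table above fall from the 2-year to the 10-year column.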

Also note that there is considerable active research on this issue. Therefore, it is plausible that we will provide more updated and precise guidance on the disclosure of these demographics in the future.




  ]]>
Sat, 17 Oct 2009 00:00:00 -0400