An in-depth course on anonymizing data, intended for practitioners looking to apply best-practice guidelines and strategies in the de-identification of personally identifiable information. The course summarizes the case for de-identifying data (beyond simply masking direct identifiers), explains disclosure risks, describes methods for properly de-identifying personally identifiable data using a risk-based strategy that incorporates contractual and security controls, and provides guidance on an appropriate governance framework for the handling of personal information.
The course contains twelve e-learning modules and runs approximately five hours.
This module provides an outline of the course content and introduces some key overarching concepts.
Anita Fineberg, a legal expert in data privacy, outlines the legal framework in Canada, and contrasts this with international laws. The distinction between primary and secondary purposes is discussed, including topics such as permitted uses and disclosures, consent, and how anonymization fits into the framework.
This module explains disclosure risks, examines the details of well-known re-identification attacks (e.g., AOL, Netflix Prize), and shows how risk can be reasonably managed within a framework that includes data sharing agreements and security and privacy practices. It also introduces guidelines and strategies for the sharing of anonymized data.
In order to anonymize personal information, one first needs to understand what elements in the data are identifying, and the difference between direct and indirect (quasi-) identifiers. This module provides examples of identifiers and outlines a process that can be followed to determine when a field is identifying.
This module outlines some common data masking techniques that are defensibly used in practice and how they are applied to direct identifiers.
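As an illustration (not drawn from the course materials), one common masking technique for direct identifiers is pseudonymization with a keyed hash. The field names and key below are hypothetical; a minimal sketch in Python:

```python
import hashlib
import hmac

def pseudonymize(value: str, secret_key: bytes) -> str:
    # Keyed hash (HMAC): deterministic for a given key, so the same person
    # always maps to the same pseudonym, but anyone without the key
    # cannot mount a dictionary attack against the pseudonyms.
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def mask_record(record: dict, direct_identifiers: set, secret_key: bytes) -> dict:
    # Pseudonymize only the direct identifiers; other fields pass through.
    return {field: pseudonymize(value, secret_key) if field in direct_identifiers else value
            for field, value in record.items()}

record = {"name": "Jane Doe", "health_card": "1234-567-890", "age": 42}
masked = mask_record(record, {"name", "health_card"}, secret_key=b"example-key")
```

Note that determinism is a design choice: it preserves linkability across tables, which is useful for analysis but must be weighed as a residual risk.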
This module focuses on data risk estimation, assuming an attack (in public or non-public data releases), and the algorithms that are well established in this area. Risk thresholds for public data releases will also be discussed.
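To give a flavour of what data risk estimation involves (an illustrative sketch, not the course's own algorithm): under a "prosecutor" style attack, records sharing the same quasi-identifier values form an equivalence class, and the maximum risk is driven by the smallest class. In Python:

```python
from collections import Counter

def max_prosecutor_risk(records, quasi_identifiers):
    # Group records into equivalence classes by their quasi-identifier
    # values; the maximum re-identification risk is 1 divided by the
    # size of the smallest class.
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1.0 / min(classes.values())

records = [
    {"age_group": "40-49", "region": "ON", "diagnosis": "A"},
    {"age_group": "40-49", "region": "ON", "diagnosis": "B"},
    {"age_group": "50-59", "region": "QC", "diagnosis": "A"},
]
risk = max_prosecutor_risk(records, ["age_group", "region"])
# The third record is unique on its quasi-identifiers, so risk = 1.0
```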
In this module, the plausible attacks that need to be evaluated are introduced, as well as how to model these attacks. These models will consider security controls and contractual obligations, as well as the use of expert probabilities in estimating risk. Risk thresholds for non-public data releases will also be discussed.
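The shape of such a model can be sketched as a simple product of probabilities (the numbers below are hypothetical expert judgments, not values from the course):

```python
def overall_risk(pr_attempt: float, pr_reid_given_attempt: float) -> float:
    # Overall risk = Pr(attempt) x Pr(re-identification | attempt).
    # Pr(attempt) is reduced by security controls and contractual
    # obligations, and is often set by expert judgment when no
    # empirical estimate is available.
    return pr_attempt * pr_reid_given_attempt

# E.g., strong contractual and security controls might justify
# Pr(attempt) = 0.2, combined with a data risk of 0.1 given an attempt.
risk = overall_risk(0.2, 0.1)
```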
Once the re-identification risk is determined, and an acceptable risk threshold is set, de-identification techniques can be applied to reduce the risk to an acceptable level while preserving analytic utility. This module discusses common de-identification techniques that are defensibly used in practice.
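Two widely cited techniques of this kind are generalization (coarsening values) and suppression (removing records in classes that remain too small). A minimal sketch, with hypothetical field names, of what the course's material likely covers in far more depth:

```python
from collections import Counter

def generalize_age(age: int, band: int = 10) -> str:
    # Generalization: replace an exact age with a coarser band,
    # e.g. 42 -> "40-49", trading precision for lower risk.
    low = (age // band) * band
    return f"{low}-{low + band - 1}"

def suppress_small_classes(records, quasi_identifiers, k=2):
    # Suppression: drop records whose quasi-identifier combination
    # occurs fewer than k times after generalization.
    key = lambda r: tuple(r[q] for q in quasi_identifiers)
    sizes = Counter(key(r) for r in records)
    return [r for r in records if sizes[key(r)] >= k]

generalized = [
    {"age": generalize_age(42), "region": "ON"},
    {"age": generalize_age(47), "region": "ON"},
    {"age": generalize_age(55), "region": "QC"},
]
released = suppress_small_classes(generalized, ["age", "region"], k=2)
```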
This module discusses special considerations in the de-identification of longitudinal data, where the goal is to preserve analytic utility while still allowing patients to be tracked over time.
Free-form text fields can contain both direct and quasi-identifiers that need to be masked or de-identified, respectively. This module discusses the unique challenges presented by unstructured data that require the use of different guidelines and methods than those applied to structured data.
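For a sense of why free text is harder: identifiers appear in arbitrary positions and formats, so simple pattern matching only goes so far. A deliberately naive sketch (illustrative regexes only; production systems typically rely on trained NLP models rather than hand-written patterns):

```python
import re

# Hypothetical patterns for two direct-identifier types; real text
# also contains names, addresses, dates, and other identifiers that
# these patterns will miss.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def mask_text(text: str) -> str:
    # Replace each match with a typed placeholder, preserving readability.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Patient called 613-555-1234 and emailed jane@example.com re: follow-up."
masked = mask_text(note)
```

The gap between what these patterns catch and what a careful reader can still infer is precisely why unstructured data calls for different methods than structured data.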
This module outlines considerations in establishing a governance program and the accompanying policies. A governance framework and policies ensure that the steps outlined in the previous modules are implemented in a repeatable, consistent way, and that practices remain current across an organization.
This module recaps some of the important concepts presented in the course and provides an example of how data anonymization can be used in practice.
Funding for the development of this course was provided by the Office of the Privacy Commissioner of Canada (OPC) Contributions Program. The views expressed herein are those of the presenters and do not necessarily reflect those of the OPC.