e-Learning Course on Anonymizing Data


This in-depth course on anonymizing data is intended for practitioners who want to apply the best guidelines and strategies to the de-identification of personally identifiable information. The course summarizes the case for de-identifying data (beyond simply masking direct identifiers), explains disclosure risks, describes methods for properly de-identifying personally identifiable data using a risk-based strategy that incorporates contractual and security controls, and provides guidance on an appropriate governance framework for handling personal information.

The course contains twelve e-learning modules and runs approximately 5 hours.

Module 1: Introduction

This module provides an outline of the course content and introduces some key overarching concepts.

Module 2: Framework to Data Privacy

Anita Fineberg, a legal expert in data privacy, outlines the legal framework in Canada, and contrasts this with international laws. The distinction between primary and secondary purposes is discussed, including topics such as permitted uses and disclosures, consent, and how anonymization fits into the framework.

Module 3: Understanding Disclosure Risk

It is important to appreciate disclosure risks, the details of well-known re-identification attacks (e.g., AOL, Netflix Prize), and how risk can be reasonably managed within a framework that includes data sharing agreements and security and privacy practices. This module also introduces guidelines and strategies for sharing anonymized data.

Module 4: Classification of Identifiers

To anonymize personal information, one first needs to understand which elements in the data are identifying, and the difference between direct and indirect (quasi-) identifiers. This module provides examples of identifiers and outlines a process that can be followed to determine whether a field is identifying.

Module 5: Data Masking

This module outlines some common data masking techniques that are defensibly used in practice and how they are applied to direct identifiers.
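
As an illustration only, one masking technique often applied to direct identifiers is pseudonymization with a keyed hash. The sketch below is not taken from the course; the function names (`pseudonymize`, `mask_record`), the field names, and the token length are hypothetical choices made for the example.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes, length: int = 16) -> str:
    """Replace a direct identifier with a keyed token (HMAC-SHA256, truncated).

    The same input and key always yield the same token, so records can still
    be linked, but the original value cannot be read back from the token.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def mask_record(record: dict, direct_identifiers: set, secret_key: bytes) -> dict:
    """Return a copy of the record with every direct identifier pseudonymized.

    Fields not listed in direct_identifiers are copied through unchanged.
    """
    masked = {}
    for field, value in record.items():
        if field in direct_identifiers:
            masked[field] = pseudonymize(str(value), secret_key)
        else:
            masked[field] = value
    return masked
```

For example, `mask_record({"name": "Jane Doe", "age": 42}, {"name"}, key)` would replace the name with a 16-character token and leave the age untouched. Masking alone does not address quasi-identifiers such as age, which is why the later modules deal with risk measurement and de-identification.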

Module 6: Data Risk

This module focuses on estimating data risk under the assumption that an attack occurs (in public or non-public data releases), and on the well-established algorithms in this area. Risk thresholds for public data releases are also discussed.
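
To make the idea of data risk concrete, a common way to estimate it is to group records into equivalence classes (records that share the same quasi-identifier values) and treat the risk for each record as one over the size of its class. The sketch below assumes this approach and measures risk within the dataset itself; the course may use different or more refined estimators, and the function names are hypothetical.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def maximum_risk(records, quasi_identifiers):
    """Risk of the most exposed record: 1 / (size of the smallest class)."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return 1.0 / min(sizes.values())


def average_risk(records, quasi_identifiers):
    """Mean of 1 / class size over all records.

    Summing 1/f over the f records of a class gives 1 per class, so the mean
    equals (number of classes) / (number of records).
    """
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return len(sizes) / len(records)
```

Maximum risk is the more conservative measure and is typically the one considered for public releases, where any record could be targeted; average risk is sometimes used when access to the data is controlled.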

Module 7: Contextual Risk

This module introduces the plausible attacks that need to be evaluated and shows how to model them. These models consider security controls and contractual obligations, as well as the use of expert probabilities in estimating risk. Risk thresholds for non-public data releases are also discussed.
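
One simple way to picture how context enters the calculation is to weight the data risk by the probability that each plausible attack actually occurs, and to take the worst case. This is a sketch under that assumption and is not necessarily the model taught in the module; the attack names and probabilities in the usage note are hypothetical placeholders for values that would come from expert judgment about security controls and contractual obligations.

```python
def contextual_risk(data_risk: float, attack_probabilities: dict) -> float:
    """Combine data risk with the probability of each plausible attack.

    data_risk: probability of re-identification given that an attack occurs.
    attack_probabilities: maps an attack name to the (assumed, expert-supplied)
        probability that the attack is attempted in this release context.
    Returns the largest product of the two across all attacks.
    """
    if not 0.0 <= data_risk <= 1.0:
        raise ValueError("data_risk must be between 0 and 1")
    if not attack_probabilities:
        raise ValueError("at least one attack must be supplied")
    for name, probability in attack_probabilities.items():
        if not 0.0 <= probability <= 1.0:
            raise ValueError(f"probability for {name!r} must be between 0 and 1")
    return max(p * data_risk for p in attack_probabilities.values())
```

With a data risk of 0.2 and hypothetical attack probabilities of 0.1 (deliberate attempt) and 0.25 (data breach), the overall risk would be driven by the breach scenario. Stronger controls lower the attack probabilities, which in turn allows less aggressive de-identification for the same threshold.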

Module 8: De-identification of Cross-Sectional Data

Once the re-identification risk is determined, and an acceptable risk threshold is set, de-identification techniques can be applied to reduce the risk to an acceptable level while preserving analytic utility. This module discusses common de-identification techniques that are defensibly used in practice.
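
Two widely used de-identification techniques are generalization (reducing the precision of a value) and suppression (removing records or values that remain too distinctive). The sketch below illustrates both; it is not drawn from the course, and the function names, the ten-year age bands, and the three-character postal code prefix are assumptions chosen for the example.

```python
from collections import Counter


def generalize_age(age: int, band_width: int = 10) -> str:
    """Replace an exact age with a band such as '40-49'."""
    lower = (age // band_width) * band_width
    return f"{lower}-{lower + band_width - 1}"


def generalize_postal_code(postal_code: str, keep: int = 3) -> str:
    """Keep only the leading characters of a postal code (spaces removed)."""
    return postal_code.replace(" ", "")[:keep]


def suppress_small_classes(records, quasi_identifiers, k):
    """Drop records whose quasi-identifier combination occurs fewer than k times.

    After suppression, every remaining record shares its quasi-identifier
    values with at least k - 1 others, so the maximum risk is at most 1 / k.
    """
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] >= k]
```

In practice, generalization is usually tried first because it retains every record, and suppression is applied only to what generalization cannot fix; both reduce analytic utility, so the amount applied is tied to the chosen risk threshold.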

Module 9: De-identification of Longitudinal Data

This module discusses special considerations in de-identifying longitudinal data so that analytic utility is preserved while patients are tracked over time.
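
One technique sometimes used for longitudinal data is date shifting: every date belonging to a patient is moved by the same offset, which hides the true dates while preserving the intervals between visits. The sketch below assumes that approach and is not necessarily the method covered in the module; the function names, the keyed-hash derivation of the offset, and the 180-day window are hypothetical.

```python
import hashlib
import hmac
from datetime import date, timedelta


def patient_offset_days(patient_id: str, secret_key: bytes,
                        max_shift_days: int = 180) -> int:
    """Derive a stable per-patient offset in [-max_shift_days, +max_shift_days].

    The offset comes from a keyed hash of the patient ID, so it is repeatable
    across extracts without having to store a lookup table.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).digest()
    span = 2 * max_shift_days + 1
    return int.from_bytes(digest[:8], "big") % span - max_shift_days


def shift_visit_dates(patient_id: str, visit_dates, secret_key: bytes,
                      max_shift_days: int = 180):
    """Shift all of one patient's dates by that patient's single offset."""
    offset = timedelta(days=patient_offset_days(patient_id, secret_key, max_shift_days))
    return [d + offset for d in visit_dates]
```

Because the offset is constant within a patient, the number of days between any two visits is unchanged, which keeps analyses such as time-to-event or length of stay intact. Seasonality and day-of-week effects can be distorted, so the shift window is a trade-off.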

Module 10: Anonymizing Unstructured Data

Free-form text fields can contain both direct and quasi-identifiers that need to be masked or de-identified, respectively. This module discusses the unique challenges presented by unstructured data that require the use of different guidelines and methods than those applied to structured data.
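
As a small illustration of why free text is harder than structured fields, the sketch below masks two kinds of direct identifiers (email addresses and North American style phone numbers) using regular expressions. The patterns and the `mask_text` function are hypothetical and deliberately simple; they are not the methods presented in the module.

```python
import re

# Simplified patterns for illustration; real text needs broader coverage.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\(?\b\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b")


def mask_text(text: str) -> str:
    """Replace email addresses and phone numbers with placeholder tags."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text
```

Applied to "Call Jane at 613-555-0100 or jane@example.com.", this masks the phone number and the email address but leaves the name "Jane" in place. Names, locations, and quasi-identifiers embedded in narrative text generally cannot be caught by fixed patterns alone, which is the kind of challenge that sets unstructured data apart.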

Module 11: Governance and Policies

This module outlines considerations in establishing a governance program and ensuing policies. A governance framework and policies ensure that the steps outlined in the previous modules are implemented in a repeatable and consistent way, and that practices remain current across an organization.

Module 12: Conclusion

This module recaps some of the important concepts presented in the course and provides an example of how data anonymization can be used in practice.

Funding for the development of this course was provided by the Office of the Privacy Commissioner of Canada (OPC) Contributions Program. The views expressed herein are those of the presenters and do not necessarily reflect those of the OPC.