DIMACS Short Course: Statistical De-identification of Confidential Health Data with
Application to the HIPAA Privacy Regulations

DIMACS Center, CoRE Building, Rutgers University

Larry Cox, ljtcox at aol.com
Daniel Barth-Jones, Wayne State University, dbjones@med.wayne.edu
Presented under the auspices of the Special Focus on Communication Security and Information Privacy and
the Special Focus on Computational and Mathematical Epidemiology.

This DIMACS short course will provide researchers, analysts and managers with an overview of the federal HIPAA Privacy regulations and an introduction to the principles and methods of statistical disclosure limitation that can be used to statistically de-identify healthcare data to meet privacy regulations.


The Health Insurance Portability and Accountability Act of 1996 (HIPAA) established the Standards for the Privacy of Individually Identifiable Health Information (the HIPAA Privacy Rule), which provides privacy protections for the protected health information (PHI) of individuals. These federal regulations became effective April 14, 2003 and have wide-reaching implications for many important uses of healthcare information.

Prior to the implementation of the privacy rule, epidemiologic, healthcare systems and other types of biomedical research had been routinely conducted with administrative healthcare data, and such analyses demonstrated considerable utility and value. The recent implementation of the HIPAA privacy standards, however, has necessitated dramatic changes in the process of conducting many analyses with administrative data. The privacy rule's "safe-harbor" provision requires the removal of 18 types of identifying information before the resulting "de-identified" data can be used without restriction. This safe-harbor approach necessitates the removal of specific dates of patient care and lower-level geographic information (such as 5-digit ZIP codes), which can greatly diminish the utility of such data for many analytic purposes. An alternative approach permitted under the privacy rule is the "statistical de-identification" of PHI, certified by an expert statistician. Conducting analyses with statistically de-identified healthcare data is an attractive option because such data can be used without privacy rule restrictions.
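To make the loss of detail under the safe-harbor approach concrete, the sketch below coarsens a single record by reducing a date of care to its year and a 5-digit ZIP code to its first three digits. It is a minimal illustration, not a compliance tool: the field names (service_date, zip5), the function generalize_record, and the restricted_zip3 parameter are assumptions made for this example, and the sketch covers only two of the 18 identifier types.

```python
from datetime import date

def generalize_record(record, restricted_zip3=frozenset()):
    """Return a coarsened copy of `record` (hypothetical field names).

    - `service_date` (a datetime.date) is reduced to `service_year`.
    - `zip5` (a string) is truncated to `zip3`; prefixes listed in
      `restricted_zip3` (a caller-supplied set, assumed here) become "000".
    """
    out = dict(record)  # shallow copy so the input is left untouched
    out["service_year"] = out.pop("service_date").year
    zip3 = out.pop("zip5")[:3]
    out["zip3"] = "000" if zip3 in restricted_zip3 else zip3
    return out

# Two visits that were distinguishable by date and ZIP code in the raw data...
visit_a = {"diagnosis": "J45", "service_date": date(2004, 3, 17), "zip5": "08854"}
visit_b = {"diagnosis": "J45", "service_date": date(2004, 11, 2), "zip5": "08816"}

# ...become identical after generalization, which is the utility cost
# described above (no seasonal or small-area analysis is possible).
coarse_a = generalize_record(visit_a)
coarse_b = generalize_record(visit_b)
```

After generalization both visits carry only the year 2004 and the prefix "088", so any analysis that depends on the timing of care within a year or on small-area geography can no longer be carried out.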

For data to be considered statistically de-identified, "statistical disclosure" analyses must be conducted and documented that determine that the re-identification risks for the data are "very small". The principles and methods of statistical disclosure analysis and disclosure limitation address the risk that persons might be identifiable from information about them in data sets, and they provide a variety of methods by which risks of disclosure can be measured and reduced to acceptably low levels.
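One simple way to illustrate how disclosure risk can be measured is to count how many records in a file share each combination of potentially identifying variables (often called quasi-identifiers). The sketch below is a minimal example under stated assumptions: the variable names, the SAMPLE records, and the functions equivalence_class_sizes and risk_summary are hypothetical, and the 1/size figure is a naive worst-case measure that assumes an intruder already knows the person is in the file. It is not the standard by which "very small" risk is judged under the privacy rule.

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifiers):
    """For each record, return how many records share its quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]

def risk_summary(records, quasi_identifiers):
    """Summarize sample-based risk; assumes `records` is non-empty."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    smallest = min(sizes)
    return {
        "min_class_size": smallest,                        # the "k" of k-anonymity
        "unique_records": sum(1 for s in sizes if s == 1),  # sample uniques
        "max_record_risk": 1 / smallest,                    # naive worst case
    }

# Simulated records (invented for illustration only).
SAMPLE = [
    {"birth_year": 1950, "sex": "F", "zip3": "088"},
    {"birth_year": 1950, "sex": "F", "zip3": "088"},
    {"birth_year": 1950, "sex": "F", "zip3": "088"},
    {"birth_year": 1962, "sex": "M", "zip3": "088"},
    {"birth_year": 1962, "sex": "M", "zip3": "089"},
]

# Risk with all three variables, then after dropping geography.
full = risk_summary(SAMPLE, ["birth_year", "sex", "zip3"])
reduced = risk_summary(SAMPLE, ["birth_year", "sex"])
```

In this simulated file, two records are unique on the full set of variables, while dropping zip3 leaves every record sharing its values with at least one other. Comparing the two summaries shows the kind of trade-off between disclosure risk and retained detail that a disclosure analysis has to document.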

Course Objectives

This two-and-a-half-day short course will provide participants with a detailed overview of the HIPAA privacy regulations, theory and methods for statistical disclosure limitation, and applied experience with disclosure limitation methods. Participants completing the course should be able to: 1) understand the permissible uses of healthcare data for various purposes under the HIPAA regulations; 2) conceptualize and document data intrusion scenarios; 3) conduct and document statistical disclosure analyses measuring disclosure risks; 4) select and use appropriate disclosure limitation methods; and 5) evaluate the associated trade-offs between disclosure risks and statistical information quality. Development of these skills should enable participants to supervise and work successfully with an expert certifying statistician.

Participants will learn about statistical disclosure for both tabular data sets and microdata files, but the primary focus will be on statistical disclosure for microdata in healthcare databases. While statistical disclosure theory will be covered in some detail, the course orientation will be practical and applied, focusing primarily on providing participants with the knowledge and experience needed to statistically de-identify healthcare datasets in accordance with the HIPAA privacy rule and to identify confidentiality problems of potential concern. Upon completing the course, participants should be able to implement, or supervise the implementation of, basic disclosure limitation analyses and methods independently, and should be prepared for further study of statistical disclosure.

Participants will be provided with lecture slides, classroom notes, and simulated example datasets. The course will include hands-on computer-based instruction in conducting disclosure analyses and implementing disclosure control methods.

Who Should Attend

Researchers (epidemiologists, biostatisticians, medical informatics and health systems scientists, etc.), analytic professionals (from business, marketing, the pharmaceutical industry, etc.) and the managers who supervise staff in these fields will benefit from this short course. Technical and management personnel in the pharmaceutical and healthcare information industries will find the course particularly useful. Participants should have some prior background in mathematics, statistics, and data/information management. Knowledge of SAS statistical software is desirable for the in-class computer instruction, but participants with experience in other statistical packages (SPSS, etc.) should also be able to complete the computer instruction portions of the class.

Document last modified on February 23, 2005.