Anonymizing health data : case studies and methods to get you started

Where to find it

Law Library — 1st Floor Collection (1st floor)

Call Number
R853.S7 E44 2013
Status
Available

Authors, etc.

Names: Khaled El Emam; Luk Arbuckle

Summary

Updated as of August 2014, this practical book demonstrates proven methods for anonymizing health data to help your organization share meaningful datasets without exposing patient identity. Leading experts Khaled El Emam and Luk Arbuckle walk you through a risk-based methodology, using case studies from their efforts to de-identify hundreds of datasets.

Clinical data is valuable for research and other types of analytics, but making it anonymous without compromising data quality is tricky. This book demonstrates techniques for handling different data types, based on the authors' experiences with a maternal-child registry, inpatient discharge abstracts, health insurance claims, electronic medical record databases, and the World Trade Center disaster registry, among others.

  • Understand different methods for working with cross-sectional and longitudinal datasets
  • Assess the risk of adversaries who attempt to re-identify patients in anonymized datasets
  • Reduce the size and complexity of massive datasets without losing key information or jeopardizing privacy
  • Use methods to anonymize unstructured free-form text data
  • Minimize the risks inherent in geospatial data, without omitting critical location-based health information
  • Look at ways to anonymize coding information in health data
  • Learn the challenge of anonymously linking related datasets

Contents

  • Preface p. ix
  • 1 Introduction p. 1
  • To Anonymize or Not to Anonymize p. 1
  • Consent, or Anonymization? p. 2
  • Penny Pinching p. 3
  • People Are Private p. 4
  • The Two Pillars of Anonymization p. 4
  • Masking Standards p. 5
  • De-Identification Standards p. 5
  • Anonymization in the Wild p. 8
  • Organizational Readiness p. 8
  • Making It Practical p. 9
  • Use Cases p. 10
  • Stigmatizing Analytics p. 12
  • Anonymization in Other Domains p. 13
  • About This Book p. 15
  • 2 A Risk-Based De-Identification Methodology p. 19
  • Basic Principles p. 19
  • Steps in the De-Identification Methodology p. 21
  • Step 1 Selecting Direct and Indirect Identifiers p. 21
  • Step 2 Setting the Threshold p. 22
  • Step 3 Examining Plausible Attacks p. 23
  • Step 4 De-Identifying the Data p. 25
  • Step 5 Documenting the Process p. 26
  • Measuring Risk Under Plausible Attacks p. 26
  • T1 Deliberate Attempt at Re-Identification p. 26
  • T2 Inadvertent Attempt at Re-Identification p. 28
  • T3 Data Breach p. 29
  • T4 Public Data p. 30
  • Measuring Re-Identification Risk p. 30
  • Probability Metrics p. 30
  • Information Loss Metrics p. 32
  • Risk Thresholds p. 35
  • Choosing Thresholds p. 35
  • Meeting Thresholds p. 38
  • Risky Business p. 39
  • 3 Cross-Sectional Data: Research Registries p. 43
  • Process Overview p. 43
  • Secondary Uses and Disclosures p. 43
  • Getting the Data p. 46
  • Formulating the Protocol p. 47
  • Negotiating with the Data Access Committee p. 48
  • BORN Ontario p. 49
  • BORN Data Set p. 50
  • Risk Assessment p. 51
  • Threat Modeling p. 51
  • Results p. 52
  • Year on Year: Reusing Risk Analyses p. 53
  • Final Thoughts p. 54
  • 4 Longitudinal Discharge Abstract Data: State Inpatient Databases p. 57
  • Longitudinal Data p. 58
  • Don't Treat It Like Cross-Sectional Data p. 60
  • De-Identifying Under Complete Knowledge p. 61
  • Approximate Complete Knowledge p. 63
  • Exact Complete Knowledge p. 64
  • Implementation p. 65
  • Generalization Under Complete Knowledge p. 65
  • The State Inpatient Database (SID) of California p. 66
  • The SID of California and Open Data p. 66
  • Risk Assessment p. 68
  • Threat Modeling p. 68
  • Results p. 68
  • Final Thoughts p. 69
  • 5 Dates, Long Tails, and Correlation: Insurance Claims Data p. 71
  • The Heritage Health Prize p. 71
  • Date Generalization p. 72
  • Randomizing Dates Independently of One Another p. 72
  • Shifting the Sequence, Ignoring the Intervals p. 73
  • Generalizing Intervals to Maintain Order p. 74
  • Dates and Intervals and Back Again p. 76
  • A Different Anchor p. 77
  • Other Quasi-Identifiers p. 77
  • Connected Dates p. 78
  • Long Tails p. 78
  • The Risk from Long Tails p. 79
  • Threat Modeling p. 80
  • Number of Claims to Truncate p. 80
  • Which Claims to Truncate p. 82
  • Correlation of Related Items p. 83
  • Expert Opinions p. 84
  • Predictive Models p. 85
  • Implications for De-Identifying Data Sets p. 85
  • Final Thoughts p. 86
  • 6 Longitudinal Events Data: A Disaster Registry p. 89
  • Adversary Power p. 90
  • Keeping Power in Check p. 90
  • Power in Practice p. 91
  • A Sample of Power p. 92
  • The WTC Disaster Registry p. 94
  • Capturing Events p. 94
  • The WTC Data Set p. 95
  • The Power of Events p. 96
  • Risk Assessment p. 98
  • Threat Modeling p. 99
  • Results p. 99
  • Final Thoughts p. 99
  • 7 Data Reduction: Research Registry Revisited p. 101
  • The Subsampling Limbo p. 101
  • How Low Can We Go? p. 102
  • Not for All Types of Risk p. 102
  • BORN to Limbo! p. 103
  • Many Quasi-Identifiers p. 104
  • Subsets of Quasi-Identifiers p. 105
  • Covering Designs p. 106
  • Covering BORN p. 108
  • Final Thoughts p. 109
  • 8 Free-Form Text: Electronic Medical Records p. 111
  • Not So Regular Expressions p. 111
  • General Approaches to Text Anonymization p. 112
  • Ways to Mark the Text as Anonymized p. 114
  • Evaluation Is Key p. 115
  • Appropriate Metrics, Strict but Fair p. 117
  • Standards for Recall, and a Risk-Based Approach p. 118
  • Standards for Precision p. 119
  • Anonymization Rules p. 120
  • Informatics for Integrating Biology and the Bedside (i2b2) p. 121
  • i2b2 Text Data Set p. 121
  • Risk Assessment p. 123
  • Threat Modeling p. 123
  • A Rule-Based System p. 124
  • Results p. 124
  • Final Thoughts p. 126
  • 9 Geospatial Aggregation: Dissemination Areas and ZIP Codes p. 129
  • Where the Wild Things Are p. 130
  • Being Good Neighbors p. 131
  • Distance Between Neighbors p. 131
  • Circle of Neighbors p. 132
  • Round Earth p. 134
  • Flat Earth p. 135
  • Clustering Neighbors p. 136
  • We All Have Boundaries p. 137
  • Fast Nearest Neighbor p. 138
  • Too Close to Home p. 140
  • Levels of Geoproxy Attacks p. 141
  • Measuring Geoproxy Risk p. 142
  • Final Thoughts p. 144
  • 10 Medical Codes: A Hackathon p. 147
  • Codes in Practice p. 148
  • Generalization p. 149
  • The Digits of Diseases p. 149
  • The Digits of Procedures p. 151
  • The (Alpha)Digits of Drugs p. 151
  • Suppression p. 152
  • Shuffling p. 153
  • Final Thoughts p. 156
  • 11 Masking: Oncology Databases p. 159
  • Schema Shmema p. 159
  • Data in Disguise p. 160
  • Field Suppression p. 160
  • Randomization p. 161
  • Pseudonymization p. 163
  • Frequency of Pseudonyms p. 164
  • Masking On the Fly p. 165
  • Final Thoughts p. 166
  • 12 Secure Linking p. 167
  • Let's Link Up p. 167
  • Doing It Securely p. 170
  • Don't Try This at Home p. 170
  • The Third-Party Problem p. 172
  • Basic Layout for Linking Up p. 173
  • The Nitty-Gritty Protocol for Linking Up p. 174
  • Bringing Paillier to the Parties p. 174
  • Matching on the Unknown p. 175
  • Scaling Up p. 177
  • Cuckoo Hashing p. 178
  • How Fast Does a Cuckoo Run? p. 179
  • Final Thoughts p. 179
  • 13 De-Identification and Data Quality p. 181
  • Useful Data from Useful De-Identification p. 181
  • Degrees of Loss p. 182
  • Workload-Aware De-Identification p. 183
  • Questions to Improve Data Utility p. 185
  • Final Thoughts p. 187
  • Index p. 191
