Privacy-preserving data mining : models and algorithms

Where to find it

Information & Library Science Library

Call Number
QA76.9.D343 P75 2008
Status
Available

Authors, etc.

Names: Aggarwal, Charu C.; Yu, Philip S. (editors)

Summary

Advances in hardware technology have increased the capability to store and record personal data about consumers and individuals, causing concerns that personal data may be used for a variety of intrusive or malicious purposes.

Privacy-Preserving Data Mining: Models and Algorithms proposes a number of techniques for performing data mining tasks in a privacy-preserving way. These techniques generally fall into the following categories: data modification techniques, cryptographic methods and protocols for data sharing, statistical techniques for disclosure and inference control, query auditing methods, and randomization- and perturbation-based techniques.
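
As a minimal illustration of the last category, the sketch below shows additive-noise randomization in the style surveyed in Chapter 6: each numeric value is released only after independent noise from a publicly known distribution is added, so individual records are masked while aggregate statistics remain estimable. The attribute name, data values, and noise scale here are illustrative assumptions, not details from the book.

```python
# Illustrative sketch only: additive-noise randomization, one of the
# perturbation-based techniques this volume surveys. The column,
# values, and noise scale are hypothetical.
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical sensitive numeric attribute (e.g., ages of individuals).
ages = np.array([23, 37, 41, 29, 55, 62, 34, 48], dtype=float)

# Each record is released as x + r, where r is drawn independently
# from a publicly known noise distribution (Gaussian here).
noise_std = 10.0
released = ages + rng.normal(loc=0.0, scale=noise_std, size=ages.shape)

# The noise has mean 0, so the sample mean is an unbiased estimate;
# the true variance can be estimated by subtracting the known noise
# variance from the variance of the released data.
est_mean = released.mean()
est_var = released.var(ddof=1) - noise_std**2

print(f"released records:  {np.round(released, 1)}")
print(f"estimated mean:    {est_mean:.1f} (true {ages.mean():.1f})")
print(f"estimated variance: {est_var:.1f} (true {ages.var(ddof=1):.1f})")
```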

This edited volume contains surveys by distinguished researchers in the privacy field. Each survey includes the key research content as well as future research directions.

Privacy-Preserving Data Mining: Models and Algorithms is designed for researchers, professors, and advanced-level students in computer science, and is also suitable for industry practitioners.

Contents

  • Preface p. v
  • List of Figures p. xvii
  • List of Tables p. xxi
  • 1 An Introduction to Privacy-Preserving Data Mining p. 1 Charu C. Aggarwal and Philip S. Yu
  • 1.1 Introduction p. 1
  • 1.2 Privacy-Preserving Data Mining Algorithms p. 3
  • 1.3 Conclusions and Summary p. 7
  • References p. 8
  • 2 A General Survey of Privacy-Preserving Data Mining Models and Algorithms p. 11 Charu C. Aggarwal and Philip S. Yu
  • 2.1 Introduction p. 11
  • 2.2 The Randomization Method p. 13
  • 2.2.1 Privacy Quantification p. 15
  • 2.2.2 Adversarial Attacks on Randomization p. 18
  • 2.2.3 Randomization Methods for Data Streams p. 18
  • 2.2.4 Multiplicative Perturbations p. 19
  • 2.2.5 Data Swapping p. 19
  • 2.3 Group Based Anonymization p. 20
  • 2.3.1 The k-Anonymity Framework p. 20
  • 2.3.2 Personalized Privacy-Preservation p. 24
  • 2.3.3 Utility Based Privacy Preservation p. 24
  • 2.3.4 Sequential Releases p. 25
  • 2.3.5 The l-diversity Method p. 26
  • 2.3.6 The t-closeness Model p. 27
  • 2.3.7 Models for Text, Binary and String Data p. 27
  • 2.4 Distributed Privacy-Preserving Data Mining p. 28
  • 2.4.1 Distributed Algorithms over Horizontally Partitioned Data Sets p. 30
  • 2.4.2 Distributed Algorithms over Vertically Partitioned Data p. 31
  • 2.4.3 Distributed Algorithms for k-Anonymity p. 32
  • 2.5 Privacy-Preservation of Application Results p. 32
  • 2.5.1 Association Rule Hiding p. 33
  • 2.5.2 Downgrading Classifier Effectiveness p. 34
  • 2.5.3 Query Auditing and Inference Control p. 34
  • 2.6 Limitations of Privacy: The Curse of Dimensionality p. 37
  • 2.7 Applications of Privacy-Preserving Data Mining p. 38
  • 2.7.1 Medical Databases: The Scrub and Datafly Systems p. 39
  • 2.7.2 Bioterrorism Applications p. 40
  • 2.7.3 Homeland Security Applications p. 40
  • 2.7.4 Genomic Privacy p. 42
  • 2.8 Summary p. 43
  • References p. 43
  • 3 A Survey of Inference Control Methods for Privacy-Preserving Data Mining p. 53 Josep Domingo-Ferrer
  • 3.1 Introduction p. 54
  • 3.2 A Classification of Microdata Protection Methods p. 55
  • 3.3 Perturbative Masking Methods p. 58
  • 3.3.1 Additive Noise p. 58
  • 3.3.2 Microaggregation p. 59
  • 3.3.3 Data Swapping and Rank Swapping p. 61
  • 3.3.4 Rounding p. 62
  • 3.3.5 Resampling p. 62
  • 3.3.6 PRAM p. 62
  • 3.3.7 MASSC p. 63
  • 3.4 Non-perturbative Masking Methods p. 63
  • 3.4.1 Sampling p. 64
  • 3.4.2 Global Recoding p. 64
  • 3.4.3 Top and Bottom Coding p. 65
  • 3.4.4 Local Suppression p. 65
  • 3.5 Synthetic Microdata Generation p. 65
  • 3.5.1 Synthetic Data by Multiple Imputation p. 65
  • 3.5.2 Synthetic Data by Bootstrap p. 66
  • 3.5.3 Synthetic Data by Latin Hypercube Sampling p. 66
  • 3.5.4 Partially Synthetic Data by Cholesky Decomposition p. 67
  • 3.5.5 Other Partially Synthetic and Hybrid Microdata Approaches p. 67
  • 3.5.6 Pros and Cons of Synthetic Microdata p. 68
  • 3.6 Trading off Information Loss and Disclosure Risk p. 69
  • 3.6.1 Score Construction p. 69
  • 3.6.2 R-U Maps p. 71
  • 3.6.3 k-anonymity p. 71
  • 3.7 Conclusions and Research Directions p. 72
  • References p. 73
  • 4 Measures of Anonymity p. 81 Suresh Venkatasubramanian
  • 4.1 Introduction p. 81
  • 4.1.1 What is Privacy? p. 82
  • 4.1.2 Data Anonymization Methods p. 83
  • 4.1.3 A Classification of Methods p. 84
  • 4.2 Statistical Measures of Anonymity p. 85
  • 4.2.1 Query Restriction p. 85
  • 4.2.2 Anonymity via Variance p. 85
  • 4.2.3 Anonymity via Multiplicity p. 86
  • 4.3 Probabilistic Measures of Anonymity p. 87
  • 4.3.1 Measures Based on Random Perturbation p. 87
  • 4.3.2 Measures Based on Generalization p. 90
  • 4.3.3 Utility vs Privacy p. 94
  • 4.4 Computational Measures of Anonymity p. 94
  • 4.4.1 Anonymity via Isolation p. 97
  • 4.5 Conclusions and New Directions p. 97
  • 4.5.1 New Directions p. 98
  • References p. 99
  • 5 k-Anonymous Data Mining: A Survey p. 105 V. Ciriani and S. De Capitani di Vimercati and S. Foresti and P. Samarati
  • 5.1 Introduction p. 105
  • 5.2 k-Anonymity p. 107
  • 5.3 Algorithms for Enforcing k-Anonymity p. 110
  • 5.4 k-Anonymity Threats from Data Mining p. 117
  • 5.4.1 Association Rules p. 118
  • 5.4.2 Classification Mining p. 118
  • 5.5 k-Anonymity in Data Mining p. 120
  • 5.6 Anonymize-and-Mine p. 123
  • 5.7 Mine-and-Anonymize p. 126
  • 5.7.1 Enforcing k-Anonymity on Association Rules p. 126
  • 5.7.2 Enforcing k-Anonymity on Decision Trees p. 130
  • 5.8 Conclusions p. 133
  • Acknowledgments p. 133
  • References p. 134
  • 6 A Survey of Randomization Methods for Privacy-Preserving Data Mining p. 137 Charu C. Aggarwal and Philip S. Yu
  • 6.1 Introduction p. 137
  • 6.2 Reconstruction Methods for Randomization p. 139
  • 6.2.1 The Bayes Reconstruction Method p. 139
  • 6.2.2 The EM Reconstruction Method p. 141
  • 6.2.3 Utility and Optimality of Randomization Models p. 143
  • 6.3 Applications of Randomization p. 144
  • 6.3.1 Privacy-Preserving Classification with Randomization p. 144
  • 6.3.2 Privacy-Preserving OLAP p. 145
  • 6.3.3 Collaborative Filtering p. 145
  • 6.4 The Privacy-Information Loss Tradeoff p. 146
  • 6.5 Vulnerabilities of the Randomization Method p. 149
  • 6.6 Randomization of Time Series Data Streams p. 151
  • 6.7 Multiplicative Noise for Randomization p. 152
  • 6.7.1 Vulnerabilities of Multiplicative Randomization p. 153
  • 6.7.2 Sketch Based Randomization p. 153
  • 6.8 Conclusions and Summary p. 154
  • References p. 154
  • 7 A Survey of Multiplicative Perturbation for Privacy-Preserving Data Mining p. 157 Keke Chen and Ling Liu
  • 7.1 Introduction p. 158
  • 7.1.1 Data Privacy vs. Data Utility p. 159
  • 7.1.2 Outline p. 160
  • 7.2 Definition of Multiplicative Perturbation p. 161
  • 7.2.1 Notations p. 161
  • 7.2.2 Rotation Perturbation p. 161
  • 7.2.3 Projection Perturbation p. 162
  • 7.2.4 Sketch-based Approach p. 164
  • 7.2.5 Geometric Perturbation p. 164
  • 7.3 Transformation Invariant Data Mining Models p. 165
  • 7.3.1 Definition of Transformation Invariant Models p. 166
  • 7.3.2 Transformation-Invariant Classification Models p. 166
  • 7.3.3 Transformation-Invariant Clustering Models p. 167
  • 7.4 Privacy Evaluation for Multiplicative Perturbation p. 168
  • 7.4.1 A Conceptual Multidimensional Privacy Evaluation Model p. 168
  • 7.4.2 Variance of Difference as Column Privacy Metric p. 169
  • 7.4.3 Incorporating Attack Evaluation p. 170
  • 7.4.4 Other Metrics p. 171
  • 7.5 Attack Resilient Multiplicative Perturbations p. 171
  • 7.5.1 Naive Estimation to Rotation Perturbation p. 171
  • 7.5.2 ICA-Based Attacks p. 173
  • 7.5.3 Distance-Inference Attacks p. 174
  • 7.5.4 Attacks with More Prior Knowledge p. 176
  • 7.5.5 Finding Attack-Resilient Perturbations p. 177
  • 7.6 Conclusion p. 177
  • Acknowledgment p. 178
  • References p. 179
  • 8 A Survey of Quantification of Privacy Preserving Data Mining Algorithms p. 183 Elisa Bertino and Dan Lin and Wei Jiang
  • 8.1 Introduction p. 184
  • 8.2 Metrics for Quantifying Privacy Level p. 186
  • 8.2.1 Data Privacy p. 186
  • 8.2.2 Result Privacy p. 191
  • 8.3 Metrics for Quantifying Hiding Failure p. 192
  • 8.4 Metrics for Quantifying Data Quality p. 193
  • 8.4.1 Quality of the Data Resulting from the PPDM Process p. 193
  • 8.4.2 Quality of the Data Mining Results p. 198
  • 8.5 Complexity Metrics p. 200
  • 8.6 How to Select a Proper Metric p. 201
  • 8.7 Conclusion and Research Directions p. 202
  • References p. 202
  • 9 A Survey of Utility-based Privacy-Preserving Data Transformation Methods p. 207 Ming Hua and Jian Pei
  • 9.1 Introduction p. 208
  • 9.1.1 What is Utility-based Privacy Preservation? p. 209
  • 9.2 Types of Utility-based Privacy Preservation Methods p. 210
  • 9.2.1 Privacy Models p. 210
  • 9.2.2 Utility Measures p. 212
  • 9.2.3 Summary of the Utility-Based Privacy Preserving Methods p. 214
  • 9.3 Utility-Based Anonymization Using Local Recoding p. 214
  • 9.3.1 Global Recoding and Local Recoding p. 215
  • 9.3.2 Utility Measure p. 216
  • 9.3.3 Anonymization Methods p. 217
  • 9.3.4 Summary and Discussion p. 219
  • 9.4 The Utility-based Privacy Preserving Methods in Classification Problems p. 219
  • 9.4.1 The Top-Down Specialization Method p. 220
  • 9.4.2 The Progressive Disclosure Algorithm p. 224
  • 9.4.3 Summary and Discussion p. 228
  • 9.5 Anonymized Marginal: Injecting Utility into Anonymized Data Sets p. 228
  • 9.5.1 Anonymized Marginal p. 229
  • 9.5.2 Utility Measure p. 230
  • 9.5.3 Injecting Utility Using Anonymized Marginals p. 231
  • 9.5.4 Summary and Discussion p. 233
  • 9.6 Summary p. 234
  • Acknowledgments p. 234
  • References p. 234
  • 10 Mining Association Rules under Privacy Constraints p. 239 Jayant R. Haritsa
  • 10.1 Introduction p. 239
  • 10.2 Problem Framework p. 240
  • 10.2.1 Database Model p. 240
  • 10.2.2 Mining Objective p. 241
  • 10.2.3 Privacy Mechanisms p. 241
  • 10.2.4 Privacy Metric p. 243
  • 10.2.5 Accuracy Metric p. 245
  • 10.3 Evolution of the Literature p. 246
  • 10.4 The FRAPP Framework p. 251
  • 10.4.1 Reconstruction Model p. 252
  • 10.4.2 Estimation Error p. 253
  • 10.4.3 Randomizing the Perturbation Matrix p. 256
  • 10.4.4 Efficient Perturbation p. 256
  • 10.4.5 Integration with Association Rule Mining p. 258
  • 10.5 Sample Results p. 259
  • 10.6 Closing Remarks p. 263
  • Acknowledgments p. 263
  • References p. 263
  • 11 A Survey of Association Rule Hiding Methods for Privacy p. 267 Vassilios S. Verykios and Aris Gkoulalas-Divanis
  • 11.1 Introduction p. 267
  • 11.2 Terminology and Preliminaries p. 269
  • 11.3 Taxonomy of Association Rule Hiding Algorithms p. 270
  • 11.4 Classes of Association Rule Algorithms p. 271
  • 11.4.1 Heuristic Approaches p. 272
  • 11.4.2 Border-based Approaches p. 277
  • 11.4.3 Exact Approaches p. 278
  • 11.5 Other Hiding Approaches p. 279
  • 11.6 Metrics and Performance Analysis p. 281
  • 11.7 Discussion and Future Trends p. 284
  • 11.8 Conclusions p. 285
  • References p. 286
  • 12 A Survey of Statistical Approaches to Preserving Confidentiality of Contingency Table Entries p. 291 Stephen E. Fienberg and Aleksandra B. Slavkovic
  • 12.1 Introduction p. 291
  • 12.2 The Statistical Approach to Privacy Protection p. 292
  • 12.3 Data Mining Algorithms, Association Rules, and Disclosure Limitation p. 294
  • 12.4 Estimation and Disclosure Limitation for Multi-way Contingency Tables p. 295
  • 12.5 Two Illustrative Examples p. 301
  • 12.5.1 Example 1: Data from a Randomized Clinical Trial p. 301
  • 12.5.2 Example 2: Data from the 1993 U.S. Current Population Survey p. 305
  • 12.6 Conclusions p. 308
  • Acknowledgments p. 309
  • References p. 309
  • 13 A Survey of Privacy-Preserving Methods Across Horizontally Partitioned Data p. 313 Murat Kantarcioglu
  • 13.1 Introduction p. 313
  • 13.2 Basic Cryptographic Techniques for Privacy-Preserving Distributed Data Mining p. 315
  • 13.3 Common Secure Sub-protocols Used in Privacy-Preserving Distributed Data Mining p. 318
  • 13.4 Privacy-preserving Distributed Data Mining on Horizontally Partitioned Data p. 323
  • 13.5 Comparison to Vertically Partitioned Data Model p. 326
  • 13.6 Extension to Malicious Parties p. 327
  • 13.7 Limitations of the Cryptographic Techniques Used in Privacy-Preserving Distributed Data Mining p. 329
  • 13.8 Privacy Issues Related to Data Mining Results p. 330
  • 13.9 Conclusion p. 332
  • References p. 332
  • 14 A Survey of Privacy-Preserving Methods Across Vertically Partitioned Data p. 337 Jaideep Vaidya
  • 14.1 Introduction p. 337
  • 14.2 Classification p. 341
  • 14.2.1 Naïve Bayes Classification p. 342
  • 14.2.2 Bayesian Network Structure Learning p. 343
  • 14.2.3 Decision Tree Classification p. 344
  • 14.3 Clustering p. 346
  • 14.4 Association Rule Mining p. 347
  • 14.5 Outlier Detection p. 349
  • 14.5.1 Algorithm p. 351
  • 14.5.2 Security Analysis p. 352
  • 14.5.3 Computation and Communication Analysis p. 354
  • 14.6 Challenges and Research Directions p. 355
  • References p. 356
  • 15 A Survey of Attack Techniques on Privacy-Preserving Data Perturbation Methods p. 359 Kun Liu and Chris Giannella and Hillol Kargupta
  • 15.1 Introduction p. 360
  • 15.2 Definitions and Notation p. 360
  • 15.3 Attacking Additive Data Perturbation p. 361
  • 15.3.1 Eigen-Analysis and PCA Preliminaries p. 362
  • 15.3.2 Spectral Filtering p. 363
  • 15.3.3 SVD Filtering p. 364
  • 15.3.4 PCA Filtering p. 365
  • 15.3.5 MAP Estimation Attack p. 366
  • 15.3.6 Distribution Analysis Attack p. 367
  • 15.3.7 Summary p. 367
  • 15.4 Attacking Matrix Multiplicative Data Perturbation p. 369
  • 15.4.1 Known I/O Attacks p. 370
  • 15.4.2 Known Sample Attack p. 373
  • 15.4.3 Other Attacks Based on ICA p. 374
  • 15.4.4 Summary p. 375
  • 15.5 Attacking k-Anonymization p. 376
  • 15.6 Conclusion p. 376
  • Acknowledgments p. 377
  • References p. 377
  • 16 Private Data Analysis via Output Perturbation p. 383 Kobbi Nissim
  • 16.1 Introduction p. 383
  • 16.2 The Abstract Model - Statistical Databases, Queries, and Sanitizers p. 385
  • 16.3 Privacy p. 388
  • 16.3.1 Interpreting the Privacy Definition p. 390
  • 16.4 The Basic Technique: Calibrating Noise to Sensitivity p. 394
  • 16.4.1 Applications: Functions with Low Global Sensitivity p. 396
  • 16.5 Constructing Sanitizers for Complex Functionalities p. 400
  • 16.5.1 k-Means Clustering p. 401
  • 16.5.2 SVD and PCA p. 403
  • 16.5.3 Learning in the Statistical Queries Model p. 404
  • 16.6 Beyond the Basics p. 405
  • 16.6.1 Instance Based Noise and Smooth Sensitivity p. 406
  • 16.6.2 The Sample-Aggregate Framework p. 408
  • 16.6.3 A General Sanitization Mechanism p. 409
  • 16.7 Related Work and Bibliographic Notes p. 409
  • Acknowledgments p. 411
  • References p. 411
  • 17 A Survey of Query Auditing Techniques for Data Privacy p. 415 Shubha U. Nabar and Krishnaram Kenthapadi and Nina Mishra and Rajeev Motwani
  • 17.1 Introduction p. 415
  • 17.2 Auditing Aggregate Queries p. 416
  • 17.2.1 Offline Auditing p. 417
  • 17.2.2 Online Auditing p. 418
  • 17.3 Auditing Select-Project-Join Queries p. 426
  • 17.4 Challenges in Auditing p. 427
  • 17.5 Reading p. 429
  • References p. 430
  • 18 Privacy and the Dimensionality Curse p. 433 Charu C. Aggarwal
  • 18.1 Introduction p. 433
  • 18.2 The Dimensionality Curse and the k-anonymity Method p. 435
  • 18.3 The Dimensionality Curse and Condensation p. 441
  • 18.4 The Dimensionality Curse and the Randomization Method p. 446
  • 18.4.1 Effects of Public Information p. 446
  • 18.4.2 Effects of High Dimensionality p. 450
  • 18.4.3 Gaussian Perturbing Distribution p. 450
  • 18.4.4 Uniform Perturbing Distribution p. 455
  • 18.5 The Dimensionality Curse and l-diversity p. 458
  • 18.6 Conclusions and Research Directions p. 459
  • References p. 460
  • 19 Personalized Privacy Preservation p. 461 Yufei Tao and Xiaokui Xiao
  • 19.1 Introduction p. 461
  • 19.2 Formalization of Personalized Anonymity p. 463
  • 19.2.1 Personal Privacy Requirements p. 464
  • 19.2.2 Generalization p. 465
  • 19.3 Combinatorial Process of Privacy Attack p. 467
  • 19.3.1 Primary Case p. 468
  • 19.3.2 Non-primary Case p. 469
  • 19.4 Theoretical Foundation p. 470
  • 19.4.1 Notations and Basic Properties p. 471
  • 19.4.2 Derivation of the Breach Probability p. 472
  • 19.5 Generalization Algorithm p. 473
  • 19.5.1 The Greedy Framework p. 474
  • 19.5.2 Optimal SA-generalization p. 476
  • 19.6 Alternative Forms of Personalized Privacy Preservation p. 478
  • 19.6.1 Extension of k-anonymity p. 479
  • 19.6.2 Personalization in Location Privacy Protection p. 480
  • 19.7 Summary and Future Work p. 482
  • References p. 485
  • 20 Privacy-Preserving Data Stream Classification p. 487 Yabo Xu and Ke Wang and Ada Wai-Chee Fu and Rong She and Jian Pei
  • 20.1 Introduction p. 487
  • 20.1.1 Motivating Example p. 488
  • 20.1.2 Contributions and Paper Outline p. 490
  • 20.2 Related Works p. 491
  • 20.3 Problem Statement p. 493
  • 20.3.1 Secure Join Stream Classification p. 493
  • 20.3.2 Naive Bayesian Classifiers p. 494
  • 20.4 Our Approach p. 495
  • 20.4.1 Initialization p. 495
  • 20.4.2 Bottom-Up Propagation p. 496
  • 20.4.3 Top-Down Propagation p. 497
  • 20.4.4 Using NBC p. 499
  • 20.4.5 Algorithm Analysis p. 500
  • 20.5 Empirical Studies p. 501
  • 20.5.1 Real-life Datasets p. 502
  • 20.5.2 Synthetic Datasets p. 504
  • 20.5.3 Discussion p. 506
  • 20.6 Conclusions p. 507
  • References p. 508
  • Index p. 511
