Efficient Discovery of De-identification Policies Through a Risk-Utility Frontier

Weiyi Xia¹, Raymond Heatherly², Xiaofeng Ding³, Jiuyong Li³, Bradley Malin²

Affiliations

¹ EECS Dept., Vanderbilt University, Nashville, TN, USA.
² Biomedical Informatics Dept., Vanderbilt University, Nashville, TN, USA.
³ Computer Science Dept., University of South Australia, Mawson Lakes, SA, Australia.

PMID: 25520961
PMCID: PMC4266184
DOI: 10.1145/2435349.2435357

Efficient Discovery of De-identification Policies Through a Risk-Utility Frontier

Weiyi Xia et al. CODASPY. 2013.

. 2013:2013:59-70.

doi: 10.1145/2435349.2435357.

Authors

Weiyi Xia¹, Raymond Heatherly², Xiaofeng Ding³, Jiuyong Li³, Bradley Malin²

Affiliations

¹ EECS Dept., Vanderbilt University, Nashville, TN, USA.
² Biomedical Informatics Dept., Vanderbilt University, Nashville, TN, USA.
³ Computer Science Dept., University of South Australia, Mawson Lakes, SA, Australia.

PMID: 25520961
PMCID: PMC4266184
DOI: 10.1145/2435349.2435357

Abstract

Modern information technologies enable organizations to capture large quantities of person-specific data while providing routine services. Many organizations hope, or are legally required, to share such data for secondary purposes (e.g., validation of research findings) in a de-identified manner. In previous work, it was shown de-identification policy alternatives could be modeled on a lattice, which could be searched for policies that met a prespecified risk threshold (e.g., likelihood of re-identification). However, the search was limited in several ways. First, its definition of utility was syntactic - based on the level of the lattice - and not semantic - based on the actual changes induced in the resulting data. Second, the threshold may not be known in advance. The goal of this work is to build the optimal set of policies that trade-off between privacy risk (R) and utility (U), which we refer to as a R-U frontier. To model this problem, we introduce a semantic definition of utility, based on information theory, that is compatible with the lattice representation of policies. To solve the problem, we initially build a set of policies that define a frontier. We then use a probability-guided heuristic to search the lattice for policies likely to update the frontier. To demonstrate the effectiveness of our approach, we perform an empirical analysis with the Adult dataset of the UCI Machine Learning Repository. We show that our approach can construct a frontier closer to optimal than competitive approaches by searching a smaller number of policies. In addition, we show that a frequently followed de-identification policy (i.e., the Safe Harbor standard of the HIPAA Privacy Rule) is suboptimal in comparison to the frontier discovered by our approach.

Keywords: De-identification; Experimentation; Management; Optimization; Policy; Privacy; Security.

PubMed Disclaimer

Figures

**Figure 1. An example of a de-identification policy defined over three quasi-identifying attributes, {*Age*, *Gender*, *ZIP*}**

**Figure 2. An example of *GLB*. β is a *GLB* of α, while a is *α GLB* of γ**

**Figure 3**
An example of a de-identification policy lattice with five quasi-identifying values. Rectangular nodes depict a maximal chain, while oval nodes represent a sublattice.

**Figure 4**
An example of updating the frontier in the R-U space using policies from Figure 3. The current frontier is composed of policies mapped to the stair-step curve. In 4(a), the policy mapped to the square will be added to the frontier because it dominates policies currently on the frontier (i.e., [1, 1, 0, 0] and [1, 1, 1, 0]), which will be removed. In 4(b), the rectangle represents the bounding region of the R-U mapping of policies in sublattice ([0, 0, 0, 1], [1, 0, 1, 1])

**Figure 5. The efficiency of search strategies on the Adult dataset as a function of number of policies searched**

**Figure 6. An empirical evaluation of the sublattice heuristic H()**

**Figure 7. Policies searched and frontier discovered through one run of the SHS algorithm**

**Figure 8. Comparison of the SHS frontier, Safe Harbor, and policies discovered by the algorithm in [6]**

**Figure 9. Examples of two policies (upper and middle) discovered by SHS that strictly dominate Safe Harbor (lower)**

**Figure 10. The search tree of the constraint satisfaction problem defined by the pruned lattice set**

See this image and copyright information in PMC

References

1. Directive 95/46/EC of the european parliament and of the council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data, 1995
1. Aggarwal G, Feder T, Kenthapadi K, Motwani R, Panigrahy R, Thomas D, Zhu A. Anonymizing tables. Proceedings of the 10th International Conference on Database Theory; 2005. pp. 246–258.
1. Arzberger P, Schroeder P, Beaulieu A, et al. Science and government. An international framework to promote access to data. Science. 2004;303(5665):1777–1778. - PubMed
1. Bayardo RJ, Agrawal R. Data privacy through optimal k-anonymization. Proceedings of the 21st International Conference on Data Engineering; 2005. pp. 217–228.
1. Belanger F, Hiller J, Smith W. Trustworthiness in electronic commerce: the role of privacy, security, and site attributes. Journal of Strategic Information Systems. 2002;11:245–270.

Grants and funding

LinkOut - more resources

Full Text Sources
- Europe PubMed Central
- PubMed Central
Other Literature Sources
- scite Smart Citations
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Efficient Discovery of De-identification Policies Through a Risk-Utility Frontier

Affiliations

Efficient Discovery of De-identification Policies Through a Risk-Utility Frontier

Authors

Affiliations

Abstract

Figures

References

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources

Miscellaneous