Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2013:2013:59-70.
doi: 10.1145/2435349.2435357.

Efficient Discovery of De-identification Policies Through a Risk-Utility Frontier

Affiliations

Efficient Discovery of De-identification Policies Through a Risk-Utility Frontier

Weiyi Xia et al. CODASPY. 2013.

Abstract

Modern information technologies enable organizations to capture large quantities of person-specific data while providing routine services. Many organizations hope, or are legally required, to share such data for secondary purposes (e.g., validation of research findings) in a de-identified manner. In previous work, it was shown de-identification policy alternatives could be modeled on a lattice, which could be searched for policies that met a prespecified risk threshold (e.g., likelihood of re-identification). However, the search was limited in several ways. First, its definition of utility was syntactic - based on the level of the lattice - and not semantic - based on the actual changes induced in the resulting data. Second, the threshold may not be known in advance. The goal of this work is to build the optimal set of policies that trade-off between privacy risk (R) and utility (U), which we refer to as a R-U frontier. To model this problem, we introduce a semantic definition of utility, based on information theory, that is compatible with the lattice representation of policies. To solve the problem, we initially build a set of policies that define a frontier. We then use a probability-guided heuristic to search the lattice for policies likely to update the frontier. To demonstrate the effectiveness of our approach, we perform an empirical analysis with the Adult dataset of the UCI Machine Learning Repository. We show that our approach can construct a frontier closer to optimal than competitive approaches by searching a smaller number of policies. In addition, we show that a frequently followed de-identification policy (i.e., the Safe Harbor standard of the HIPAA Privacy Rule) is suboptimal in comparison to the frontier discovered by our approach.

Keywords: De-identification; Experimentation; Management; Optimization; Policy; Privacy; Security.

PubMed Disclaimer

Figures

Figure 1
Figure 1. An example of a de-identification policy defined over three quasi-identifying attributes, {Age, Gender, ZIP}
Figure 2
Figure 2. An example of GLB. β is a GLB of α, while a is α GLB of γ
Figure 3
Figure 3
An example of a de-identification policy lattice with five quasi-identifying values. Rectangular nodes depict a maximal chain, while oval nodes represent a sublattice.
Figure 4
Figure 4
An example of updating the frontier in the R-U space using policies from Figure 3. The current frontier is composed of policies mapped to the stair-step curve. In 4(a), the policy mapped to the square will be added to the frontier because it dominates policies currently on the frontier (i.e., [1, 1, 0, 0] and [1, 1, 1, 0]), which will be removed. In 4(b), the rectangle represents the bounding region of the R-U mapping of policies in sublattice ([0, 0, 0, 1], [1, 0, 1, 1])
Figure 5
Figure 5. The efficiency of search strategies on the Adult dataset as a function of number of policies searched
Figure 6
Figure 6. An empirical evaluation of the sublattice heuristic H()
Figure 7
Figure 7. Policies searched and frontier discovered through one run of the SHS algorithm
Figure 8
Figure 8. Comparison of the SHS frontier, Safe Harbor, and policies discovered by the algorithm in [6]
Figure 9
Figure 9. Examples of two policies (upper and middle) discovered by SHS that strictly dominate Safe Harbor (lower)
Figure 10
Figure 10. The search tree of the constraint satisfaction problem defined by the pruned lattice set

References

    1. Directive 95/46/EC of the european parliament and of the council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data, 1995

    1. Aggarwal G, Feder T, Kenthapadi K, Motwani R, Panigrahy R, Thomas D, Zhu A. Anonymizing tables. Proceedings of the 10th International Conference on Database Theory; 2005. pp. 246–258.
    1. Arzberger P, Schroeder P, Beaulieu A, et al. Science and government. An international framework to promote access to data. Science. 2004;303(5665):1777–1778. - PubMed
    1. Bayardo RJ, Agrawal R. Data privacy through optimal k-anonymization. Proceedings of the 21st International Conference on Data Engineering; 2005. pp. 217–228.
    1. Belanger F, Hiller J, Smith W. Trustworthiness in electronic commerce: the role of privacy, security, and site attributes. Journal of Strategic Information Systems. 2002;11:245–270.

LinkOut - more resources