PLoS Comput Biol. 2025 Sep 23;21(9):e1013507.
doi: 10.1371/journal.pcbi.1013507. eCollection 2025 Sep.

Ten quick tips for protecting health data using de-identification and perturbation of structured datasets

Tshikala Eddie Lulamba et al. PLoS Comput Biol. 2025.

Abstract

Structured patient data generated within the health data ecosystem are shared internally for operational use and externally for research and public health benefit. Protecting individual privacy and health data confidentiality in these contexts relies on data de-identification and anonymisation, although there are no universally accepted standards for these processes and the techniques involved can be technically complex. We present practical recommendations grounded in the principle of data minimisation: avoiding unnecessary granularity and identifying variables that could lead to re-identification when combined with other datasets. We provide practical guidance for anonymising and perturbing structured health data in ways that support compliance with data protection laws. The technical and operational methods described for reducing re-identification risk include rounding numerical values, replacing precise values with ranges, adding jitter to numeric fields, aggregating data, managing date values, and separating sensitive fields from identifying data to prevent linkage that could lead to re-identification. While some methods require advanced technical knowledge, we focus here on accessible strategies that can be implemented without specialist expertise, recognising the importance of the legal and governance frameworks in which anonymisation occurs. These guidelines support researchers, data managers and institutions in sharing health data responsibly, maintaining data utility while upholding privacy and promoting ethical and legal data stewardship for data-driven health research.
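The last method listed in the abstract, separating sensitive fields from identifying data, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `split_records`, the field names and the use of a random `link_key` are all assumptions made for illustration. The idea is that the identity table (with the keys) is stored separately under stricter access control than the research table.

```python
import secrets

def split_records(records, identifying_fields):
    """Split each record into an identity row and a research row.

    The two rows share only a random link key (an illustrative choice),
    which carries no information about the person.
    """
    identity_table = []
    research_table = []
    for record in records:
        link_key = secrets.token_hex(16)  # random, not derived from the data
        identity_row = {"link_key": link_key}
        research_row = {"link_key": link_key}
        for field, value in record.items():
            if field in identifying_fields:
                identity_row[field] = value
            else:
                research_row[field] = value
        identity_table.append(identity_row)
        research_table.append(research_row)
    return identity_table, research_table

# Artificial records; all field names are hypothetical.
records = [
    {"name": "Patient A", "national_id": "0001", "diagnosis": "asthma", "age_band": "30-39"},
    {"name": "Patient B", "national_id": "0002", "diagnosis": "diabetes", "age_band": "50-59"},
]

identity_table, research_table = split_records(records, {"name", "national_id"})
```

The research table alone contains no direct identifiers. Re-identification risk from the remaining fields still needs to be assessed separately, since combinations of non-identifying variables can single out individuals.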

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Anonymising precise numerical values.
Rounding precise values to a fixed number of decimal places or significant figures can preserve k-anonymity while retaining variable characteristics and epidemiological meaning (artificial dataset). A: Birthweights (kg) dataset with 4-decimal-place precision. B: Birthweights (kg) dataset rounded to one decimal place. C: Precise number of exercise days per year. D: Number of exercise days per year with jitter in the range −5 to +5 days.
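The two perturbations in Fig 1 can be sketched in a few lines of standard-library Python. The values below are invented, and the function names, the clamping of jittered values to 0–365 days and the fixed seed are assumptions for illustration rather than details taken from the paper.

```python
import random

def round_values(values, decimals=1):
    """Round each value to a fixed number of decimal places."""
    return [round(v, decimals) for v in values]

def add_integer_jitter(values, max_shift=5, lower=0, upper=365, seed=None):
    """Shift each value by a random integer in [-max_shift, +max_shift].

    Results are clamped to [lower, upper] so that impossible values
    (for example, negative days) are not produced. Clamping is an
    assumption of this sketch.
    """
    rng = random.Random(seed)
    jittered = []
    for v in values:
        shifted = v + rng.randint(-max_shift, max_shift)
        jittered.append(min(upper, max(lower, shifted)))
    return jittered

# Artificial example data.
birthweights_kg = [3.4127, 2.9863, 3.4471, 4.0519]
exercise_days = [0, 12, 180, 363]

rounded = round_values(birthweights_kg, decimals=1)   # [3.4, 3.0, 3.4, 4.1]
jittered = add_integer_jitter(exercise_days, max_shift=5, seed=42)
```

After rounding, two of the four birthweights share the value 3.4, which is the effect panel B illustrates: fewer unique values and larger groups. Clamping slightly distorts values near the bounds, so it may be worth checking the distribution of the jittered field before release.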
Fig 2. Checking bivariate correlation before and after perturbation.
An exhaustive bivariate correlation matrix shows that the pairwise correlations remain broadly similar after perturbation. Red shading indicates positive correlation and blue shading indicates negative correlation. The value in each cell is the correlation coefficient. A: Original dataset. B: Dataset after perturbation of multiple fields.
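A check of the kind shown in Fig 2 can be sketched as follows. The caption does not say which correlation coefficient was used; this sketch assumes Pearson's r. The variables, the synthetic data and the particular perturbations (age jitter of ±2, rounding blood pressure to the nearest 5, exercise-day jitter of ±5) are all hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return {(name_a, name_b): r} for every pair of columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Artificial dataset with known positive and negative relationships.
rng = random.Random(0)
age = [rng.randint(20, 80) for _ in range(200)]
systolic_bp = [100 + 0.5 * a + rng.gauss(0, 8) for a in age]
exercise_days = [max(0, round(200 - 1.5 * a + rng.gauss(0, 30))) for a in age]

original = {"age": age, "systolic_bp": systolic_bp, "exercise_days": exercise_days}
perturbed = {
    "age": [a + rng.randint(-2, 2) for a in age],                # jitter
    "systolic_bp": [5 * round(v / 5) for v in systolic_bp],      # round to nearest 5
    "exercise_days": [d + rng.randint(-5, 5) for d in exercise_days],  # jitter
}

before = correlation_matrix(original)
after = correlation_matrix(perturbed)

# Largest absolute change in any pairwise coefficient.
max_abs_change = max(abs(before[pair] - after[pair]) for pair in before)
```

A small `max_abs_change` suggests the perturbation has preserved the pairwise structure. What counts as acceptably small depends on the intended analysis and is not something this sketch can decide.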
Fig 3. Checking k-anonymisation before and after perturbation.
Creating categories based on numerical value ranges can increase k-anonymity (artificial dataset; x-axis = value or category, y-axis = count per value or category). A: Exact integer variables ranging from 1 to 20. B: Categorical variables derived from integer variables ranging from 1 to 20.
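The effect described in Fig 3 can be reproduced with a short sketch: replace each integer with a range label and compare the smallest group size before and after. The bin width of 5, the function names and the data are assumptions for illustration. Note that k-anonymity is normally assessed over combinations of quasi-identifiers; the single-variable count here mirrors the per-value counts plotted in the figure.

```python
from collections import Counter

def to_range(value, width=5, lowest=1):
    """Map an integer to a range label, e.g. 7 -> '6-10' for width 5."""
    start = lowest + ((value - lowest) // width) * width
    return f"{start}-{start + width - 1}"

def smallest_group(values):
    """Size of the rarest value: the k achieved for this single variable."""
    return min(Counter(values).values())

# Artificial integer variable ranging from 1 to 20.
values = [1, 2, 2, 3, 5, 6, 7, 7, 9, 10, 11, 12, 14, 15, 15, 16, 17, 18, 19, 20]

k_before = smallest_group(values)                  # some values occur only once
categories = [to_range(v) for v in values]
k_after = smallest_group(categories)               # every range holds 5 records
```

With this data, `k_before` is 1 and `k_after` is 5. Wider bins raise k further but discard more detail, so the width is a trade-off between privacy and utility.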
