Statistical biases due to anonymization evaluated in an open clinical dataset from COVID-19 patients

Carolin E M Koll et al. Sci Data.

Abstract

Anonymization has the potential to foster the sharing of medical data. State-of-the-art methods use mathematical models to modify data to reduce privacy risks. However, the degree of protection must be balanced against the impact on statistical properties. We studied an extreme case of this trade-off: the statistical validity of an open medical dataset based on the German National Pandemic Cohort Network (NAPKON), which was prepared for publication using a strong anonymization procedure. Descriptive statistics and results of regression analyses were compared before and after anonymization of multiple variants of the original dataset. Despite significant differences in value distributions, the statistical bias was found to be small in all cases. In the regression analyses, the median absolute deviations of the estimated adjusted odds ratios for different sample sizes ranged from 0.01 [minimum = 0, maximum = 0.58] to 0.52 [minimum = 0.25, maximum = 0.91]. Disproportionate impact on the statistical properties of data is a common argument against the use of anonymization. Our analysis demonstrates that anonymization can actually preserve validity of statistical results in relatively low-dimensional data.
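The deviation figures quoted above can be illustrated with a minimal sketch. It assumes one plausible reading of the abstract: for each covariate, the absolute difference between the adjusted odds ratio estimated on the original data and on the anonymized data is taken, and the median, minimum, and maximum of those differences are reported. The function name, covariate names, and numbers below are hypothetical and are not taken from the paper.

```python
from statistics import median


def summarize_or_deviation(or_original, or_anonymized):
    """Summarize |OR_original - OR_anonymized| over covariates present in both fits.

    Both arguments map a covariate name to its estimated adjusted odds ratio.
    This is an illustrative reading of the abstract, not the authors' code.
    """
    shared = sorted(set(or_original) & set(or_anonymized))
    if not shared:
        raise ValueError("no covariates in common")
    deviations = [abs(or_original[name] - or_anonymized[name]) for name in shared]
    return {
        "median": median(deviations),
        "min": min(deviations),
        "max": max(deviations),
    }


# Made-up odds ratios for illustration only (assumption, not study data).
or_before = {"age_60_plus": 2.0, "male": 1.5, "obesity": 1.25}
or_after = {"age_60_plus": 2.5, "male": 1.5, "obesity": 1.0}

summary = summarize_or_deviation(or_before, or_after)
print(summary)
```

Under this reading, a summary such as "0.01 [minimum = 0, maximum = 0.58]" would correspond to one regression model at one sample size.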

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fraction of cases published for (a) the complete NAPKON dataset and (b) the High-Resolution Platform (HAP), the Population-based Platform (POP), and the Cross-Sectoral Platform (SUEP).
Fig. 2
Comparison of demographic parameters of patients for the original dataset (n = 4,562) and the anonymized dataset (PUF; n = 3,904). The proportions are given in percentage. Note: The percentages in the PUF may be larger if the number of censored cases is unbalanced. (a) Age distribution in years, (b) gender distribution, (c) distribution of quarter and year of first positive SARS-CoV-2 test, and (d) distribution of the disease severity in the course of disease. WHO = World Health Organization.
Fig. 3
Comparison of demographic parameters of patients for the original dataset (n = 4,562) and the anonymized dataset (PUF; n = 3,904). (a) Age distribution in years, (b) gender distribution, (c) distribution of quarter and year of first positive SARS-CoV-2 test, and (d) distribution of the disease severity in the course of disease. HAP = High-Resolution Platform; POP = Population-based Platform; SUEP = Cross-Sectoral Platform; WHO = World Health Organization.
Fig. 4
Comparison of patient status at end of acute phase before and after anonymization (anonymized dataset = PUF). The proportions are given in percentage. Note: The percentages in the PUF may be larger if the number of censored cases is unbalanced. (a) Distribution for the original dataset (n = 4,562) and the resulting anonymized dataset (n = 3,904). (b) Case fatality rates (patient status dead) computed for the Cross-Sectoral Platform (SUEP) and High-Resolution Platform (HAP) cohorts over different sizes of the original dataset. In the plot, the size of the original dataset is adjusted by the number of HAP and SUEP patients. Note that the Population-based Platform (POP) recruited only patients who survived SARS-CoV-2 infection.
Fig. 5
Comparison of patient status at end of acute phase before and after anonymization. Distribution for the original dataset containing n = 4,562 and resulting anonymized dataset (PUF, n = 3,904). HAP = High-Resolution Platform; POP = Population-based Platform; SUEP = Cross-Sectoral Platform.
Fig. 6
Case fatality rate for the High-Resolution Platform (HAP). Anonymized dataset = PUF.
Fig. 7
Case fatality rate for the Cross-Sectoral Platform (SUEP). Anonymized dataset = PUF.
Fig. 8
Odds ratios (OR) and 95% confidence intervals (CI) of patient characteristics and outcomes in the dataset before and after anonymization for different sizes of the original dataset (anonymized dataset = PUF). In the graphs, the number of records in the original dataset was adjusted according to the number of cases included in the regression analysis, excluding missing data. Note that datasets shown without ORs and CIs do not contain the relevant information for the respective regression model. (a) Inpatient cases from the Cross-Sectoral Platform (SUEP). (b) Cases from the High-Resolution Platform (HAP) and inpatient SUEP cases aged between 49 and 59 years that survived the acute phase of COVID-19 (ambulant or discharged). (c) Cases from the Population-based Platform (POP). (d) Cases from the High-Resolution Platform (HAP).
Fig. 9
Re-identification risks based on the uniqueness of k-variables before and after anonymization for different sizes of the original dataset (anonymized = PUF).
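The idea of measuring risk through uniqueness on k variables can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: a record counts as unique if no other record shares its values on a chosen set of k variables, and the risk for a given k is taken here as the largest fraction of unique records over all k-variable combinations. The function names, variable names, and toy records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def uniqueness(records, variables):
    """Fraction of records whose value combination on `variables` occurs exactly once."""
    if not records:
        raise ValueError("records must not be empty")
    keys = [tuple(record[v] for v in variables) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


def max_uniqueness_for_k(records, all_variables, k):
    """Worst case over all k-variable subsets (an assumed aggregation, for illustration)."""
    return max(
        uniqueness(records, combo) for combo in combinations(all_variables, k)
    )


# Toy records (assumption, not study data).
variables = ["age_group", "gender", "quarter"]
records = [
    {"age_group": "40-49", "gender": "f", "quarter": "2020Q4"},
    {"age_group": "40-49", "gender": "f", "quarter": "2021Q1"},
    {"age_group": "50-59", "gender": "m", "quarter": "2021Q1"},
    {"age_group": "50-59", "gender": "m", "quarter": "2021Q1"},
]

for k in range(1, len(variables) + 1):
    print(k, max_uniqueness_for_k(records, variables, k))
```

Running the same function on a dataset before and after anonymization would give the kind of before/after comparison the caption describes, with lower values indicating fewer records that stand out.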
