J Am Med Inform Assoc. 2020 Jul 1;27(9):1374-1382.
doi: 10.1093/jamia/ocaa095.

Resilience of clinical text de-identified with "hiding in plain sight" to hostile reidentification attacks by human readers

David S Carrell et al. J Am Med Inform Assoc.

Abstract

Objective: Effective, scalable de-identification of personally identifying information (PII) for information-rich clinical text is critical to support secondary use, but no method is 100% effective. The hiding-in-plain-sight (HIPS) approach attempts to solve this "residual PII problem." HIPS replaces PII tagged by a de-identification system with realistic but fictitious (resynthesized) content, making it harder to detect remaining unredacted PII.
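The resynthesis step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the surrogate lists, the `resynthesize` function, the span format, and the sample note (including the name "Jane Doe") are all hypothetical. Only the overlooked identifier "19570319" is taken from the Figure 1 caption.

```python
import random

# Hypothetical surrogate pools (assumption); a real system would draw from
# much larger, realistic dictionaries matched to the PII type.
SURROGATES = {
    "PATIENT_NAME": ["Maria Lopez", "John Carter", "Helen Brooks"],
    "DATE": ["03/14/2011", "11/02/2009", "07/23/2014"],
}


def resynthesize(text, tagged_spans, rng):
    """Replace each tagged (start, end, pii_type) span with a fictitious surrogate.

    Text outside the tagged spans, including any PII the tagger missed,
    is copied through unchanged.
    """
    pieces = []
    cursor = 0
    for start, end, pii_type in sorted(tagged_spans):
        pieces.append(text[cursor:start])                  # untouched text
        pieces.append(rng.choice(SURROGATES[pii_type]))    # surrogate PII
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hypothetical note: the tagger found the name and the visit date,
# but overlooked the identifier "19570319".
note = "Jane Doe seen 01/05/2012. DOB 19570319."
spans = [(0, 8, "PATIENT_NAME"), (14, 24, "DATE")]
released = resynthesize(note, spans, random.Random(0))
print(released)
```

In the released text the leaked identifier sits alongside fictitious names and dates, so a reader cannot tell from the surface form alone which values are real.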

Materials and methods: Using 2000 representative clinical documents from each of 2 healthcare settings (4000 total), we used a novel method to generate 2 de-identified 100-document corpora (200 documents total) in which PII tagged by a typical automated machine-learned tagger was replaced by HIPS-resynthesized content. Four readers conducted aggressive reidentification attacks to isolate leaked PII: 2 readers from within the originating institution and 2 external readers.

Results: Overall, mean recall of leaked PII was 26.8% and mean precision was 37.2%. Mean recall was 9% (mean precision = 37%) for patient ages, 32% (mean precision = 26%) for dates, 25% (mean precision = 37%) for doctor names, 45% (mean precision = 55%) for organization names, and 23% (mean precision = 57%) for patient names. Recall was 32% (precision = 40%) for internal and 22% (precision = 33%) for external readers.
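For readers unfamiliar with the metrics, the sketch below shows how recall, precision, and F1 score could be computed for a single reader's leak guesses, using the standard definitions. The function name, the `(document, start, end)` span representation, and the sample data are assumptions for illustration and are not taken from the study.

```python
def leak_detection_scores(guessed, actual_leaks):
    """Return (precision, recall, f1) for one reader's guesses.

    guessed:      spans the reader flagged as leaked PII
    actual_leaks: spans that really are leaked (unredacted) PII
    """
    guessed, actual_leaks = set(guessed), set(actual_leaks)
    true_pos = len(guessed & actual_leaks)
    precision = true_pos / len(guessed) if guessed else 0.0
    recall = true_pos / len(actual_leaks) if actual_leaks else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical data: 4 true leaks; the reader flags 3 spans, 1 of them correct.
actual = {("doc1", 30, 38), ("doc2", 5, 12), ("doc3", 0, 9), ("doc4", 44, 50)}
guesses = {("doc1", 30, 38), ("doc2", 40, 47), ("doc5", 3, 8)}
precision, recall, f1 = leak_detection_scores(guesses, actual)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Low recall means most leaked PII went undetected; low precision means most of the reader's guesses were actually fictitious surrogates.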

Discussion and conclusions: Approximately 70% of leaked PII "hiding" in a corpus de-identified with HIPS resynthesis is resilient to detection by human readers in a realistic, aggressive reidentification attack scenario. This is more than double the rate reported in previous studies but less than the rate reported for an attack assisted by machine learning methods.
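One way to relate the approximately 70% figure to the reported results, assuming resilience is simply the complement of the overall mean recall (an interpretation, not a formula stated in the abstract):

```latex
\[
\text{resilience} \approx 1 - \text{mean recall} = 1 - 0.268 = 0.732
\]
```

Under that assumption, roughly 73% of leaked PII went undetected, consistent with the rounded value quoted above.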

Keywords: biomedical research; confidentiality; de-identification; electronic health records; natural language processing; privacy.

Figures

Figure 1.
An illustration of personally identifiable information (PII) in clinical text (A) in its original form, (B) with traditional redaction of PII tagged by a de-identification system that overlooked the identifier “19570319,” and (C) with HIPS resynthesis of PII tagged by the de-identification system, allowing the overlooked identifier “19570319” to hide in plain sight.
Figure 2.
Process for creating 1000 candidate hiding in plain sight (HIPS) resynthesized release corpora based on 2000 gold standard (GS) annotated documents at each study site. PII: personally identifying information.
Figure 3.
Scatterplot of precision and recall by personally identifying information (PII) type, reader, and corpus in experiment stage 1, without knowledge of actual leak count (inset shows results where recall = 0). KPWA: Kaiser Permanente Washington; PII: personally identifying information; VUMC: Vanderbilt University Medical Center.
Figure 4.
Scatterplot comparing (A) leak detection recall, (B) leak detection precision, and (C) leak detection F1 score when the actual leak count is unknown to readers (horizontal axes) vs when actual leak count is known (vertical axes) for 5 personally identifying information types, 4 readers, and 2 corpora (40 results total). Some plots may appear to have fewer than 40 points due to overlapping observations.
