J Am Med Inform Assoc. 2020 Jul 1;27(9):1374-1382.

doi: 10.1093/jamia/ocaa095.

Resilience of clinical text de-identified with "hiding in plain sight" to hostile reidentification attacks by human readers

David S Carrell¹, Bradley A Malin², David J Cronkite¹, John S Aberdeen³, Cheryl Clark³, Muqun Rachel Li⁴, Dikshya Bastakoty², Steve Nyemba², Lynette Hirschman³

Affiliations

¹ Kaiser Permanente Washington Health Research Institute, Seattle, Washington, USA.
² Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, USA.
³ Human Language Technology, MITRE Corporation, Bedford, Massachusetts, USA.
⁴ Privacy Analytics Inc, Nashville, Tennessee, USA.

PMID: 32930712
PMCID: PMC7647331
DOI: 10.1093/jamia/ocaa095

David S Carrell et al. J Am Med Inform Assoc. 2020.

Abstract

Objective: Effective, scalable de-identification of personally identifying information (PII) for information-rich clinical text is critical to support secondary use, but no method is 100% effective. The hiding-in-plain-sight (HIPS) approach attempts to solve this "residual PII problem." HIPS replaces PII tagged by a de-identification system with realistic but fictitious (resynthesized) content, making it harder to detect remaining unredacted PII.
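The resynthesis idea described in the Objective can be illustrated with a minimal sketch. This is not the authors' system: the surrogate pools, the `span_of` helper, and the assumed tagger output are all hypothetical, chosen only to show how PII that the tagger misses ends up sitting among fictitious surrogates.

```python
import random

# Hypothetical surrogate pools; a real system would draw on far larger resources.
SURROGATES = {
    "NAME": ["Alice Moreno", "Peter Lang", "Dana Whitfield"],
    "DATE": ["03/14/2011", "11/02/2009", "07/21/2013"],
}

def span_of(text, phrase, pii_type):
    """Build a (start, end, pii_type) span for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), pii_type)

def hips_resynthesize(text, tagged_spans, rng):
    """Replace every tagged span with a fictitious surrogate of the same type.

    Text outside the tagged spans, including any PII the tagger missed,
    is copied through unchanged.
    """
    pieces = []
    cursor = 0
    for start, end, pii_type in sorted(tagged_spans):
        pieces.append(text[cursor:start])
        pieces.append(rng.choice(SURROGATES[pii_type]))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)

note = "John Smith seen 01/05/2012; follow-up with Mary Jones on 02/09/2012."

# Assumed tagger output: three items found, "Mary Jones" overlooked.
spans = [
    span_of(note, "John Smith", "NAME"),
    span_of(note, "01/05/2012", "DATE"),
    span_of(note, "02/09/2012", "DATE"),
]

result = hips_resynthesize(note, spans, random.Random(0))
print(result)
```

In the output, the overlooked name "Mary Jones" appears next to a surrogate name and surrogate dates, so a reader cannot tell from the text alone which identifiers are real.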

Materials and methods: Using 2000 representative clinical documents from 2 healthcare settings (4000 total), we used a novel method to generate 2 de-identified 100-document corpora (200 documents total) in which PII tagged by a typical automated machine-learned tagger was replaced by HIPS-resynthesized content. Four readers conducted aggressive reidentification attacks to isolate leaked PII: 2 readers from within the originating institution and 2 external readers.

Results: Overall, mean recall of leaked PII was 26.8% and mean precision was 37.2%. Mean recall was 9% (mean precision = 37%) for patient ages, 32% (mean precision = 26%) for dates, 25% (mean precision = 37%) for doctor names, 45% (mean precision = 55%) for organization names, and 23% (mean precision = 57%) for patient names. Recall was 32% (precision = 40%) for internal and 22% (precision = 33%) for external readers.
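For readers less familiar with these metrics, the following sketch shows how precision, recall, and F1 could be computed for one reader's attack on one corpus. The item identifiers and counts are invented for illustration and are not study data; the function name `leak_detection_scores` is likewise an assumption.

```python
def leak_detection_scores(flagged, leaked):
    """Return (precision, recall, f1) for a reader's set of flagged items.

    flagged: items the reader believes are real (leaked) PII
    leaked:  items that actually are leaked PII
    """
    true_positives = len(flagged & leaked)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(leaked) if leaked else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical toy data: four true leaks, three guesses, one correct guess.
leaked = {"doc1:19570319", "doc2:Mary Jones", "doc3:03/02/2014", "doc4:Dr. Patel"}
flagged = {"doc1:19570319", "doc2:Alice Moreno", "doc5:11/02/2009"}

precision, recall, f1 = leak_detection_scores(flagged, leaked)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Low recall means most leaked PII goes unflagged; low precision means most of the reader's guesses are actually surrogates.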

Discussion and conclusions: Approximately 70% of leaked PII "hiding" in a corpus de-identified with HIPS resynthesis is resilient to detection by human readers in a realistic, aggressive reidentification attack scenario, more than double the rate reported in previous studies but less than the rate reported for an attack assisted by machine learning methods.
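As a rough check, and assuming the resilience figure is read as the complement of the overall mean recall reported in the Results, the arithmetic is:

$$
1 - \text{recall} = 1 - 0.268 = 0.732 \approx 70\%
$$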

Keywords: biomedical research; confidentiality; de-identification; electronic health records; natural language processing; privacy.

Figures

**Figure 1.**
An illustration of personally identifiable information (PII) in clinical text (A) in its original form, (B) with traditional redaction of PII tagged by a de-identification system that overlooked the identifier “19570319,” and (C) with HIPS resynthesis of PII tagged by the de-identification system, allowing the overlooked identifier “19570319” to hide in plain sight.

**Figure 2.**
Process for creating 1000 candidate hiding in plain sight (HIPS) resynthesized release corpora based on 2000 gold standard (GS) annotated documents at each study site. PII: personally identifying information.

**Figure 3.**
Scatterplot of precision and recall by personally identifying information (PII) type, reader, and corpus in experiment stage 1, without knowledge of actual leak count (inset shows results where recall = 0). KPWA: Kaiser Permanente Washington; PII: personally identifying information; VUMC: Vanderbilt University Medical Center.

**Figure 4.**
Scatterplot comparing (A) leak detection recall, (B) leak detection precision, and (C) leak detection F1 score when the actual leak count is unknown to readers (horizontal axes) vs when actual leak count is known (vertical axes) for 5 personally identifying information types, 4 readers, and 2 corpora (40 results total). Some plots may appear to have fewer than 40 points due to overlapping observations.

Cited by

Informatics impact requires effective, scalable tools and standards-based infrastructure.
Bakken S. J Am Med Inform Assoc. 2020 Jul 1;27(9):1341-1342. doi: 10.1093/jamia/ocaa187. PMID: 32989458.
