Text de-identification for privacy protection: a study of its impact on clinical text information content
- PMID: 24502938
- DOI: 10.1016/j.jbi.2014.01.011
Abstract
As electronic clinical information becomes easier to access for secondary uses such as clinical research, approaches that enable faster and more collaborative research while protecting patient privacy and confidentiality are becoming more important. Clinical text de-identification offers such advantages but is typically a tedious manual process. Automated Natural Language Processing (NLP) methods can ease this process, but their impact on subsequent uses of the automatically de-identified clinical narratives has barely been investigated. In the context of a larger project to develop and investigate automated text de-identification for Veterans Health Administration (VHA) clinical notes, we studied the impact of automated text de-identification on clinical information in a stepwise manner. Our approach started with a high-level assessment of clinical note informativeness and formatting, and ended with a detailed study of the overlap of select clinical information types and Protected Health Information (PHI). To investigate the informativeness (i.e., document type information, select clinical data types, and interpretation or conclusion) of VHA clinical notes, we used five different existing text de-identification systems. The informativeness was only minimally altered by these systems, while formatting was modified by only one system. To examine the impact of de-identification on clinical information extraction, we compared counts of SNOMED-CT concepts found by an open source information extraction application in the original (i.e., not de-identified) version of a corpus of VHA clinical notes, and in the same corpus after de-identification. Only about 1.2-3% fewer SNOMED-CT concepts were found in de-identified versions of our corpus, and many of these concepts were PHI that was erroneously identified as clinical information.
To study this impact in more detail and assess how generalizable our findings were, we examined the overlap between select clinical information annotated in the 2010 i2b2 NLP challenge corpus and automatic PHI annotations from our best-of-breed VHA clinical text de-identification system (nicknamed 'BoB'). Overall, only 0.81% of the clinical information exactly overlapped with PHI, and 1.78% partly overlapped. We conclude that automated text de-identification's impact on clinical information is small, but not negligible, and that improved disambiguation of clinical acronyms and eponyms could significantly reduce this impact.
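The exact versus partial overlap measure described above can be illustrated with a minimal sketch. The abstract does not specify the matching rules, so this assumes annotations are half-open character-offset spans, that "exact" means identical boundaries, and that "partial" means any other intersection. The names `Span`, `classify_overlap`, and `overlap_rates` are hypothetical and are not taken from the paper or from BoB.

```python
from typing import Dict, NamedTuple, Sequence


class Span(NamedTuple):
    """A text annotation as half-open character offsets [start, end). Assumed format."""
    start: int
    end: int


def classify_overlap(concept: Span, phi_spans: Sequence[Span]) -> str:
    """Classify one clinical concept span against all PHI spans.

    Returns 'exact' if some PHI span has identical boundaries,
    'partial' if some PHI span merely intersects it, otherwise 'none'.
    """
    result = "none"
    for phi in phi_spans:
        if concept == phi:
            return "exact"
        # Two half-open intervals intersect when each starts before the other ends.
        if concept.start < phi.end and phi.start < concept.end:
            result = "partial"
    return result


def overlap_rates(concepts: Sequence[Span], phi_spans: Sequence[Span]) -> Dict[str, float]:
    """Percentage of clinical concept spans in each overlap class."""
    counts = {"exact": 0, "partial": 0, "none": 0}
    for concept in concepts:
        counts[classify_overlap(concept, phi_spans)] += 1
    total = len(concepts)
    return {k: (100.0 * v / total if total else 0.0) for k, v in counts.items()}


# Toy data, invented purely for illustration (not from the i2b2 corpus).
toy_concepts = [Span(0, 5), Span(10, 20), Span(30, 40), Span(50, 60)]
toy_phi = [Span(0, 5), Span(15, 18)]
toy_rates = overlap_rates(toy_concepts, toy_phi)
```

In this toy input, one of four concepts matches a PHI span exactly and one intersects a PHI span without matching it. A clinical concept that a de-identification system would redact entirely falls in the "exact" class, while the "partial" class covers concepts that would be only partly masked.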
Keywords: Confidentiality, patient data privacy; De-identification, Anonymization, Electronic health records; Medical informatics; Natural Language Processing; United States Department of Veterans Affairs.
Copyright © 2014 Elsevier Inc. All rights reserved.