Learning to identify Protected Health Information by integrating knowledge- and data-driven algorithms: A case study on psychiatric evaluation notes
- PMID: 28602908
- PMCID: PMC5705401
- DOI: 10.1016/j.jbi.2017.06.005
Abstract
De-identification of clinical narratives is one of the main obstacles to making healthcare free text available for research. In this paper we describe our experience in expanding and tailoring two existing tools as part of the 2016 CEGS N-GRID Shared Tasks Track 1, which evaluated de-identification methods on a set of psychiatric evaluation notes for up to 25 different types of Protected Health Information (PHI). The methods we used rely on machine learning over either a large or a small feature space, with additional strategies, including two-pass tagging and multi-class models, both of which proved beneficial. The results show that integrating the proposed methods can identify PHI as defined by the Health Insurance Portability and Accountability Act (HIPAA) with overall F1-scores of ∼90% and above. Yet some classes (Profession, Organization) again proved challenging, given the variability of the expressions used to refer to this information.
Keywords: Clinical text mining; De-identification; Electronic health record; Information extraction; Named entity recognition.
Copyright © 2017. Published by Elsevier Inc.
Conflict of interest statement
None.