Evaluating the state-of-the-art in automatic de-identification
- PMID: 17600094
- PMCID: PMC1975792
- DOI: 10.1197/jamia.M2444
Abstract
To facilitate and survey studies in automatic de-identification, the authors organized a Natural Language Processing (NLP) challenge, as part of the i2b2 (Informatics for Integrating Biology to the Bedside) project, on automatically removing private health information (PHI) from medical discharge records. This manuscript provides an overview of this de-identification challenge, describes the data and the annotation process, explains the evaluation metrics, discusses the nature of the systems that addressed the challenge, analyzes the results of received system runs, and identifies directions for future research. The de-identification challenge data consisted of discharge summaries drawn from the Partners Healthcare system. The authors prepared this data for the challenge by replacing authentic PHI with synthesized surrogates. To focus the challenge on non-dictionary-based de-identification methods, the data was enriched with out-of-vocabulary PHI surrogates, i.e., made-up names. The data also included some PHI surrogates that were ambiguous with medical non-PHI terms. A total of seven teams participated in the challenge. Each team submitted up to three system runs, for a total of sixteen submissions. The authors used precision, recall, and F-measure to evaluate the submitted system runs at both the token level and the instance level against the ground truth. The systems with the best performance scored above 98% in F-measure for all categories of PHI. Most out-of-vocabulary PHI could be identified accurately. However, identifying ambiguous PHI proved challenging. The performance of systems on the test data set is encouraging. Future evaluations of these systems will involve larger data sets from more heterogeneous sources.
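The abstract names precision, recall, and F-measure at the token level and the instance level but does not spell out how they are computed. The following is a minimal sketch under stated assumptions, not the challenge's official scorer: each token carries one label, the label "O" marks non-PHI, an instance is a maximal run of consecutive tokens with the same PHI label, and an instance counts as correct only on an exact boundary and category match. The function names, the label names, and the toy data are all hypothetical.

```python
from typing import List, Set, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and balanced F-measure from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def token_level(gold: List[str], pred: List[str], non_phi: str = "O"):
    """Score each token independently (assumed convention).

    A token predicted with the wrong PHI category counts as both a
    false positive and a false negative.
    """
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g != non_phi and g == p)
    fp = sum(1 for g, p in pairs if p != non_phi and g != p)
    fn = sum(1 for g, p in pairs if g != non_phi and g != p)
    return prf(tp, fp, fn)


def spans(labels: List[str], non_phi: str = "O") -> Set[Tuple[int, int, str]]:
    """Collect (start, end, category) for each run of identical PHI labels.

    Limitation of this sketch: two adjacent instances of the same
    category are merged into one span.
    """
    found = set()
    start = None
    # The trailing sentinel closes a span that reaches the last token.
    for i, label in enumerate(labels + [non_phi]):
        if start is not None and label != labels[start]:
            found.add((start, i, labels[start]))
            start = None
        if start is None and label != non_phi:
            start = i
    return found


def instance_level(gold: List[str], pred: List[str], non_phi: str = "O"):
    """Score whole PHI instances, requiring exact span and category match."""
    g, p = spans(gold, non_phi), spans(pred, non_phi)
    return prf(len(g & p), len(p - g), len(g - p))


# Toy example with hypothetical category names.
gold = ["O", "PATIENT", "PATIENT", "O", "DATE", "O", "HOSPITAL"]
pred = ["O", "PATIENT", "O", "O", "DATE", "O", "O"]

print("token-level   :", token_level(gold, pred))
print("instance-level:", instance_level(gold, pred))
```

In this toy case the system finds only half of a two-token name. The token-level score credits the token it did find, while the instance-level score treats the truncated name as a miss and a spurious instance, which illustrates why the two levels can diverge for the same output.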