The UAB Informatics Institute and 2016 CEGS N-GRID de-identification shared task challenge
- PMID: 28478268
- PMCID: PMC5670015
- DOI: 10.1016/j.jbi.2017.05.001
Abstract
Clinical narratives (the text notes found in patients' medical records) are important information sources for secondary use in research. However, to protect patient privacy, they must be de-identified prior to use. Manual de-identification is considered the gold standard approach but is tedious, expensive, slow, and impractical for large-scale clinical data. Automated or semi-automated de-identification using computer algorithms is a promising alternative. The Informatics Institute of the University of Alabama at Birmingham (UAB) is applying de-identification to clinical data drawn from the UAB hospital's electronic medical records system before releasing them for research. To gain experience developing our own automatic de-identification tool, we participated in the regular track of the de-identification shared task challenge organized by the Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-Scale and RDoC Individualized Domains (N-GRID). We focused on the popular and successful methods from previous challenges: rule-based, dictionary-matching, and machine-learning approaches. We also explored new techniques such as disambiguation rules and term ambiguity measurement, and we used a multi-pass sieve framework at a micro level. For the challenge's primary measure (strict entity), our submissions achieved competitive results (f-measures: 87.3%, 87.1%, and 86.7%). For our preferred measure (binary token HIPAA), our submissions achieved superior results (f-measures: 93.7%, 93.6%, and 93%). These encouraging results gave us the confidence to improve the tool and use it for the real de-identification task at the UAB Informatics Institute.
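The gap between the two reported measures is easier to see with a small example. The following Python sketch is illustrative only and is not the challenge's evaluation script: it assumes that a strict entity match requires identical character offsets and type, and that a binary token match counts each whitespace-delimited token inside a span while ignoring the type, which is how these measures are commonly defined in de-identification shared tasks. The sample sentence, spans, type labels, and the helper names `f_measure` and `to_tokens` are all hypothetical, and the restriction to HIPAA categories is not modeled.

```python
import re


def f_measure(gold, predicted):
    """Return (precision, recall, F1) for two collections of annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def to_tokens(text, spans):
    """Expand (start, end, type) spans into the token offsets they cover.

    The type is dropped, so a token is simply PHI or not PHI (binary).
    """
    tokens = set()
    for start, end, _type in spans:
        for match in re.finditer(r"\S+", text[start:end]):
            tokens.add((start + match.start(), start + match.end()))
    return tokens


# Hypothetical note text and annotations, for illustration only.
text = "Seen by Dr. John Smith on 03/14/2016."
gold = [(12, 22, "DOCTOR"), (26, 36, "DATE")]  # "John Smith", "03/14/2016"
pred = [(12, 16, "DOCTOR"), (26, 36, "DATE")]  # "John" only, "03/14/2016"

# Strict entity: the partially matched name counts as a miss and a false alarm.
strict = f_measure(gold, pred)

# Binary token: the system still gets credit for the token "John".
token = f_measure(to_tokens(text, gold), to_tokens(text, pred))

print("strict entity P/R/F1:", strict)
print("binary token  P/R/F1:", token)
```

In this toy case the system finds only part of the name, so the strict entity measure penalizes it twice (one missed entity, one spurious entity) while the token-level measure penalizes only the missed token. A token-level view is closer to asking how much identifying text was left unredacted, which is one plausible reason to prefer it for an operational de-identification tool.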
Keywords: Automatic de-identification; Clinical natural language processing; Machine learning; Shared task.
Copyright © 2017 Elsevier Inc. All rights reserved.
Conflict of interest statement
The authors have no conflicts of interest to declare.
