J Biomed Inform. 2017 Nov:75S:S54-S61.
doi: 10.1016/j.jbi.2017.05.001. Epub 2017 May 3.

The UAB Informatics Institute and 2016 CEGS N-GRID de-identification shared task challenge

Duy Duc An Bui et al. J Biomed Inform. 2017 Nov.

Abstract

Clinical narratives (the text notes found in patients' medical records) are important information sources for secondary use in research. However, to protect patient privacy, they must be de-identified prior to use. Manual de-identification is considered the gold standard but is tedious, expensive, slow, and impractical for large-scale clinical data. Automated or semi-automated de-identification using computer algorithms is a promising alternative. The Informatics Institute of the University of Alabama at Birmingham (UAB) is applying de-identification to clinical data drawn from the UAB hospital's electronic medical records system before releasing them for research. To gain experience developing our own automatic de-identification tool, we participated in the regular track of the de-identification shared task challenge organized by the Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-Scale and RDoC Individualized Domains (N-GRID). We focused on the popular and successful methods from previous challenges: rule-based, dictionary-matching, and machine-learning approaches. We also explored new techniques such as disambiguation rules and term ambiguity measurement, and used a multi-pass sieve framework at a micro level. For the challenge's primary measure (strict entity), our submissions achieved competitive results (f-measures: 87.3%, 87.1%, and 86.7%). For our preferred measure (binary token HIPAA), our submissions achieved superior results (f-measures: 93.7%, 93.6%, and 93%). These encouraging results gave us the confidence to improve the tool and use it for the real de-identification task at the UAB Informatics Institute.

Keywords: Automatic de-identification; Clinical natural language processing; Machine learning; Shared task.
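
The abstract reports f-measures under two scoring schemes, strict entity and binary token. The following is a minimal sketch of how the two could differ on the same output. It is not the challenge's official evaluation script: `Span`, `strict_entity_f`, and `binary_token_f` are hypothetical names, the spans are invented toy data, and the restriction to HIPAA categories in the authors' preferred measure is not modeled.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """A tagged span over token indices [start, end). Hypothetical type."""
    start: int
    end: int
    label: str


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def strict_entity_f(gold: List[Span], pred: List[Span]) -> float:
    """A prediction counts only if its boundaries and label match exactly."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    return f_measure(tp, len(pred_set - gold_set), len(gold_set - pred_set))


def binary_token_f(gold: List[Span], pred: List[Span]) -> float:
    """Each token is scored as identifying or not, ignoring the label."""
    gold_tokens = {i for s in gold for i in range(s.start, s.end)}
    pred_tokens = {i for s in pred for i in range(s.start, s.end)}
    tp = len(gold_tokens & pred_tokens)
    return f_measure(tp, len(pred_tokens - gold_tokens), len(gold_tokens - pred_tokens))


# Toy data (assumed): the system finds only the first token of a two-token name.
gold = [Span(0, 2, "NAME"), Span(5, 6, "DATE")]
pred = [Span(0, 1, "NAME"), Span(5, 6, "DATE")]

print(round(strict_entity_f(gold, pred), 3))  # 0.5
print(round(binary_token_f(gold, pred), 3))   # 0.8
```

In this toy case the partially matched name counts as both a false positive and a false negative under strict entity scoring, but as one correct token and one missed token under binary token scoring, which is one reason token-level scores tend to be higher.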

Conflict of interest statement

The authors have no conflicts of interest to declare.

Figures

Figure 1. An overview of the system architecture.
Figure 2. Pseudocode to disambiguate COUNTRY terms.
