J Biomed Inform. 2017 Nov:75S:S34-S42.
doi: 10.1016/j.jbi.2017.05.023. Epub 2017 Jun 1.

De-identification of clinical notes via recurrent neural network and conditional random field

Zengjian Liu et al. J Biomed Inform. 2017 Nov.

Abstract

De-identification, the detection and removal of identifying information such as protected health information (PHI) from clinical data, is a critical step that enables data to be shared or published. The 2016 Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-scale and RDOC Individualized Domains (N-GRID) clinical natural language processing (NLP) challenge includes a track (track 1) on de-identifying electronic medical records (EMRs). The challenge organizers provide 1000 annotated mental health records for this track, 600 of which are used as a training set and 400 as a test set. We develop a hybrid system for the de-identification task on the training set. First, four individual subsystems are used to identify PHI instances: a subsystem based on bidirectional LSTM (long short-term memory, a variant of recurrent neural network), a subsystem based on bidirectional LSTM with features, a subsystem based on conditional random field (CRF), and a rule-based subsystem. Then, an ensemble learning-based classifier is deployed to combine all PHI instances predicted by the three machine learning-based subsystems. Finally, the results of the ensemble learning-based classifier and the rule-based subsystem are merged. Experiments conducted on the official test set show that our system achieves the highest micro F1-scores of 93.07%, 91.43% and 95.23% under the "token", "strict" and "binary token" criteria respectively, ranking first in the 2016 CEGS N-GRID NLP challenge. In addition, on the dataset of the 2014 i2b2 NLP challenge, our system achieves the highest micro F1-scores of 96.98%, 95.11% and 98.28% under the "token", "strict" and "binary token" criteria respectively, outperforming other state-of-the-art systems. These experiments demonstrate the effectiveness of our proposed method.
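The three evaluation criteria named in the abstract can be illustrated with a small sketch. It is not the challenge's official scorer; it assumes the usual reading of the criteria, in which "strict" requires an exact span and type match, "token" scores each token with its PHI type, and "binary token" scores each token as PHI or non-PHI regardless of type. The span layout `(start, end, phi_type)`, the function names, and the example sentence are all hypothetical.

```python
from collections import Counter

# Assumed layout for this sketch: a PHI span is (start_char, end_char, phi_type),
# and a token is (start_char, end_char). Neither comes from the paper.

def micro_f1(gold, pred):
    """Micro-averaged F1: true positives are pooled over all items and types."""
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    tp = sum((gold_counts & pred_counts).values())  # items present in both
    if tp == 0:
        return 0.0
    precision = tp / sum(pred_counts.values())
    recall = tp / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)

def to_tokens(spans, tokens, keep_type=True):
    """Project spans onto the tokens they overlap, one item per covered token."""
    return [
        (ts, te, typ if keep_type else "PHI")
        for (s, e, typ) in spans
        for (ts, te) in tokens
        if ts < e and te > s  # token and span overlap
    ]

def strict_f1(gold, pred):
    return micro_f1(gold, pred)

def token_f1(gold, pred, tokens):
    return micro_f1(to_tokens(gold, tokens), to_tokens(pred, tokens))

def binary_token_f1(gold, pred, tokens):
    return micro_f1(to_tokens(gold, tokens, keep_type=False),
                    to_tokens(pred, tokens, keep_type=False))

# Made-up example text: "Dr. John Smith visited Boston"
tokens = [(0, 3), (4, 8), (9, 14), (15, 22), (23, 29)]
gold = [(4, 14, "DOCTOR"), (23, 29, "CITY")]
pred = [(9, 14, "DOCTOR"), (23, 29, "HOSPITAL")]  # partial span, wrong type

strict = strict_f1(gold, pred)
token = token_f1(gold, pred, tokens)
binary = binary_token_f1(gold, pred, tokens)
print(f"strict={strict:.2f} token={token:.2f} binary={binary:.2f}")
```

On this toy input the strict score is the lowest and the binary token score the highest, the same ordering as the reported results (91.43% strict, 93.07% token, 95.23% binary token). To score a whole corpus, a document identifier would have to be added to each item before pooling.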

Keywords: De-identification; Ensemble system; Natural language processing; Protected health information; Recurrent neural network.

Figures

Figure 1. Overview architecture of our de-identification system.
Figure 2. Overview architecture of BI-LSTM.
Figure 3. Overview architecture of BI-LSTM-FEA.
Figure 4. The strict F1-scores of the CRF, BI-LSTM, BI-LSTM-FEA and Ensemble models for each main PHI category, where CONT., LOC. and PROF. represent the CONTACT, LOCATION and PROFESSION categories respectively. The LOCATION and PROFESSION categories are displayed in a separate subgraph because their F1-scores are much lower than the others. The CONTACT and ID categories on the 2016 N-GRID corpus are not shown because they are predicted by the same rules in all of the above methods; their F1-scores are 92.00% and 65.45% respectively.
