Front Res Metr Anal. 2021 Nov 19;6:689803.
doi: 10.3389/frma.2021.689803. eCollection 2021.

Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora

Nona Naderi et al. Front Res Metr Anal.

Abstract

The health and life science domains are well known for their wealth of named entities found in large free-text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods have been proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models, and ensembles of them, perform across corpora from different health and life science domains (biology, chemistry, and medicine) available in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority-voting strategies. Experiments show a statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.
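The macro F1-score reported above is the unweighted mean of per-entity-type F1 scores, so rare entity types weigh as much as frequent ones. As an illustration only (not the authors' evaluation code), a label-level version of the metric can be sketched as:

```python
from collections import Counter

def macro_f1(gold, pred):
    """Macro F1: unweighted mean of per-class F1 scores.

    `gold` and `pred` are parallel lists of class labels,
    e.g. entity types assigned to each token or mention.
    """
    classes = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct label for this class
        else:
            fp[p] += 1          # predicted class p wrongly
            fn[g] += 1          # missed gold class g
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)  # average F1 equally over classes
```

Note that published NER scores are usually computed over entity spans (exact boundary and type match) rather than individual labels; the sketch conveys only the macro-averaging step.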

Keywords: chemical patents; clinical NER; clinical text mining; deep learning; named entity recognition; patent text mining; transformers; wet lab protocols.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
An example of a patent passage of the ChEMU dataset with entity annotations. The annotations are color-coded, representing the different entities in the dataset.
FIGURE 2
An example of a clinical narrative of the DEFT dataset with entity annotations. The annotations are color-coded, representing the different entities in the dataset. Notice that some entities are nested.
FIGURE 3
An example of a wet lab protocol of the WNUT dataset with entity annotations. The annotations are color-coded, representing the different entities in the dataset.
FIGURE 4
Schematic representation of the ensemble model. Individual models are fine-tuned on task-specific data and then classify tokens independently; their predictions are combined by majority voting.
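The voting step in Figure 4 can be sketched as follows. This is an illustration only, not the authors' code: it assumes each fine-tuned model has already emitted one label per token, all aligned to the same tokenization, and it breaks ties by the order in which labels first appear (a simplifying assumption the paper does not specify).

```python
from collections import Counter

def majority_vote(per_model_labels):
    """Combine token-level predictions from several fine-tuned models.

    `per_model_labels` is a list of label sequences, one per model,
    all aligned to the same tokens (e.g. BIO tags for NER).
    """
    ensembled = []
    # Walk the token positions; at each one, count the models' votes.
    for token_votes in zip(*per_model_labels):
        counts = Counter(token_votes)
        # most_common is stable, so ties fall to the first-seen label.
        ensembled.append(counts.most_common(1)[0][0])
    return ensembled

# Hypothetical example: three models voting on two tokens.
votes = [["B-CHEM", "O"],
         ["B-CHEM", "B-CHEM"],
         ["O", "O"]]
print(majority_vote(votes))  # → ['B-CHEM', 'O']
```

A per-token vote like this can produce inconsistent BIO sequences (e.g. an I- tag without a preceding B- tag), so a real pipeline would typically add a repair pass over the ensembled labels.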
FIGURE 5
(A) Performance of the BERT model vs. the ensemble model as a function of entity frequency in the training data. (B) Performance of the BERT model vs. the ensemble model as a function of entity length in the training data. In both panels, the individual BERT baseline is BERT-base-cased for ChEMU, BERT-base-multilingual-cased for DEFT, and Bio + Clinical BERT for WNUT.
FIGURE 6
The number of labels assigned to each passage for the training set of the three datasets (ChEMU, DEFT, and WNUT).
