Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora
- PMID: 34870074
- PMCID: PMC8640190
- DOI: 10.3389/frma.2021.689803
Abstract
The health and life science domains are well known for their wealth of named entities found in large free text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models and ensembles of them perform across corpora from different health and life science domains (biology, chemistry, and medicine) and in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show a statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.
Keywords: chemical patents; clinical NER; clinical text mining; deep learning; named entity recognition; patent text mining; transformers; wet lab protocols.
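As an illustration of the majority voting step mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that each fine-tuned model emits one label per token over the same tokenization. The function name `majority_vote`, the `B-CHEM`/`I-CHEM`/`O` label set, and the rule of falling back to `O` on a tie are all hypothetical choices made for this example.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], fallback: str = "O") -> List[str]:
    """Combine per-token labels from several models by majority vote.

    `predictions` holds one label sequence per model; all sequences must
    cover the same tokens. The `fallback` label used on ties is an
    assumption of this sketch, not something stated in the abstract.
    """
    if not predictions:
        return []
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    ensembled = []
    # zip(*predictions) yields, for each token, the labels voted by every model.
    for token_labels in zip(*predictions):
        counts = Counter(token_labels).most_common()
        top_label, top_count = counts[0]
        # A tie exists when the runner-up label has as many votes as the winner.
        tied = len(counts) > 1 and counts[1][1] == top_count
        ensembled.append(fallback if tied else top_label)
    return ensembled


# Hypothetical outputs of three fine-tuned models for a four-token sentence.
model_outputs = [
    ["B-CHEM", "I-CHEM", "O", "O"],
    ["B-CHEM", "O", "O", "B-CHEM"],
    ["B-CHEM", "I-CHEM", "O", "O"],
]
print(majority_vote(model_outputs))
```

With an odd number of models and a small label set, ties are rare but still possible, so a real system would need an explicit tie-breaking rule; the fallback to `O` here is only one option.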
Copyright © 2021 Naderi, Knafou, Copara, Ruch and Teodoro.
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
