Ensemble pretrained language models to extract biomedical knowledge from literature

Zhao Li et al. J Am Med Inform Assoc. 2024 Sep 1;31(9):1904-1911. doi: 10.1093/jamia/ocae061.

Abstract

Objectives: The rapid expansion of biomedical literature necessitates automated techniques to discern relationships between biomedical concepts from extensive free text. Such techniques facilitate the development of detailed knowledge bases and highlight research deficiencies. The LitCoin Natural Language Processing (NLP) challenge, organized by the National Center for Advancing Translational Science, aims to evaluate such potential and provides a manually annotated corpus for methodology development and benchmarking.

Materials and methods: For the named entity recognition (NER) task, we utilized ensemble learning to merge predictions from three domain-specific models, namely BioBERT, PubMedBERT, and BioM-ELECTRA; devised a rule-driven detection method for cell line and taxonomy names; and annotated 70 additional abstracts to expand the corpus. We further finetuned the T0pp model, which has 11 billion parameters, to boost performance on relation extraction, and leveraged entities' location information (eg, title, background) to enhance novelty prediction performance in relation extraction (RE).
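As a rough illustration of the rule-driven detection the methods describe, a minimal sketch might match candidate cell line and taxonomy names against small dictionaries. The dictionary entries, type labels, and matching logic below are illustrative assumptions, not the authors' actual rules.

```python
# Hypothetical dictionary-plus-pattern detector for cell line and
# organism (taxonomy) mentions; entries are assumptions for illustration.
import re

CELL_LINE_DICT = {"HeLa", "HEK293", "MCF-7"}
TAXONOMY_DICT = {"Escherichia coli", "Mus musculus", "Homo sapiens"}


def rule_based_mentions(text):
    """Return sorted (start, end, type) spans found by dictionary lookup."""
    mentions = []
    for name in CELL_LINE_DICT:
        for m in re.finditer(re.escape(name), text):
            mentions.append((m.start(), m.end(), "CellLine"))
    for name in TAXONOMY_DICT:
        for m in re.finditer(re.escape(name), text):
            mentions.append((m.start(), m.end(), "Organism"))
    return sorted(mentions)


text = "HeLa cells were infected with Escherichia coli."
print(rule_based_mentions(text))
```

Such rules complement the neural models for entity types (like cell lines) that have highly regular surface forms but limited training examples.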

Results: Our NLP system designed for this challenge secured first place in Phase I (NER) and second place in Phase II (relation extraction and novelty prediction), outpacing over 200 teams. We tested OpenAI ChatGPT 3.5 and ChatGPT 4 in a zero-shot setting on the same test set, revealing that our finetuned model considerably surpasses these general-purpose large language models.

Discussion and conclusion: Our outcomes depict a robust NLP system excelling in NER and RE across various biomedical entities, emphasizing that task-specific models remain superior to generic large ones. Such insights are valuable for endeavors like knowledge graph development and hypothesis formulation in biomedical research.

Keywords: ensemble learning; knowledge base; large language model; named entity recognition; relation extraction.


Conflict of interest statement

None declared.

Figures

Figure 1.
(A) An annotated abstract with six entity types. (B) The number of mentions for each entity type in 400 abstracts for model development.
Figure 2.
The NLP system for the Phase I NER task. (A) The overall architecture of the system. Ensemble learning first combines the identified mentions at the context and fold levels, and then extracts the overlapping predictions across different models as the final prediction. S_i is the set of identified mentions from the model in each context, fold, or pretrained language model. C is the number of contexts for each sentence in the abstract, F is the number of folds, and M is the number of pretrained language models we used in this study. (B) An illustration of the cross-sentence scheme for input sequence generation. (C) An illustration of the ensemble learning at the context, fold, and model levels.
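The final ensemble step described in the caption, keeping only predictions that overlap across models, can be sketched as a set intersection over mention spans. The span tuple format and the strict-agreement rule are assumptions for illustration; the paper's system also combines predictions at the context and fold levels first.

```python
# Hypothetical sketch of model-level ensembling: each model (eg, BioBERT,
# PubMedBERT, BioM-ELECTRA) produces a set of predicted mention spans;
# the final prediction keeps only spans every model agrees on.

def ensemble_mentions(model_predictions):
    """Intersect mention sets from several models.

    model_predictions: list of sets of (start, end, entity_type) tuples.
    Returns the spans predicted by every model.
    """
    if not model_predictions:
        return set()
    final = set(model_predictions[0])
    for preds in model_predictions[1:]:
        final &= preds
    return final


biobert = {(0, 12, "Gene"), (30, 41, "Disease")}
pubmedbert = {(0, 12, "Gene"), (30, 41, "Disease"), (50, 57, "Chemical")}
biom_electra = {(0, 12, "Gene"), (50, 57, "Chemical")}

print(ensemble_mentions([biobert, pubmedbert, biom_electra]))
```

Requiring agreement across all models trades recall for precision; looser rules (eg, majority voting) sit at different points on that trade-off.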
Figure 3.
An example of generating input text for T0pp training and testing using a prompt template. By injecting the meta information of each candidate pair (A) into the prompt template (B), the input text (C) is generated to finetune T0pp to generate/predict the relation type of a given candidate pair in the test set.
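The template-filling step the caption describes can be sketched as plain string formatting: the meta information of a candidate entity pair is injected into a fixed prompt template to form the model input. The template wording and field names below are assumptions for illustration, not the authors' exact prompt.

```python
# Hypothetical prompt template for relation extraction with T0pp; the
# field names and question phrasing are illustrative assumptions.
TEMPLATE = (
    "Abstract: {abstract}\n"
    "What is the relation between {entity1} ({type1}) "
    "and {entity2} ({type2})?"
)


def build_input(meta):
    """Inject candidate-pair meta information into the prompt template."""
    return TEMPLATE.format(**meta)


pair = {
    "abstract": "BRCA1 mutations increase the risk of breast cancer.",
    "entity1": "BRCA1", "type1": "Gene",
    "entity2": "breast cancer", "type2": "Disease",
}
print(build_input(pair))
```

The resulting text is what the finetuned sequence-to-sequence model consumes; the relation type is produced as generated output rather than a classification logit.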
