Ensemble pretrained language models to extract biomedical knowledge from literature

Zhao Li et al. J Am Med Inform Assoc. 2024 Sep 1;31(9):1904-1911. doi: 10.1093/jamia/ocae061.

Abstract

Objectives: The rapid expansion of biomedical literature necessitates automated techniques to discern relationships between biomedical concepts from extensive free text. Such techniques facilitate the development of detailed knowledge bases and highlight research deficiencies. The LitCoin Natural Language Processing (NLP) challenge, organized by the National Center for Advancing Translational Science, aims to evaluate such potential and provides a manually annotated corpus for methodology development and benchmarking.

Materials and methods: For the named entity recognition (NER) task, we utilized ensemble learning to merge predictions from three domain-specific models, namely BioBERT, PubMedBERT, and BioM-ELECTRA; devised a rule-driven detection method for cell line and taxonomy names; and annotated 70 additional abstracts to expand the corpus. We further finetuned the T0pp model, which has 11 billion parameters, to boost performance on relation extraction, and leveraged entities' location information (eg, title, background) to enhance novelty prediction performance in relation extraction (RE).
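As a rough illustration of the rule-driven detection the methods describe, a minimal sketch might match candidate cell line and taxonomy names against small dictionaries. The dictionary entries, type labels, and matching logic below are illustrative assumptions, not the authors' actual rules.

```python
# Hypothetical dictionary-plus-pattern detector for cell line and
# organism (taxonomy) mentions; entries are assumptions for illustration.
import re

CELL_LINE_DICT = {"HeLa", "HEK293", "MCF-7"}
TAXONOMY_DICT = {"Escherichia coli", "Mus musculus", "Homo sapiens"}


def rule_based_mentions(text):
    """Return sorted (start, end, type) spans found by dictionary lookup."""
    mentions = []
    for name in CELL_LINE_DICT:
        for m in re.finditer(re.escape(name), text):
            mentions.append((m.start(), m.end(), "CellLine"))
    for name in TAXONOMY_DICT:
        for m in re.finditer(re.escape(name), text):
            mentions.append((m.start(), m.end(), "Organism"))
    return sorted(mentions)


text = "HeLa cells were infected with Escherichia coli."
print(rule_based_mentions(text))
```

Such rules complement the neural models for entity types (like cell lines) that have highly regular surface forms but limited training examples.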

Results: Our NLP system designed for this challenge secured first place in Phase I (NER) and second place in Phase II (relation extraction and novelty prediction), outpacing over 200 teams. We tested OpenAI ChatGPT 3.5 and ChatGPT 4 in a zero-shot setting on the same test set, revealing that our finetuned model considerably surpasses these general-purpose large language models.

Discussion and conclusion: Our outcomes depict a robust NLP system excelling in NER and RE across various biomedical entities, emphasizing that task-specific models remain superior to generic large ones. Such insights are valuable for endeavors like knowledge graph development and hypothesis formulation in biomedical research.

Keywords: ensemble learning; knowledge base; large language model; named entity recognition; relation extraction.


Conflict of interest statement

None declared.

Figures

Figure 1.
(A) An annotated abstract with six entity types. (B) The number of mentions for each entity type in 400 abstracts for model development.
Figure 2.
The NLP system for the Phase I NER task. (A) The overall architecture of the system. Ensemble learning first combines the identified mentions at the context and fold levels, and then extracts the overlapping predictions across different models as the final prediction. S_i is the set of identified mentions from the model in each context, fold, or pretrained language model. C is the number of contexts for each sentence in the abstract, F is the number of folds, and M is the number of pretrained language models we used in this study. (B) An illustration of the cross-sentence scheme for input sequence generation. (C) An illustration of the ensemble learning at the context, fold, and model levels.
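The final ensemble step described in the caption, keeping only predictions that overlap across models, can be sketched as a set intersection over mention spans. The span tuple format and the strict-agreement rule are assumptions for illustration; the paper's system also combines predictions at the context and fold levels first.

```python
# Hypothetical sketch of model-level ensembling: each model (eg, BioBERT,
# PubMedBERT, BioM-ELECTRA) produces a set of predicted mention spans;
# the final prediction keeps only spans every model agrees on.

def ensemble_mentions(model_predictions):
    """Intersect mention sets from several models.

    model_predictions: list of sets of (start, end, entity_type) tuples.
    Returns the spans predicted by every model.
    """
    if not model_predictions:
        return set()
    final = set(model_predictions[0])
    for preds in model_predictions[1:]:
        final &= preds
    return final


biobert = {(0, 12, "Gene"), (30, 41, "Disease")}
pubmedbert = {(0, 12, "Gene"), (30, 41, "Disease"), (50, 57, "Chemical")}
biom_electra = {(0, 12, "Gene"), (50, 57, "Chemical")}

print(ensemble_mentions([biobert, pubmedbert, biom_electra]))
```

Requiring agreement across all models trades recall for precision; looser rules (eg, majority voting) sit at different points on that trade-off.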
Figure 3.
An example of generating input text for T0pp training and testing using a prompt template. By injecting the meta information of each candidate pair (A) into the prompt template (B), the input text (C) is generated to finetune T0pp to generate/predict the relation type of a given candidate pair in the test set.
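The template-filling step the caption describes can be sketched as plain string formatting: the meta information of a candidate entity pair is injected into a fixed prompt template to form the model input. The template wording and field names below are assumptions for illustration, not the authors' exact prompt.

```python
# Hypothetical prompt template for relation extraction with T0pp; the
# field names and question phrasing are illustrative assumptions.
TEMPLATE = (
    "Abstract: {abstract}\n"
    "What is the relation between {entity1} ({type1}) "
    "and {entity2} ({type2})?"
)


def build_input(meta):
    """Inject candidate-pair meta information into the prompt template."""
    return TEMPLATE.format(**meta)


pair = {
    "abstract": "BRCA1 mutations increase the risk of breast cancer.",
    "entity1": "BRCA1", "type1": "Gene",
    "entity2": "breast cancer", "type2": "Disease",
}
print(build_input(pair))
```

The resulting text is what the finetuned sequence-to-sequence model consumes; the relation type is produced as generated output rather than a classification logit.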
