Nat Commun. 2025 Apr 6;16(1):3280.
doi: 10.1038/s41467-025-56989-2.

Benchmarking large language models for biomedical natural language processing applications and recommendations

Qingyu Chen et al. Nat Commun. 2025.

Abstract

The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates the process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs (GPT and LLaMA representatives) on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, and hallucinations, and perform a cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open-source LLMs still require fine-tuning to close performance gaps. We find issues like missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP.
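To make the zero-shot setting concrete, the following is a minimal sketch of prompting an LLM on a BioNLP classification benchmark and scoring its predictions against gold labels. The call_llm wrapper, the label subset, and the prompt wording are hypothetical placeholders for illustration, not the paper's exact setup.

    # Minimal sketch: zero-shot classification of biomedical abstracts,
    # scored against gold labels. `call_llm` is a hypothetical wrapper
    # around whatever chat-completion API is used.

    def call_llm(prompt: str) -> str:
        """Hypothetical wrapper around an LLM chat-completion endpoint."""
        raise NotImplementedError

    # Illustrative label subset, not the full benchmark label set.
    LABELS = ["sustaining proliferative signaling", "activating invasion and metastasis"]

    def zero_shot_classify(abstract: str) -> str:
        prompt = (
            "Classify the following biomedical abstract into one of these labels: "
            f"{', '.join(LABELS)}.\n\nAbstract: {abstract}\nLabel:"
        )
        return call_llm(prompt).strip().lower()

    def accuracy(test_set) -> float:
        """test_set: list of (abstract, gold_label) pairs."""
        correct = sum(zero_shot_classify(a) == g.lower() for a, g in test_set)
        return correct / len(test_set)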


Conflict of interest statement

Competing interests: Dr. Jingcheng Du and Dr. Hua Xu have research-related financial interests in Melax Technologies Inc. The remaining authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Dynamic K-nearest few-shot results (K = 1, 2, and 5) shown in line charts, with associated costs (dollars per 100 instances) depicted in bar charts for each benchmark.
The input and output types for each benchmark are displayed at the bottom of each subplot. Detailed methods for the few-shot and cost analyses are summarized in the Data and Methods section. Dynamic K-nearest few-shot involves selecting the K closest training instances as examples for each testing instance. Additionally, the performance of static one-shot prompting (using the same one-shot example for every testing instance) is shown as a dashed horizontal line for comparison. Detailed numerical results are also provided in Supplementary Information S2.
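The selection step described in this caption can be sketched as a nearest-neighbor retrieval over text embeddings: for each test instance, the K most similar training instances are prepended to the prompt as in-context examples. The cosine-similarity retrieval and the prompt format below are assumptions for illustration; the paper's exact embedding model and prompt template may differ.

    # Minimal sketch of dynamic K-nearest few-shot selection.
    import numpy as np

    def k_nearest_examples(test_vec, train_vecs, k):
        """Return indices of the k training embeddings closest to test_vec (cosine similarity)."""
        train_vecs = np.asarray(train_vecs, dtype=float)
        test_vec = np.asarray(test_vec, dtype=float)
        train_norm = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
        test_norm = test_vec / np.linalg.norm(test_vec)
        sims = train_norm @ test_norm          # similarity to each training instance
        return np.argsort(-sims)[:k]           # indices of the top-k most similar

    def build_few_shot_prompt(test_text, test_vec, train_texts, train_labels, train_vecs, k=5):
        """Prepend the k retrieved training examples to the test instance as in-context examples."""
        idx = k_nearest_examples(test_vec, train_vecs, k)
        shots = "\n\n".join(f"Input: {train_texts[i]}\nOutput: {train_labels[i]}" for i in idx)
        return f"{shots}\n\nInput: {test_text}\nOutput:"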
Fig. 2
Fig. 2. Qualitative evaluation results on inconsistency, missing information, and hallucinations.
A Error analysis on the named entity recognition benchmark NCBI Disease. Correct entities: the predicted entities match the gold standard in both text span and entity type; Wrong entities: the predicted entities are incorrect; Missing entities: true entities are not predicted; Boundary issues: the predicted entities are otherwise correct but have different text spans than the gold standard. B-D Qualitative evaluation on ChemProt, HoC, and MedQA, where the gold standard is a fixed classification type or multiple-choice option. Inconsistent responses: the responses are in different formats; Missingness: the responses are missing; Hallucinations: the LLM fails to address the prompt, and the output may contain repetitions and misinformation.
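The four NER error categories in panel A can be computed mechanically once predicted and gold entities are represented as (start, end, type) spans. The overlap heuristic used below for "boundary issues" (same type, overlapping but non-identical span) is an assumption for illustration, not necessarily the paper's exact definition.

    # Minimal sketch of the error categorization in panel A.
    # Entities are (start, end, type) tuples.

    def overlaps(a, b):
        """True if two character spans overlap."""
        return a[0] < b[1] and b[0] < a[1]

    def categorize(predicted, gold):
        correct = [p for p in predicted if p in gold]
        boundary = [p for p in predicted
                    if p not in gold
                    and any(overlaps(p, g) and p[2] == g[2] for g in gold)]
        wrong = [p for p in predicted if p not in correct and p not in boundary]
        missing = [g for g in gold
                   if g not in predicted
                   and not any(overlaps(g, p) and g[2] == p[2] for p in predicted)]
        return {"correct": correct, "boundary": boundary,
                "wrong": wrong, "missing": missing}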
Fig. 3
Fig. 3. Qualitative evaluation results on accuracy, completeness, and readability.
A The overall results of the fine-tuned BART, GPT-3.5 zero-shot, GPT-4 zero-shot, and LLaMA 2 zero-shot models on a scale of 1 to 5, based on 50 randomly selected testing instances from the PubMed Text Summarization dataset. B and C display the number of winning, tying, and losing cases when comparing GPT-4 zero-shot to GPT-3.5 zero-shot and GPT-4 zero-shot to the fine-tuned BART model, respectively. Table 4 provides the complementary numerical results. Detailed results, including statistical tests and examples, are provided in Supplementary Information S3.
Fig. 4
Fig. 4. Recommendations for using LLMs in BioNLP applications.
The figure presents task-specific recommendations across different settings and offers general guidance on effectively applying LLMs in BioNLP.

