Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2025 Jun 1;32(6):1015-1024.
doi: 10.1093/jamia/ocaf045.

Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks

Affiliations

Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks

Felix J Dorfner et al. J Am Med Inform Assoc. .

Abstract

Objectives: Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data. However, the effectiveness of this approach remains unclear. This study aims to critically evaluate the performance of biomedically fine-tuned LLMs against their general-purpose counterparts across a range of clinical tasks.

Materials and methods: We evaluated the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on clinical case challenges from NEJM and JAMA, and on multiple clinical tasks, such as information extraction, document summarization and clinical coding. We used a diverse set of benchmarks specifically chosen to be outside the likely fine-tuning datasets of biomedical models, ensuring a fair assessment of generalization capabilities.

Results: Biomedical LLMs generally underperformed compared to general-purpose models, especially on tasks not focused on probing medical knowledge. While on the case challenges, larger biomedical and general-purpose models showed similar performance (eg, OpenBioLLM-70B: 66.4% vs Llama-3-70B-Instruct: 65% on JAMA), smaller biomedical models showed more pronounced underperformance (OpenBioLLM-8B: 30% vs Llama-3-8B-Instruct: 64.3% on NEJM). Similar trends appeared across CLUE benchmarks, with general-purpose models often achieving higher scores in text generation, question answering, and coding. Notably, biomedical LLMs also showed a higher tendency to hallucinate.

Discussion: Our findings challenge the assumption that biomedical fine-tuning inherently improves LLM performance, as general-purpose models consistently performed better on unseen medical tasks. Retrieval-augmented generation may offer a more effective strategy for clinical adaptation.

Conclusion: Fine-tuning LLMs on biomedical data may not yield the anticipated benefits. Alternative approaches, such as retrieval augmentation, should be further explored for effective and reliable clinical integration of LLMs.

Keywords: benchmarking; biomedical fine-tuning; domain-specific adaptation; hallucination in AI models; large language models (LLMs).

PubMed Disclaimer

Conflict of interest statement

None declared.

References

    1. Eriksen AV, Möller S, Ryg J. Use of GPT-4 to Diagnose Complex Clinical Cases. Massachusetts Medical Society; 2023.
    1. Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med. 2022;28:31-38. - PubMed
    1. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29:1930-1940. - PubMed
    1. Goh E, Gallo R, Hom J, et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw Open. 2024;7:e2440969. - PMC - PubMed
    1. Blease CR, Locher C, Gaab J, Hägglund M, Mandl KD. Generative artificial intelligence in primary care: an online survey of UK general practitioners. BMJ Health Care Inf. 2024;31:e101102. - PMC - PubMed

Publication types

LinkOut - more resources