Me-LLaMA: Foundation Large Language Models for Medical Applications

Qianqian Xie et al. Res Sq [Preprint]. 2024 May 22:rs.3.rs-4240043. doi: 10.21203/rs.3.rs-4240043/v1.

Abstract

Recent advancements in large language models (LLMs) such as ChatGPT and LLaMA have hinted at their potential to revolutionize medical applications, yet their use in clinical settings often reveals limitations stemming from a lack of training on medical-specific data. To address this challenge, this study introduces Me-LLaMA, a novel medical LLM family that includes the foundation models Me-LLaMA 13/70B and their chat-enhanced versions Me-LLaMA 13/70B-chat, developed through continual pre-training and instruction tuning of LLaMA2 on large medical datasets. Our methodology leverages a comprehensive domain-specific data suite: a large-scale continual pre-training dataset with 129B tokens, an instruction tuning dataset with 214k samples, and a new medical evaluation benchmark (MIBE) covering six critical medical tasks across 12 datasets. Our extensive evaluation on MIBE shows that the Me-LLaMA models achieve better overall performance than existing open-source medical LLMs in zero-shot, few-shot, and supervised settings. With task-specific instruction tuning, the Me-LLaMA models outperform ChatGPT on 7 of 8 datasets and GPT-4 on 5 of 8 datasets. In addition, we investigated the catastrophic forgetting problem; our results show that the Me-LLaMA models mitigate this issue better than other open-source medical LLMs. Me-LLaMA is one of the largest open-source medical foundation LLMs trained on both biomedical and clinical data. It exhibits superior performance across both general and medical tasks compared to other open-source medical LLMs, making it an attractive choice for medical AI applications. We release our models, datasets, and evaluation scripts at: https://github.com/BIDS-Xu-Lab/Me-LLaMA.
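Since the models, datasets, and evaluation scripts are openly released, a minimal usage sketch follows. It assumes the checkpoints can be loaded with the Hugging Face transformers library; the model path below is a placeholder, not a confirmed repository id (see the GitHub page above for the actual release).

    # Minimal sketch: loading a released Me-LLaMA chat checkpoint with the
    # Hugging Face transformers library. "me-llama-13b-chat" is a placeholder
    # path; obtain the actual weights via https://github.com/BIDS-Xu-Lab/Me-LLaMA.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "me-llama-13b-chat"  # hypothetical local path to the weights

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,  # LLaMA2-style weights run comfortably in fp16
        device_map="auto",          # shard across available GPUs
    )

    prompt = "Question: What are common causes of acute chest pain?\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True))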


Conflict of interest statement

The authors have no financial or non-financial conflicts of interest to disclose.

Figures

Figure 1. Comparison of the peak performance of the Me-LLaMA models and Meditron 70B in the few-shot setting.

Figure 2. Comparison of the performance of the Me-LLaMA models in both the zero-shot and task-specific instruction fine-tuning settings against the zero-shot performance of ChatGPT and GPT-4.

Figure 3. Overview of the study. We first developed the Me-LLaMA base models by continually pre-training LLaMA2 models on mixed pre-training data (stage 1). The Me-LLaMA-chat models were then developed by instruction fine-tuning the Me-LLaMA base models (stage 2). We further fine-tuned the Me-LLaMA base models on the task-specific training sets of the evaluation datasets to evaluate their performance in the supervised learning setting (stage 3), and evaluated the Me-LLaMA-chat models in the zero/few-shot setting (stage 4).
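To make the staged pipeline in Figure 3 concrete, here is an illustrative sketch of stage 1 (continual pre-training) written against the Hugging Face transformers Trainer API. This is not the authors' training code: the corpus file, sequence length, and hyperparameters are placeholders, and the actual 129B-token run would require large-scale distributed training rather than a single Trainer call.

    # Illustrative sketch of stage 1 (continual pre-training of LLaMA2 on a
    # medical corpus). Stage 2 (instruction tuning) follows the same pattern
    # on instruction-response data. All file names and hyperparameters below
    # are placeholders, not values from the paper.
    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    base = "meta-llama/Llama-2-13b-hf"
    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.pad_token = tokenizer.eos_token  # LLaMA2 ships without a pad token
    model = AutoModelForCausalLM.from_pretrained(base)

    # Placeholder for the mixed biomedical/clinical pre-training corpus.
    corpus = load_dataset("text", data_files={"train": "medical_corpus.txt"})["train"]
    train_ds = corpus.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
        batched=True,
        remove_columns=["text"],
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="me-llama-13b-cpt",
            per_device_train_batch_size=1,
            gradient_accumulation_steps=16,
            learning_rate=1e-5,
            num_train_epochs=1,
            bf16=True,
        ),
        train_dataset=train_ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()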


