[Preprint]. 2024 Nov 20:arXiv:2410.18856v3.

Demystifying Large Language Models for Medicine: A Primer

Qiao Jin et al. ArXiv.

Abstract

Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare by generating human-like responses across diverse contexts and adapting to novel tasks following human instructions. Their potential applications span a broad range of medical tasks, such as clinical documentation, matching patients to clinical trials, and answering medical questions. In this primer paper, we propose an actionable guideline to help healthcare professionals more efficiently utilize LLMs in their work, along with a set of best practices. This approach consists of several main phases: formulating the task, choosing LLMs, prompt engineering, fine-tuning, and deployment. We begin by discussing critical considerations in identifying healthcare tasks that align with the core capabilities of LLMs and in selecting models based on the task, the available data, performance requirements, and the model interface. We then review strategies, such as prompt engineering and fine-tuning, for adapting general-purpose LLMs to specialized medical tasks. Deployment considerations, including regulatory compliance, ethical guidelines, and continuous monitoring for fairness and bias, are also discussed. By providing a structured step-by-step methodology, this tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice, ensuring that these powerful technologies are applied in a safe, reliable, and impactful manner.


Conflict of interest statement

Disclosures The recommendations in this article are those of the authors and do not necessarily represent the official position of the National Institutes of Health.

Figures

Figure 1.
Overview of the proposed systematic approach to utilizing large language models in medicine. Users first formulate the medical task and select the LLM accordingly. Then, users can try different prompt engineering approaches with the selected LLM to solve the task. If the results are not satisfactory, users can fine-tune the LLMs. After the method development, users also need to consider various factors at the deployment stage. The corresponding best practices from Box 1 are listed in each phase.
Figure 2.
An overview of five common task formulations enabled by LLMs in medicine, with a set of examples. LLMs can answer questions using their domain knowledge and reasoning capabilities. The summarization task shortens long texts into concise summaries. The translation task transforms the source text to the target text in different language styles. The structurization task converts unstructured texts into structured key-value (KV) pairs. LLMs can also be used to support multi-modal data analysis such as interpreting medical images.
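The structurization task described above can be sketched in code. The snippet below builds a prompt that asks a model to return key-value (KV) pairs as JSON, then parses the completion; the `build_structurization_prompt` and `parse_structured_output` helpers, the example note, and the stand-in completion are all illustrative assumptions, not an API from the paper.

```python
import json

def build_structurization_prompt(note: str, fields: list) -> str:
    """Assemble a prompt asking the model to extract the given fields
    from an unstructured note as a JSON object of key-value pairs."""
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from the clinical note as a JSON "
        f"object with exactly these keys: {field_list}. "
        "Use null for any field not mentioned.\n\n"
        f"Note: {note}"
    )

def parse_structured_output(completion: str) -> dict:
    """Parse the model's JSON completion into key-value pairs."""
    return json.loads(completion)

prompt = build_structurization_prompt(
    "58-year-old male with hypertension, on lisinopril 10 mg daily.",
    ["age", "sex", "diagnosis", "medication"],
)
# A real call would send `prompt` to an LLM; this completion is a stand-in.
fake_completion = (
    '{"age": 58, "sex": "male", "diagnosis": "hypertension", '
    '"medication": "lisinopril 10 mg daily"}'
)
record = parse_structured_output(fake_completion)
```

Constraining the output to a fixed set of JSON keys makes the completion machine-readable, which is what distinguishes structurization from free-text summarization.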
Figure 3.
Considerations for choosing an LLM. Users need to choose LLMs that can handle the modality and context length of the selected task. They also need to understand the capabilities of LLMs on the medical task. The gold standard is manual evaluation of the model’s behavior on real clinical tasks, but this is expensive and time-consuming. Models might first be screened for basic medical capabilities with automatic evaluations on medical examinations, and only models that pass the medical examinations proceed to clinical evaluation. During the development phase, users should use model APIs and/or local models for better controllability and safety features. Web applications such as online chatbots are suitable for deployment to reach more users. EHR: electronic health records. MCQ: multiple-choice questions.
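The screening step above (modality, context length, and interface) can be expressed as a simple filter. This is a minimal sketch: the `ModelSpec` type, the catalog entries, and their numbers are hypothetical placeholders, not specifications of any real model.

```python
from dataclasses import dataclass

@dataclass
class ModelSpec:
    name: str
    modalities: set        # e.g. {"text"} or {"text", "image"}
    context_length: int    # maximum tokens the model can attend to
    interface: str         # "api", "local", or "web"

def shortlist(models, required_modalities, min_context, allowed_interfaces):
    """Keep only models that cover the task's modalities, fit its context
    length, and expose an interface suitable for the current phase."""
    return [
        m for m in models
        if required_modalities <= m.modalities
        and m.context_length >= min_context
        and m.interface in allowed_interfaces
    ]

# Hypothetical catalog for illustration; numbers are not real model specs.
catalog = [
    ModelSpec("text-api-model", {"text"}, 128_000, "api"),
    ModelSpec("multimodal-local-model", {"text", "image"}, 32_000, "local"),
    ModelSpec("web-chatbot", {"text"}, 8_000, "web"),
]
# Development phase for an image+text task: prefer API/local interfaces.
candidates = shortlist(catalog, {"text", "image"}, 16_000, {"api", "local"})
```

Filtering on hard constraints first narrows the pool before the expensive step, i.e., manual clinical evaluation of the remaining candidates.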
Figure 4.
An overview of prompt engineering and fine-tuning techniques. a, Task examples are shown to the model in few-shot learning (FSL). b, Tool learning provides the model with access to external tools such as database utilities. c, Chain-of-thought (CoT) prompting instructs the model to generate step-by-step rationales. d, Retrieval-augmented generation (RAG) provides relevant materials for solving the task. e, The patient-to-trial matching task, where the patient summary and the clinical trial eligibility criterion are given. f, An overview of fine-tuning, including when and how to perform it. g, The inputs to LLMs are known as “prompts”, and their outputs are “completions”. h, An example LLM output that contains the CoT rationale as well as the short answer, organized in JSON format. n denotes the number of shots for few-shot learning, and N denotes the number of instances for fine-tuning.
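The few-shot and CoT components above can be combined in one prompt, with the completion parsed back into a rationale and a short answer. This sketch uses the patient-to-trial matching example from panel e; the helper names, the JSON schema, and the stand-in completion are assumptions for illustration, not the paper's implementation.

```python
import json

def build_few_shot_prompt(instruction, examples, query):
    """Concatenate an instruction, n worked examples (few-shot learning),
    and the new query into a single prompt."""
    parts = [instruction]
    for shot_input, shot_output in examples:
        parts.append(f"Input: {shot_input}\nOutput: {shot_output}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

def parse_cot_completion(completion):
    """Parse a JSON completion holding the CoT rationale and short answer."""
    obj = json.loads(completion)
    return obj["rationale"], obj["answer"]

instruction = (
    "Decide whether the patient meets the trial eligibility criterion. "
    'Think step by step, then reply as JSON: '
    '{"rationale": "...", "answer": "yes" or "no"}.'
)
shots = [(
    "Criterion: age > 18. Patient: 45-year-old woman.",
    '{"rationale": "45 is greater than 18.", "answer": "yes"}',
)]
prompt = build_few_shot_prompt(
    instruction, shots, "Criterion: age > 18. Patient: 12-year-old boy."
)
# A real call would send `prompt` to an LLM; this completion is a stand-in.
rationale, answer = parse_cot_completion(
    '{"rationale": "12 is not greater than 18.", "answer": "no"}'
)
```

Requesting the rationale and the short answer in one JSON object keeps the step-by-step reasoning available for review while the `answer` field stays trivially machine-checkable.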
