[Preprint]. 2024 Nov 20:arXiv:2410.18856v3.

Demystifying Large Language Models for Medicine: A Primer

Qiao Jin et al. ArXiv.

Abstract

Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare by generating human-like responses across diverse contexts and adapting to novel tasks following human instructions. Their potential applications span a broad range of medical tasks, such as clinical documentation, matching patients to clinical trials, and answering medical questions. In this primer paper, we propose an actionable guideline to help healthcare professionals more efficiently utilize LLMs in their work, along with a set of best practices. This approach consists of several main phases: formulating the task, choosing LLMs, prompt engineering, fine-tuning, and deployment. We begin by discussing critical considerations in identifying healthcare tasks that align with the core capabilities of LLMs and in selecting models based on the task, the available data, performance requirements, and the model interface. We then review strategies, such as prompt engineering and fine-tuning, for adapting general-purpose LLMs to specialized medical tasks. Deployment considerations, including regulatory compliance, ethical guidelines, and continuous monitoring for fairness and bias, are also discussed. By providing a structured step-by-step methodology, this tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice, ensuring that these powerful technologies are applied in a safe, reliable, and impactful manner.


Conflict of interest statement

Disclosures The recommendations in this article are those of the authors and do not necessarily represent the official position of the National Institutes of Health.

Figures

Figure 1.
Overview of the proposed systematic approach to utilizing large language models in medicine. Users first formulate the medical task and select the LLM accordingly. Then, users can try different prompt engineering approaches with the selected LLM to solve the task. If the results are not satisfactory, users can fine-tune the LLMs. After the method development, users also need to consider various factors at the deployment stage. The corresponding best practices from Box 1 are listed in each phase.
Figure 2.
An overview of five common task formulations enabled by LLMs in medicine, with a set of examples. LLMs can answer questions using their domain knowledge and reasoning capabilities. The summarization task shortens long texts into concise summaries. The translation task transforms the source text to the target text in different language styles. The structurization task converts unstructured texts into structured key-value (KV) pairs. LLMs can also be used to support multi-modal data analysis such as interpreting medical images.
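The structurization task described above can be sketched in code. The snippet below builds a prompt that asks a model to return key-value (KV) pairs as JSON, then parses the completion; the `build_structurization_prompt` and `parse_structured_output` helpers, the example note, and the stand-in completion are all illustrative assumptions, not an API from the paper.

```python
import json

def build_structurization_prompt(note: str, fields: list) -> str:
    """Assemble a prompt asking the model to extract the given fields
    from an unstructured note as a JSON object of key-value pairs."""
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from the clinical note as a JSON "
        f"object with exactly these keys: {field_list}. "
        "Use null for any field not mentioned.\n\n"
        f"Note: {note}"
    )

def parse_structured_output(completion: str) -> dict:
    """Parse the model's JSON completion into key-value pairs."""
    return json.loads(completion)

prompt = build_structurization_prompt(
    "58-year-old male with hypertension, on lisinopril 10 mg daily.",
    ["age", "sex", "diagnosis", "medication"],
)
# A real call would send `prompt` to an LLM; this completion is a stand-in.
fake_completion = (
    '{"age": 58, "sex": "male", "diagnosis": "hypertension", '
    '"medication": "lisinopril 10 mg daily"}'
)
record = parse_structured_output(fake_completion)
```

Constraining the output to a fixed set of JSON keys makes the completion machine-readable, which is what distinguishes structurization from free-text summarization.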
Figure 3.
Considerations for choosing an LLM. Users need to choose LLMs that can handle the modality and context length of the selected task. They also need to understand the capabilities of LLMs on the medical task. The gold standard is manual evaluation of the model’s behavior on real clinical tasks, but this is expensive and time-consuming. Models might first be screened for basic medical capabilities with automatic evaluations on medical examinations, and only models that pass the medical examinations proceed to clinical evaluation. During the development phase, users should use model APIs and/or local models for better controllability and safety features. Web applications such as online chatbots are suitable for deployment to reach more users. EHR: electronic health records. MCQ: multiple-choice questions.
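The screening step above (modality, context length, and interface) can be expressed as a simple filter. This is a minimal sketch: the `ModelSpec` type, the catalog entries, and their numbers are hypothetical placeholders, not specifications of any real model.

```python
from dataclasses import dataclass

@dataclass
class ModelSpec:
    name: str
    modalities: set        # e.g. {"text"} or {"text", "image"}
    context_length: int    # maximum tokens the model can attend to
    interface: str         # "api", "local", or "web"

def shortlist(models, required_modalities, min_context, allowed_interfaces):
    """Keep only models that cover the task's modalities, fit its context
    length, and expose an interface suitable for the current phase."""
    return [
        m for m in models
        if required_modalities <= m.modalities
        and m.context_length >= min_context
        and m.interface in allowed_interfaces
    ]

# Hypothetical catalog for illustration; numbers are not real model specs.
catalog = [
    ModelSpec("text-api-model", {"text"}, 128_000, "api"),
    ModelSpec("multimodal-local-model", {"text", "image"}, 32_000, "local"),
    ModelSpec("web-chatbot", {"text"}, 8_000, "web"),
]
# Development phase for an image+text task: prefer API/local interfaces.
candidates = shortlist(catalog, {"text", "image"}, 16_000, {"api", "local"})
```

Filtering on hard constraints first narrows the pool before the expensive step, i.e., manual clinical evaluation of the remaining candidates.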
Figure 4.
An overview of prompt engineering and fine-tuning techniques. a, Task examples are shown to the model in few-shot learning (FSL). b, Tool learning provides the model with access to external tools such as database utilities. c, Chain-of-thought (CoT) prompting instructs the model to generate step-by-step rationales. d, Retrieval-augmented generation (RAG) provides relevant materials for solving the task. e, The patient-to-trial matching task, where the patient summary and the clinical trial eligibility criterion are given. f, An overview of fine-tuning, including when and how to perform it. g, The inputs to LLMs are known as “prompts”, and their outputs are “completions”. h, An example LLM output that contains the CoT rationale as well as the short answer, organized in JSON format. n denotes the number of shots for few-shot learning, and N denotes the number of instances for fine-tuning.
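The few-shot and CoT components above can be combined in one prompt, with the completion parsed back into a rationale and a short answer. This sketch uses the patient-to-trial matching example from panel e; the helper names, the JSON schema, and the stand-in completion are assumptions for illustration, not the paper's implementation.

```python
import json

def build_few_shot_prompt(instruction, examples, query):
    """Concatenate an instruction, n worked examples (few-shot learning),
    and the new query into a single prompt."""
    parts = [instruction]
    for shot_input, shot_output in examples:
        parts.append(f"Input: {shot_input}\nOutput: {shot_output}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

def parse_cot_completion(completion):
    """Parse a JSON completion holding the CoT rationale and short answer."""
    obj = json.loads(completion)
    return obj["rationale"], obj["answer"]

instruction = (
    "Decide whether the patient meets the trial eligibility criterion. "
    'Think step by step, then reply as JSON: '
    '{"rationale": "...", "answer": "yes" or "no"}.'
)
shots = [(
    "Criterion: age > 18. Patient: 45-year-old woman.",
    '{"rationale": "45 is greater than 18.", "answer": "yes"}',
)]
prompt = build_few_shot_prompt(
    instruction, shots, "Criterion: age > 18. Patient: 12-year-old boy."
)
# A real call would send `prompt` to an LLM; this completion is a stand-in.
rationale, answer = parse_cot_completion(
    '{"rationale": "12 is not greater than 18.", "answer": "no"}'
)
```

Requesting the rationale and the short answer in one JSON object keeps the step-by-step reasoning available for review while the `answer` field stays trivially machine-checkable.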
