Review

Implementing Large Language Models in Health Care: Clinician-Focused Review With Interactive Guideline

HongYi Li et al. J Med Internet Res. 2025 Jul 11;27:e71916. doi: 10.2196/71916.

Abstract

Background: Large language models (LLMs) can generate outputs understandable by humans, such as answers to medical questions and radiology reports. With the rapid development of LLMs, clinicians face a growing challenge in determining the most suitable algorithms to support their work.

Objective: We aimed to provide clinicians and other health care practitioners with systematic guidance in selecting an LLM that is relevant and appropriate to their needs and facilitate the integration process of LLMs in health care.

Methods: We conducted a literature search of full-text publications in English on clinical applications of LLMs published between January 1, 2022, and March 31, 2025, on PubMed, ScienceDirect, Scopus, and IEEE Xplore. We excluded papers from journals below a set citation threshold, as well as papers that did not focus on LLMs, were not research based, or did not involve clinical applications. We also conducted a literature search on arXiv within the same investigated period and included papers on the clinical applications of innovative multimodal LLMs. This led to a total of 270 studies.

Results: We collected 330 LLMs and recorded their application frequency in clinical tasks and the frequency with which they performed best in their context. On the basis of a 5-stage clinical workflow, we found that stages 2, 3, and 4 are key stages, involving numerous clinical subtasks and LLMs. However, the diversity of LLMs that may perform optimally in each context remains limited. GPT-3.5 and GPT-4 were the most versatile models in the 5-stage clinical workflow, applied to 52% (29/56) and 71% (40/56) of the clinical subtasks, respectively, and performing best in 29% (16/56) and 54% (30/56) of the clinical subtasks, respectively. General-purpose LLMs may not perform well in specialized areas, as they often require lightweight prompt engineering methods or fine-tuning on specific datasets to improve performance. Most LLMs with multimodal abilities are closed-source models and therefore lack transparency, model customization, and fine-tuning for specific clinical tasks; they may also pose challenges regarding data protection and privacy, which are common requirements in clinical settings.
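The "lightweight prompt engineering" mentioned above can be sketched minimally. The snippet below assembles a few-shot prompt for a hypothetical clinical subtask (report-impression classification); the task wording, examples, and function name are illustrative assumptions, not taken from the review:

```python
# Minimal sketch of few-shot prompt engineering for a clinical subtask.
# All examples and wording are illustrative assumptions only.

def build_fewshot_prompt(task_instruction, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the query."""
    parts = [task_instruction, ""]
    for inp, out in examples:
        parts.append(f"Input: {inp}")
        parts.append(f"Output: {out}")
        parts.append("")
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)

examples = [
    ("Chest X-ray shows no acute cardiopulmonary abnormality.", "Normal"),
    ("Right lower lobe opacity consistent with pneumonia.", "Abnormal"),
]
prompt = build_fewshot_prompt(
    "Classify the radiology report impression as Normal or Abnormal.",
    examples,
    "Mild cardiomegaly without pulmonary edema.",
)
print(prompt)
```

A prompt built this way would then be sent to a general-purpose LLM; the review's point is that such task-specific scaffolding (or fine-tuning) is often needed before these models perform acceptably in specialized clinical areas.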

Conclusions: In this review, we found that LLMs may help clinicians in a variety of clinical tasks. However, we did not find evidence of generalist clinical LLMs successfully applicable to a wide range of clinical tasks. Therefore, their clinical deployment remains challenging. On the basis of this review, we propose an interactive online guideline for clinicians to select suitable LLMs by clinical task. With a clinical perspective and free of unnecessary technical jargon, this guideline may be used as a reference to successfully apply LLMs in clinical settings.

Keywords: AI; LLM; LLM review; artificial intelligence; clinical; digital health; large language model.


Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1
Clinical large language model (LLM) selector—schematic description of the clinical LLM selector tree for clinicians to select LLMs suitable to their needs. API: application programming interface; GPU: graphics processing unit; TDP: thermal design power.
Figure 2
Integration of large language models (LLMs) in a hospital information system (HIS) in a 5-stage clinical workflow. HIS subsystem modules interact with 5 stages of a typical clinical workflow: registration and department guidance (stage 1), prediagnosis and examination (stage 2), diagnosis and treatment planning (stage 3), treatment and hospitalization (stage 4), and discharge and follow-up (stage 5). Patients are shown in orange when registered and in green when discharged. CDSS: clinical decision support system; EHR: electronic health record; LIS: laboratory information system; MAR: medication administration record; NCS: nurse call system; PACS: picture archiving and communication system; PMS: practice management system; RIS: radiology information system; robot symbol: possible clinical applications of LLMs (or multimodal LLMs).
Figure 3
Literature review screening process summary. The literature review process aimed at identifying articles that were relevant to large language models (LLMs) used in clinical work. A total of 15,699 articles were obtained using a keyword-combination search strategy. Following a selection process putting emphasis on the selection of innovative clinical applications of LLMs, we kept a total of 270 papers.
Figure 4
Log-transformed frequency of large language models (LLMs) by clinical stage, task (A-H), and subtask. Each panel shows the log-transformed number of studies using each LLM for specific subtasks within a 5-stage clinical workflow. Bar height reflects study count per LLM per subtask and stage. Colors indicate 56 subtask categories across stages. For example, panel F (medical image processing) spans stages 2 and 3, where stage 3 includes the image segmentation subtask for which MedVersa and LLMSeg were used in 1 study. ECG: electrocardiography; ICD: International Classification of Diseases.
Figure 5
Best-performing large language models (LLMs) by clinical stage, task (A-H), and subtask. Each panel shows the log-transformed frequency of LLMs that performed best for specific subtasks within a 5-stage clinical workflow. Bar height indicates how often an LLM performed best for each subtask within a given stage. Colors denote 56 subtask categories across stages. For example, panel F (medical image processing) covers stages 2 and 3, with stage 3 including the image segmentation subtask, where MedVersa and LLMSeg each ranked best in 1 study. ECG: electrocardiography; ICD: International Classification of Diseases.
Figure 6
Summary of input-output combinations and access paths for best-performing large language models (LLMs). Panels A to M show log-transformed counts of subtasks for which each LLM (with model size annotated) performed best under specific input-output combinations based on original study data. Inputs are grouped as mandatory (first) and optional (second); the dash (–) denotes absence of optional input. Bar height reflects the (log) count of subtasks for which an LLM performed best, and fill patterns indicate access paths—application programming interface (API; hachures), restricted LLMs (dots), and open-source LLMs (no filling). ECG: electrocardiography.
Figure 7
Computational resources and costs of the best-performing large language models (LLMs). Panels A to C show graphics processing unit (GPU) memory, thermal design power (TDP), and price (log-transformed in gray) for each LLM (annotated on the bars) based on literature sources. GPU specifications are from NVIDIA documentation; the missing memory for A100 or A800 is assumed as 40 GB. For GPUs that had both PCIe and SXM architectures, we assumed the use of PCIe with fluctuations in the TDP data. Prices are from eBay (April 2024) and for reference only. For LLMs using multiple GPUs, memory, TDP, and cost are summed. The bar color is associated with the pretraining (dark gray), fine-tuning (gray), and inference (light gray) stages in which the GPU is activated. Numbers 1 to 8 represent clinical task categories.
Figure 8
Modality and access paths of the best-performing large language models (LLMs). Bars show the log-transformed number of subtasks for which each LLM (with name and model size labeled) performed best, categorized by single-modal (Single) and multimodal (Multi) input and output, based on a literature review. We excluded LLMs that did not perform best in any subtask. Bar patterns represent application programming interface (API; stripes), restricted LLMs (dotted area), and open-source LLMs (blank).
