Assessing Completeness of Clinical Histories Accompanying Imaging Orders Using Adapted Open-Source and Closed-Source Large Language Models
- PMID: 39998369
- PMCID: PMC11868845
- DOI: 10.1148/radiol.241051
Abstract
Background: Incomplete clinical histories are a well-known problem in radiology. Previous dedicated quality improvement efforts focusing on reproducible assessments of the completeness of free-text clinical histories have relied on tedious manual analysis.

Purpose: To adapt and evaluate open-source and closed-source large language models (LLMs) for their ability to automatically extract clinical history elements within imaging orders, and to use the best-performing adapted open-source model to assess the completeness of a large sample of clinical histories as a benchmark for clinical practice.

Materials and Methods: This retrospective single-site study used previously extracted information accompanying CT, MRI, US, and radiography orders from August 2020 to May 2022 at an adult and pediatric emergency department of a 613-bed tertiary academic medical center. Two open-source LLMs (Llama 2-7B [Meta], Mistral-7B [Mistral AI]) and one closed-source LLM (GPT-4 Turbo [OpenAI]) were adapted using prompt engineering, in-context learning, and fine-tuning (open-source only) to extract the elements "past medical history," "what," "when," "where," and "clinical concern" from clinical histories. Model performance, interreader agreement, and semantic similarity between the models and the adjudicated manual annotations of two board-certified radiologists (16 and 3 years of postfellowship experience, respectively) were assessed using accuracy, Cohen κ (none to slight, 0.01-0.20; fair, 0.21-0.40; moderate, 0.41-0.60; substantial, 0.61-0.80; almost perfect, 0.81-1.00), and BERTScore, an LLM-based metric that quantifies how well two pieces of text convey the same meaning; 95% CIs were also calculated. The best-performing open-source model was then used to assess completeness on a large dataset of unannotated clinical histories.

Results: A total of 50 186 clinical histories were included (794 training, 150 validation, 300 initial testing, 48 942 real-world application). Of the two open-source models, Mistral-7B outperformed Llama 2-7B in assessing completeness and was further fine-tuned. Both Mistral-7B and GPT-4 Turbo showed substantial overall agreement with radiologists (mean κ, 0.73 [95% CI: 0.67, 0.78] to 0.77 [95% CI: 0.71, 0.82]) and with the adjudicated annotations (mean BERTScore, 0.96 [95% CI: 0.96, 0.97] for both models; P = .38). Mistral-7B also rivaled GPT-4 Turbo in performance (weighted overall mean accuracy, 91% [95% CI: 89, 93] vs 92% [95% CI: 90, 94]; P = .31) despite being a smaller model. Using Mistral-7B, 26.2% (12 803 of 48 942) of unannotated clinical histories were found to contain all five elements.

Conclusion: An easily deployable fine-tuned open-source LLM (Mistral-7B), rivaling GPT-4 Turbo in performance, effectively extracted clinical history elements with substantial agreement with radiologists and produced a benchmark for the completeness of a large sample of clinical histories. The model and code will be fully open-sourced.

© RSNA, 2025. Supplemental material is available for this article.
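To make the extraction step concrete, the following is a minimal, hypothetical sketch of prompting an instruction-tuned Mistral-7B checkpoint (via Hugging Face transformers) to extract the five history elements. The checkpoint name, prompt wording, and JSON output schema are illustrative assumptions; they are not the study's released prompts or fine-tuned weights.

```python
# Hypothetical sketch: element extraction with an instruction-tuned
# Mistral-7B via Hugging Face transformers. Prompt and schema are
# illustrative, not the paper's actual adaptation.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

ELEMENTS = ["past medical history", "what", "when", "where", "clinical concern"]

def extract_elements(history: str) -> dict:
    """Ask the model which of the five elements a free-text history contains."""
    prompt = (
        "Extract the following elements from the clinical history below. "
        f"Return a JSON object with keys {ELEMENTS}; use null for any "
        "element that is absent.\n\n"
        f"Clinical history: {history}"
    )
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt.
    text = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
    return json.loads(text)  # assumes the model emits valid JSON
```

In-context learning, as named in the Methods, would amount to prepending a few worked history-to-JSON examples to the prompt; fine-tuning would replace MODEL_ID with the adapted weights.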
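The agreement and similarity metrics named above (Cohen κ, BERTScore) are standard and available off the shelf. A hedged sketch using scikit-learn's cohen_kappa_score and the bert-score package, with toy labels rather than the study's data:

```python
# Hypothetical evaluation sketch: per-element Cohen kappa against
# radiologist labels, and BERTScore against adjudicated reference
# annotations. All values below are toy data, not study results.
from sklearn.metrics import cohen_kappa_score
from bert_score import score as bert_score  # pip install bert-score

# Binary presence labels (1 = element present), e.g. for "clinical concern".
model_labels       = [1, 0, 1, 1, 0, 1]
radiologist_labels = [1, 0, 1, 0, 0, 1]
kappa = cohen_kappa_score(model_labels, radiologist_labels)
print(f"Cohen kappa: {kappa:.2f}")  # 0.61-0.80 reads as 'substantial'

# Semantic similarity of an extracted span vs the adjudicated annotation.
candidates = ["fall from ladder two days ago"]
references = ["fell off a ladder 2 days prior"]
precision, recall, f1 = bert_score(candidates, references, lang="en")
print(f"BERTScore F1: {f1.mean().item():.2f}")
```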
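Finally, the real-world completeness benchmark reduces to checking whether all five elements were extracted. A sketch reusing the hypothetical extract_elements and ELEMENTS from the first snippet:

```python
# Hypothetical completeness check: a history counts as complete when all
# five elements are extracted (non-null, non-empty).
def is_complete(extracted: dict) -> bool:
    return all(extracted.get(k) for k in ELEMENTS)

def completeness_rate(histories: list[str]) -> float:
    complete = sum(is_complete(extract_elements(h)) for h in histories)
    return complete / len(histories)

# In the study, this kind of tally over 48 942 unannotated histories
# found 26.2% (12 803) to contain all five elements.
```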