Comparative Study
JCO Clin Cancer Inform. 2025 Mar:9:e2400230.
doi: 10.1200/CCI-24-00230. Epub 2025 Mar 20.

Large Language Models as Decision-Making Tools in Oncology: Comparing Artificial Intelligence Suggestions and Expert Recommendations

Loic Ah-Thiane et al. JCO Clin Cancer Inform. 2025 Mar.

Abstract

Purpose: To determine the accuracy of large language models (LLMs) in generating appropriate treatment options for patients with early breast cancer (BC) on the basis of their medical records.

Materials and methods: Retrospective study using anonymized medical records of patients with BC presented during multidisciplinary team meetings (MDTs) between January and April 2024. Three generalist artificial intelligence models (Claude3-Opus, GPT4-Turbo, and LLaMa3-70B) were used to generate treatment suggestions, which were compared with experts' decisions. The primary outcome was the rate of appropriate suggestions from the LLMs, compared with the reference experts' decisions. The secondary outcome was the LLMs' performance (F1 score and specificity) in generating appropriate suggestions for each treatment category.
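The workflow described above can be sketched as a loop that sends each anonymized record, together with a fixed prompt, to each model and stores the free-text suggestion for later comparison with the experts' decision. This is a minimal sketch under stated assumptions, not the authors' pipeline: `collect_suggestions`, the `ModelFn` type, and the stub model are hypothetical names, and no real vendor API is called. The prompt constant reproduces only the opening of the prompt quoted in the FIG 1 caption.

```python
from typing import Callable, Dict, List

# Opening sentences of the task prompt, as translated in the FIG 1 caption.
PROMPT = (
    "You are an expert in oncology, particularly in breast cancer. "
    "I need you to review the attached medical file of a patient "
    "who needs to be treated."
)

# Hypothetical: any function that maps a full prompt to a model's text reply.
ModelFn = Callable[[str], str]


def collect_suggestions(
    records: List[str], models: Dict[str, ModelFn]
) -> Dict[str, List[str]]:
    """Query every model once per record and keep the raw replies in order."""
    suggestions: Dict[str, List[str]] = {name: [] for name in models}
    for record in records:
        full_prompt = f"{PROMPT}\n\n{record}"
        for name, ask in models.items():
            suggestions[name].append(ask(full_prompt))
    return suggestions


# Stub standing in for a real model client (assumption, for illustration only).
def stub_model(prompt: str) -> str:
    return "validated surgery; endocrine therapy; radiotherapy"


demo = collect_suggestions(["<anonymized record>"], {"stub": stub_model})
print(len(demo["stub"]))  # one reply per record
```

Passing the model as a plain callable keeps the loop independent of any particular client library, so the same code could wrap each of the three models once a real client is available.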

Results: The rates of appropriate suggestions were 86.6% (97/112), 85.7% (96/112), and 75.0% (84/112) for Claude3-Opus, GPT4-Turbo, and LLaMa3-70B, respectively. No significant difference was found between Claude3-Opus and GPT4-Turbo (P = .85), but both tended to perform better than LLaMa3-70B (P = .027 and P = .043, respectively). LLMs showed high accuracy for adjuvant endocrine therapy and targeted therapy indications. However, they tended to overestimate the need for adjuvant radiotherapy and showed variable performance in suggesting adjuvant chemotherapy and genomic tests.
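The reported percentages follow directly from the counts in parentheses. The short check below recomputes them; the counts come from the Results, while the function and variable names are illustrative. It does not attempt to reproduce the P values, because the statistical test used is not stated in the abstract.

```python
# Appropriate suggestions out of 112 cases, as reported in the Results.
APPROPRIATE = {"Claude3-Opus": 97, "GPT4-Turbo": 96, "LLaMa3-70B": 84}
TOTAL_CASES = 112


def appropriate_rate(n_appropriate: int, n_total: int) -> float:
    """Percentage of appropriate suggestions, rounded to one decimal."""
    return round(100 * n_appropriate / n_total, 1)


rates = {model: appropriate_rate(n, TOTAL_CASES) for model, n in APPROPRIATE.items()}
for model, rate in rates.items():
    print(f"{model}: {rate}%")
# Claude3-Opus: 86.6%
# GPT4-Turbo: 85.7%
# LLaMa3-70B: 75.0%
```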

Conclusion: LLMs, particularly Claude3-Opus and GPT4-Turbo, demonstrated promising accuracy in suggesting appropriate adjuvant treatments for patients with early BC on the basis of their medical records. Although LLMs showed limitations in validating surgery and indicating genomic tests, their performance in other treatment modalities highlights their potential to automate and augment decision making during MDTs. Further studies with fine-tuned LLMs and a prospective design are needed to demonstrate their utility in clinical practice.

Conflict of interest statement

The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to www.asco.org/rwc or ascopubs.org/cci/author-center.

Open Payments is a public database containing information reported by companies about payments made to US-licensed physicians.

Loic Ah-Thiane

Travel, Accommodations, Expenses: Novartis, Astellas Pharma

Pierre-Etienne Heudel

Stock and Other Ownership Interests: GEODAISICS

Honoraria: Pfizer, Novartis (Inst), Seagen, Pierre Fabre, Amgen, AstraZeneca (Inst), Roche, Mylan, Gilead Sciences

Consulting or Advisory Role: Novartis (Inst), Seagen, Lilly

Research Funding: Fresenius (Inst)

Travel, Accommodations, Expenses: Pfizer, Novartis, Roche, Lilly

Mario Campone

Honoraria: Pfizer (Inst)

Consulting or Advisory Role: Pfizer (Inst)

Speakers' Bureau: Novartis (Inst), Lilly (Inst)

Travel, Accommodations, Expenses: Pfizer, Novartis, AstraZeneca

Marie Robert

Consulting or Advisory Role: Daiichi Sankyo/AstraZeneca

Travel, Accommodations, Expenses: Gilead Sciences, Novartis

Stéphane Supiot

Employment: AstraZeneca, Janssen, Ipsen, Astellas Pharma, Bayer, Ferring, Novartis

Honoraria: MSD Oncology, Novartis, Curium Pharma, AstraZeneca

Research Funding: Janssen (Inst), Astellas Medivation (Inst)

Travel, Accommodations, Expenses: Ipsen

Other Relationship: Janssen Oncology, Astellas Pharma

Jean-Sébastien Frenel

Consulting or Advisory Role: Novartis (Inst), Pfizer, Lilly, AstraZeneca (Inst), Daiichi Sankyo Europe GmbH (Inst), GlaxoSmithKline, Amgen, Seagen, Gilead Sciences, Clovis Oncology (Inst), MSD Oncology (Inst), Exact Sciences (Inst), Eisai, AbbVie

Travel, Accommodations, Expenses: Novartis, Lilly, Pfizer, Daiichi Sankyo Europe GmbH, AstraZeneca, Gilead Sciences, Seagen, MSD Oncology

No other potential conflicts of interest were reported.

Figures

FIG 1.
Illustration of the automated workflow for submitting medical records to the LLMs. Each selected case from our databases serves as an input to the LLMs, which output treatment suggestions that are compared with the decisions made by experts. The task was stated with the following prompt (translated from French): “You are an expert in oncology, particularly in breast cancer. I need you to review the attached medical file of a patient who needs to be treated. Based on your knowledge and skills in oncology, I want you to decide on the recommended complementary treatments. The first step is to tell me if the surgery is complete and validated by saying ‘validated surgery.’ If the surgery is validated, the second step is to list all the additional treatments indicated, including chemotherapy, anti–HER2-targeted therapy, endocrine therapy and radiotherapy.” HER2, human epidermal growth factor receptor 2; LLM, large language model.
FIG 2.
Distribution of the suggested treatments across the three LLMs. (A) The number of suggestions that were concordant, acceptable, or unacceptable compared with the decisions made by human experts. (B) The rates of appropriate suggestions (the sum of concordant and acceptable suggestions), which represented our primary end point. The rates of appropriate suggestions appeared higher with Claude3-Opus and GPT4-Turbo than with LLaMa3-70B. LLMs, large language models.
FIG 3.
Performance of the LLMs in appropriately suggesting each treatment modality. (A) The F1-score and (B) the specificity of the three LLMs for each treatment modality. The F1-score (based on recall and precision) reflects the capacity to correctly identify patients requiring a specific treatment, whereas specificity reflects the capacity to correctly identify patients for whom a specific treatment is not indicated. All three models were perfectly accurate for indications of adjuvant targeted therapy and endocrine therapy. Differences between the models were more visible for chemotherapy and genomic test indications and, to a lesser extent, for radiotherapy. LLM, large language model.
FIG A1.
Recall and precision of the LLMs. This figure shows the recall, also known as sensitivity (above), and the precision (below) of each model. These two metrics are combined in the F1-score according to the following formula: F1 = (2 × recall × precision)/(recall + precision). LLMs, large language models.
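The metrics named in the FIG 3 and FIG A1 captions can be computed from the four cells of a confusion matrix for one treatment modality. The sketch below uses the F1 formula given in the FIG A1 caption and the standard definitions of recall, precision, and specificity. The example counts are hypothetical and chosen only so that they sum to 112 cases; they are not results from the study.

```python
def recall(tp: int, fn: int) -> float:
    """Sensitivity: share of patients needing the treatment who were identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of suggested treatments that were actually indicated."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def specificity(tn: int, fp: int) -> float:
    """Share of patients not needing the treatment who were correctly spared."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def f1_score(rec: float, prec: float) -> float:
    """F1 = (2 x recall x precision) / (recall + precision)."""
    return 2 * rec * prec / (rec + prec) if (rec + prec) else 0.0


# Hypothetical counts for one modality (illustration only, not study data):
# a model that never misses an indication but over-suggests the treatment.
tp, fp, fn, tn = 40, 10, 0, 62

rec = recall(tp, fn)
prec = precision(tp, fp)
print(f"recall={rec:.3f} precision={prec:.3f}")
print(f"F1={f1_score(rec, prec):.3f} specificity={specificity(tn, fp):.3f}")
```

With these made-up counts, recall is perfect while specificity is not, which is the pattern one would expect when a model overestimates the need for a treatment.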

