Large Language Models as Decision-Making Tools in Oncology: Comparing Artificial Intelligence Suggestions and Expert Recommendations
- PMID: 40112233
- PMCID: PMC11949217
- DOI: 10.1200/CCI-24-00230
Large Language Models as Decision-Making Tools in Oncology: Comparing Artificial Intelligence Suggestions and Expert Recommendations
Abstract
Purpose: To determine the accuracy of large language models (LLMs) in generating appropriate treatment options for patients with early breast cancer on the basis of their medical records.
Materials and methods: Retrospective study using anonymized medical records of patients with BC presented during multidisciplinary team meetings (MDTs) between January and April 2024. Three generalist artificial intelligence models (Claude3-Opus, GPT4-Turbo, and LLaMa3-70B) were used to generate treatment suggestions, which were compared with experts' decisions. The primary outcome was the rate of appropriate suggestions from the LLMs, compared with the reference experts' decisions. The secondary outcome was the LLMs' performances (F1 score and specificity) in generating appropriate suggestions for each treatment category.
Results: The rates of appropriate suggestions were 86.6% (97/112), 85.7% (96/112), and 75.0% (84/112) for Claude3-Opus, GPT4-Turbo, and LLaMa3-70B, respectively. No significant difference was found between Claude3-Opus and GPT4-Turbo (P = .85), but both tended to perform better than LLaMa3-70B (P = .027 and P = .043, respectively). LLMs showed high accuracy for adjuvant endocrine therapy and targeted therapy indications. However, they tended to overestimate the need for adjuvant radiotherapy and had variable performances in suggesting adjuvant chemotherapy and genomic tests.
Conclusion: LLMs, particularly Claude3-Opus and GPT4-Turbo, demonstrated promising accuracy in suggesting appropriate adjuvant treatments for patients with early BC on the basis of their medical records. Although LLMs showed limitations in validating surgery and indicating genomic tests, their performance in other treatment modalities highlights their potential to automate and augment decision making during MDTs. Further studies with fine-tuned LLMs and a prospective design are needed to demonstrate their utility in clinical practice.
Conflict of interest statement
The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to
Open Payments is a public database containing information reported by companies about payments made to US-licensed physicians (
No other potential conflicts of interest were reported.
Figures




References
Publication types
MeSH terms
LinkOut - more resources
Full Text Sources
Medical
Miscellaneous