ChatGPT v4 outperforming v3.5 on cancer treatment recommendations in quality, clinical guideline, and expert opinion concordance

Chung-You Tsai et al. Digit Health. 2024 Aug 14;10:20552076241269538. doi: 10.1177/20552076241269538. eCollection 2024 Jan-Dec.

Abstract

Objectives: To assess the quality and alignment of ChatGPT's cancer treatment recommendations (RECs) with National Comprehensive Cancer Network (NCCN) guidelines and expert opinions.

Methods: Three urologists performed quantitative and qualitative assessments in October 2023, analyzing responses from ChatGPT-4 and ChatGPT-3.5 to 108 prostate, kidney, and bladder cancer prompts built from two zero-shot prompt templates. Performance was evaluated with five ratios: expert-approved, expert-disagreed, and NCCN-aligned RECs relative to total ChatGPT RECs, plus coverage and adherence rates relative to the NCCN guidelines. Experts rated response quality on a 1-5 scale for correctness, comprehensiveness, specificity, and appropriateness.
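To make the five per-prompt ratios concrete, the following minimal Python sketch shows one way such indicators could be computed. It is illustrative only; the data structure, field names, and example counts are assumptions for exposition, not the authors' analysis code.

from dataclasses import dataclass

@dataclass
class PromptResult:
    chatgpt_recs: int       # total RECs returned by ChatGPT for one prompt
    approved_recs: int      # RECs the expert raters approved
    disagreed_recs: int     # RECs the expert raters disagreed with
    nccn_aligned_recs: int  # ChatGPT RECs also listed in the NCCN guideline
    nccn_recs: int          # total RECs the NCCN guideline lists for the same scenario

def concordance_ratios(r: PromptResult) -> dict:
    """Return the five performance indicators as fractions (0-1, coverage may exceed 1)."""
    return {
        "rater_approved_ratio": r.approved_recs / r.chatgpt_recs,
        "rater_disagreed_ratio": r.disagreed_recs / r.chatgpt_recs,
        "nccn_aligned_ratio": r.nccn_aligned_recs / r.chatgpt_recs,
        "coverage_ratio": r.chatgpt_recs / r.nccn_recs,           # ChatGPT RECs vs. NCCN RECs
        "nccn_adherence_ratio": r.nccn_aligned_recs / r.nccn_recs,  # NCCN-aligned RECs vs. NCCN RECs
    }

# Hypothetical example: ChatGPT returns 6 RECs, 5 approved, 1 disagreed,
# 4 matching NCCN, against 5 NCCN-listed options.
print(concordance_ratios(PromptResult(6, 5, 1, 4, 5)))

Under this reading, the coverage ratio exceeds 1 when ChatGPT lists more RECs than the NCCN guideline (the situation illustrated in Figure 1(a)) and falls below 1 when it lists fewer (Figure 1(b)).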

Results: ChatGPT-4 outperformed ChatGPT-3.5 on prostate cancer inquiries, with an average word count of 317.3 versus 124.4 (p < 0.001) and 6.1 versus 3.9 RECs (p < 0.001). Its rater-approved REC ratio (96.1% vs. 89.4%) and alignment with NCCN guidelines (76.8% vs. 49.1%, p = 0.001) were superior, and it scored significantly better on all quality dimensions. Across 108 prompts covering the three cancers, ChatGPT-4 produced an average of 6.0 RECs per case, with an 88.5% rater approval rate, 86.7% NCCN concordance, and only a 9.5% disagreement rate. It achieved high marks in correctness (4.5), comprehensiveness (4.4), specificity (4.0), and appropriateness (4.4). Subgroup analyses across cancer types, disease statuses, and prompt templates are also reported.

Conclusions: ChatGPT-4 demonstrated significant improvement in providing accurate and detailed treatment recommendations for urological cancers in line with clinical guidelines and expert opinion. However, it is vital to recognize that AI tools are not without flaws and should be utilized with caution. ChatGPT could supplement, but not replace, personalized advice from healthcare professionals.

Keywords: Artificial intelligence; ChatGPT; bladder; cancers; kidney; patient information; prostate.


Conflict of interest statement

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1. Performance indicator variability in assessing ChatGPT's recommendations against NCCN guidelines across two scenarios. Panels (a) and (b) display the discrepancy in cancer treatment recommendations (RECs) by ChatGPT relative to NCCN guidelines for two distinct cancer queries, as assessed by a single rater. (a) A scenario in which ChatGPT provides more RECs than the NCCN guidelines. (b) A scenario in which ChatGPT provides fewer RECs than the NCCN guidelines.
Figure 2. Comparison of treatment recommendations for prostate cancer: ChatGPT-3.5 vs. ChatGPT-4. Panels (a) to (d) illustrate the differences in performance between the two ChatGPT models when queried about prostate cancer using 32 unique prompts: (a) Response word count. (b) Number of treatment recommendations (RECs) provided by ChatGPT per query. (c) Concordance rate of recommendations, evaluated by four performance indicators. (d) Quality assessments on a 5-point scale (1-5) in four dimensions: correctness, comprehensiveness, specificity, and appropriateness. ChatGPT-4 (red) significantly outperformed ChatGPT-3.5 (green) in most measured aspects. All bar charts present mean values with standard deviations. Significant differences (p < 0.01) between the two models are indicated by double asterisks (**).
Figure 3. ChatGPT-4's overall concordance rate and quality assessments using 108 prompts. This bar chart displays the concordance rates of ChatGPT-4's treatment recommendations (RECs) when queried about prostate, kidney, and bladder cancers using 108 unique prompts. Concordance rates were evaluated across four performance indicators: (1) Rater-approved ChatGPT REC ratio (based on total ChatGPT RECs). (2) NCCN-aligned ChatGPT REC ratio (based on total ChatGPT RECs). (3) ChatGPT REC/NCCN REC ratio. (4) NCCN-aligned ChatGPT REC/NCCN REC ratio. Quality assessments were evaluated on a 5-point scale (1-5) in four dimensions: correctness, comprehensiveness, specificity, and appropriateness.
Figure 4. Subgroup analysis stratified by cancer type: concordance of ChatGPT-4's treatment recommendations. This bar chart displays the concordance rates of ChatGPT-4's treatment recommendations (RECs) when queried about prostate, kidney, and bladder cancers using 108 unique prompts. Concordance rates were evaluated across four performance indicators: (1) Rater-approved ChatGPT REC ratio (based on total ChatGPT RECs). (2) NCCN-aligned ChatGPT REC ratio (based on total ChatGPT RECs). (3) ChatGPT REC/NCCN REC ratio. (4) NCCN-aligned ChatGPT REC/NCCN REC ratio. The values in the bar chart are presented as means. A significant difference (p < 0.05) between cancer types is indicated by an asterisk (*).
Figure 5. Subgroup analysis stratified by disease status: concordance of ChatGPT-4's treatment recommendations. This bar chart displays the concordance rates of ChatGPT-4's treatment recommendations (RECs) when queried about localized, systemic, and recurrent cancers using 108 unique prompts. Concordance rates were evaluated across the same four performance indicators. The values in the bar chart are presented as means. A significant difference (p < 0.05) between disease statuses is indicated by an asterisk (*).

