ChatGPT v4 outperforming v3.5 on cancer treatment recommendations in quality, clinical guideline, and expert opinion concordance

Chung-You Tsai et al. Digit Health. 2024 Aug 14;10:20552076241269538. doi: 10.1177/20552076241269538. eCollection 2024 Jan-Dec.

Abstract

Objectives: To assess the quality and alignment of ChatGPT's cancer treatment recommendations (RECs) with National Comprehensive Cancer Network (NCCN) guidelines and expert opinions.

Methods: Three urologists performed quantitative and qualitative assessments in October 2023, analyzing responses from ChatGPT-4 and ChatGPT-3.5 to 108 prostate, kidney, and bladder cancer prompts built from two zero-shot prompt templates. Performance was evaluated with five ratios: expert-approved, expert-disagreed, and NCCN-aligned RECs relative to total ChatGPT RECs, plus coverage and adherence rates relative to the NCCN guidelines. Experts rated response quality on a 1-5 scale for correctness, comprehensiveness, specificity, and appropriateness.
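To make the five per-prompt ratios concrete, the following minimal Python sketch shows one way such indicators could be computed. It is illustrative only; the data structure, field names, and example counts are assumptions for exposition, not the authors' analysis code.

from dataclasses import dataclass

@dataclass
class PromptResult:
    chatgpt_recs: int       # total RECs returned by ChatGPT for one prompt
    approved_recs: int      # RECs the expert raters approved
    disagreed_recs: int     # RECs the expert raters disagreed with
    nccn_aligned_recs: int  # ChatGPT RECs also listed in the NCCN guideline
    nccn_recs: int          # total RECs the NCCN guideline lists for the same scenario

def concordance_ratios(r: PromptResult) -> dict:
    """Return the five performance indicators as fractions (0-1, coverage may exceed 1)."""
    return {
        "rater_approved_ratio": r.approved_recs / r.chatgpt_recs,
        "rater_disagreed_ratio": r.disagreed_recs / r.chatgpt_recs,
        "nccn_aligned_ratio": r.nccn_aligned_recs / r.chatgpt_recs,
        "coverage_ratio": r.chatgpt_recs / r.nccn_recs,           # ChatGPT RECs vs. NCCN RECs
        "nccn_adherence_ratio": r.nccn_aligned_recs / r.nccn_recs,  # NCCN-aligned RECs vs. NCCN RECs
    }

# Hypothetical example: ChatGPT returns 6 RECs, 5 approved, 1 disagreed,
# 4 matching NCCN, against 5 NCCN-listed options.
print(concordance_ratios(PromptResult(6, 5, 1, 4, 5)))

Under this reading, the coverage ratio exceeds 1 when ChatGPT lists more RECs than the NCCN guideline (the situation illustrated in Figure 1(a)) and falls below 1 when it lists fewer (Figure 1(b)).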

Results: ChatGPT-4 outperformed ChatGPT-3.5 on prostate cancer inquiries, with an average word count of 317.3 versus 124.4 (p < 0.001) and 6.1 versus 3.9 RECs (p < 0.001). Its rater-approved REC ratio (96.1% vs. 89.4%) and alignment with NCCN guidelines (76.8% vs. 49.1%, p = 0.001) were superior, and it scored significantly better on all quality dimensions. Across 108 prompts covering the three cancers, ChatGPT-4 produced an average of 6.0 RECs per case, with an 88.5% rater approval rate, 86.7% NCCN concordance, and only a 9.5% disagreement rate. It achieved high marks in correctness (4.5), comprehensiveness (4.4), specificity (4.0), and appropriateness (4.4). Subgroup analyses across cancer types, disease statuses, and prompt templates are also reported.

Conclusions: ChatGPT-4 demonstrated significant improvement in providing accurate and detailed treatment recommendations for urological cancers in line with clinical guidelines and expert opinion. However, it is vital to recognize that AI tools are not without flaws and should be utilized with caution. ChatGPT could supplement, but not replace, personalized advice from healthcare professionals.

Keywords: Artificial intelligence; ChatGPT; bladder; cancers; kidney; patient information; prostate.


Conflict of interest statement

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1. Performance indicator variability in assessing ChatGPT's recommendations against NCCN guidelines across two scenarios. Panels (a) and (b) display the discrepancy in cancer treatment recommendations (RECs) by ChatGPT relative to NCCN guidelines for two distinct cancer queries, as assessed by a single rater. (a) A scenario in which ChatGPT provides more RECs than the NCCN guidelines. (b) A scenario in which ChatGPT provides fewer RECs than the NCCN guidelines.
Figure 2. Comparison of treatment recommendations for prostate cancer: ChatGPT-3.5 vs. ChatGPT-4. Panels (a) to (d) illustrate the differences in performance between the two ChatGPT models when queried about prostate cancer using 32 unique prompts: (a) Response word count. (b) Number of treatment recommendations (RECs) provided by ChatGPT per query. (c) Concordance rate of recommendations, evaluated by four performance indicators. (d) Quality assessments on a 5-point scale (1-5) in four dimensions: correctness, comprehensiveness, specificity, and appropriateness. ChatGPT-4 (red) significantly outperformed ChatGPT-3.5 (green) in most measured aspects. All bar charts present mean values with standard deviations. Significant differences (p < 0.01) between the two models are indicated by double asterisks (**).
Figure 3. ChatGPT-4's overall concordance rate and quality assessments using 108 prompts. This bar chart displays the concordance rates of ChatGPT-4's treatment recommendations (RECs) when queried about prostate, kidney, and bladder cancers using 108 unique prompts. Concordance rates were evaluated across four performance indicators: (1) Rater-approved ChatGPT REC ratio (based on total ChatGPT RECs). (2) NCCN-aligned ChatGPT REC ratio (based on total ChatGPT RECs). (3) ChatGPT REC/NCCN REC ratio. (4) NCCN-aligned ChatGPT REC/NCCN REC ratio. Quality assessments were evaluated on a 5-point scale (1-5) in four dimensions: correctness, comprehensiveness, specificity, and appropriateness.
Figure 4. Subgroup analysis stratified by cancer type: concordance of ChatGPT-4's treatment recommendations. This bar chart displays the concordance rates of ChatGPT-4's treatment recommendations (RECs) when queried about prostate, kidney, and bladder cancers using 108 unique prompts. Concordance rates were evaluated across four performance indicators: (1) Rater-approved ChatGPT REC ratio (based on total ChatGPT RECs). (2) NCCN-aligned ChatGPT REC ratio (based on total ChatGPT RECs). (3) ChatGPT REC/NCCN REC ratio. (4) NCCN-aligned ChatGPT REC/NCCN REC ratio. The values in the bar chart are presented as means. A significant difference (p < 0.05) between cancer types is indicated by an asterisk (*).
Figure 5. Subgroup analysis stratified by disease status: concordance of ChatGPT-4's treatment recommendations. This bar chart displays the concordance rates of ChatGPT-4's treatment recommendations (RECs) when queried about localized, systemic, and recurrent cancers using 108 unique prompts. Concordance rates were evaluated across the same four performance indicators. The values in the bar chart are presented as means. A significant difference (p < 0.05) between disease statuses is indicated by an asterisk (*).

