2025 Oct 1;30(10):oyaf293.
doi: 10.1093/oncolo/oyaf293.

Large language model processing capabilities of ChatGPT 4.0 to generate molecular tumor board recommendations: a critical evaluation on real-world data


Maximilian Schmutz et al. Oncologist.

Abstract

Background: Large language models (LLMs) like ChatGPT 4.0 hold promise for enhancing clinical decision-making in precision oncology, particularly within molecular tumor boards (MTBs). This study assesses ChatGPT 4.0's performance in generating therapy recommendations for complex real-world cancer cases compared to expert human MTB (hMTB) teams.

Methods: We retrospectively analyzed 20 anonymized MTB cases from the Comprehensive Cancer Center Augsburg (CCCA), covering breast cancer (n = 3), glioblastoma (n = 3), colorectal cancer (n = 2), and rare tumors. ChatGPT 4.0 recommendations were evaluated against hMTB outputs using metrics including recommendation type (therapeutic/diagnostic), information density (IDM), consistency, quality (level of evidence [LoE]), and efficiency. Each case was prompted thrice to evaluate variability (Fleiss' Kappa).
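Fleiss' Kappa, used here to quantify agreement across the three replicate prompts per case, is a standard chance-corrected agreement statistic. The sketch below is a minimal, self-contained illustration of how it is computed from a subjects-by-categories count matrix; it is not the authors' analysis code, and the example data are hypothetical.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for N subjects rated by n raters into k categories.

    `ratings` is an N x k count matrix: ratings[i][j] is the number of
    raters (here: replicate ChatGPT runs) that assigned subject i to
    category j. Every row must sum to the same rater count n.
    """
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])
    total = n_subjects * n_raters

    # Per-subject observed agreement P_i
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_subjects

    # Expected chance agreement from overall category proportions
    n_categories = len(ratings[0])
    p_j = [sum(row[j] for row in ratings) / total for j in range(n_categories)]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)


# Hypothetical example: 2 cases, 3 replicate runs, 2 recommendation categories.
# All runs agree on both cases -> perfect agreement (kappa = 1.0).
print(fleiss_kappa([[3, 0], [0, 3]]))
```

Values around 0.5, as reported in this study, are conventionally read as "moderate" agreement.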

Results: ChatGPT 4.0 generated more therapeutic recommendations per case than hMTB (median 3 vs 1, P = .005), with comparable diagnostic suggestions (median 1 vs 2, P = .501). The therapeutic scope of ChatGPT 4.0 included off-label and clinical trial options. IDM scores indicated similar content depth between ChatGPT 4.0 (median 0.67) and hMTB (median 0.75; P = .084). Moderate consistency was observed across replicate runs (median Fleiss' Kappa = 0.51). ChatGPT 4.0 drew on lower-level or preclinical evidence more frequently than hMTB (P = .0019). ChatGPT 4.0 was significantly more efficient (median 15.2 vs 34.7 minutes; P < .001).

Conclusion: Incorporating ChatGPT 4.0 into MTB workflows enhances efficiency and provides relevant recommendations, especially in guideline-supported cases. However, variability in evidence prioritization highlights the need for ongoing human oversight. A hybrid approach, integrating human expertise with LLM support, may optimize precision oncology decision-making.

Keywords: ChatGPT 4.0; artificial intelligence; large language models; molecular tumor board; precision oncology; variant annotation.

Conflict of interest statement

The authors have no conflicts of interest to declare.

Figures

Figure 1.
Analytical strategy: ChatGPT 4.0 and human oncological experts from the molecular tumor board (hMTB) generated recommendations for standard cases as baseline feasibility and for 20 anonymized patient cases from the molecular tumor board. Recommendations were then evaluated by 2 independent human reviewers focusing on type of recommendation, information density, consistency of information, quality of information and process efficiency. LLM, large language model.
Figure 2.
(A) Illustrates the distribution of recommendations across the cases. The top panel of the figure represents the recommendations made by human experts, while the bottom panel shows the averaged recommendations from the GPT model. Therapeutic recommendations are depicted in orange, and diagnostic recommendations are depicted in red. Error bars in the GPT panel indicate the standard deviation across the triplicates. Statistical analysis was conducted by performing the Wilcoxon signed-rank test on the therapeutic and diagnostic recommendations separately. (B) Shows the Bland-Altman Plots both for therapeutic and diagnostic recommendations. Orange dots represent a single case, with its position on the plot showing both the mean of the GPT and expert recommendations on the x-axis and the difference between them on the y-axis. The central red dashed line represents the mean difference (bias) between the 2 methods, while the black dashed lines indicate the 95% limits of agreement, calculated as the mean difference ±1.96 times the standard deviation of the differences.
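The Bland-Altman limits of agreement described in the caption follow a simple formula: the bias is the mean of the paired differences, and the 95% limits are that bias plus or minus 1.96 times the sample standard deviation of the differences. A minimal sketch of that calculation, with hypothetical counts rather than the study's data:

```python
import statistics


def bland_altman_limits(a, b):
    """Bias and 95% limits of agreement for paired measurements a, b.

    Returns (bias, lower, upper), where bias is the mean difference
    a - b and the limits are bias +/- 1.96 * SD of the differences
    (sample SD, n - 1 denominator), as in the figure caption.
    """
    diffs = [x - y for x, y in zip(a, b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical per-case recommendation counts: GPT vs expert.
gpt = [3, 4, 5, 6]
expert = [2, 4, 4, 6]
bias, lower, upper = bland_altman_limits(gpt, expert)
print(bias, lower, upper)
```

In the plot, each case would sit at x = (GPT + expert) / 2, y = GPT - expert, with the dashed lines at `bias`, `lower`, and `upper`.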
Figure 3.
The line plot shows IDM scores assigned by 2 independent reviewers for MTB recommendations generated by GPT-4.0 (with each reviewer evaluating 3 sets of recommendations [triplicates]) (A) and by a human expert group across 20 cases (B). A Wilcoxon signed-rank test revealed no statistically significant differences between the overall depth of information provided by GPT-4.0 and the human expert group (P = .546). IDM, information density metric; MTB, molecular tumor board.
Figure 4.
(A) Bar chart displaying Fleiss’ Kappa values for each clinical case, indicating the level of agreement between different GPT-4.0 accounts in assigning therapy recommendations across replications. (B) Boxplot summarizing the distribution of Fleiss’ Kappa values across all cases, highlighting the overall moderate agreement with a mean Kappa value of 0.51.
Figure 5.
Comparison of levels of evidence (LoE) assigned to therapy recommendations by ChatGPT 4.0 and human experts. (A) Bar plots showing the distribution of LoE for therapy recommendations generated by GPT 4.0 across 3 replicates and by human experts. LoEs range from 1A (strongest evidence) to 4 (weakest evidence). The plots compare the frequency of each LoE category assigned by 2 independent reviewers for GPT 4.0 and human expert recommendations. (B) Average differences in LoE scores between GPT 4.0-generated therapeutic recommendations and human molecular tumor board (hMTB) recommendations across 20 cases. Bars represent the mean difference in LoE scores, with positive values indicating higher scores assigned by GPT. Error bars denote the standard deviation across 3 replicates. Cases marked with "data incompl." indicate insufficient data for more than one replicate or missing data for the hMTB group for both reviewers.
Figure 6.
Temporal benchmarking of LLM performance in MTB recommendations. Information Density Metric (IDM) scores for 4 representative cases (Case 2, Case 9, Case 18, and Case 19) across 4 ChatGPT model versions: GPT-4.0 (baseline), GPT-4o (August 2024), GPT-4o (August 2025), and GPT-4o with Deep Research mode (August 2025). Cases were selected to reflect a spectrum of complexity, initial performance, and reviewer concordance. Bars represent mean IDM scores from triplicate outputs per model version. Temporal progression shows reduced performance for GPT-4o (Aug 2024) compared to baseline, followed by marked improvement in GPT-4o (Aug 2025) and Deep Research outputs, the latter providing textually rich, evidence-cited recommendations. LLM, large language model; MTB, molecular tumor board.
Figure 7.
Time required for preparing MTB cases. Boxplot comparing the time required (in minutes) to prepare MTB (molecular tumor board) cases using a ChatGPT-assisted approach vs human experts. The plot shows that the ChatGPT-assisted approach required significantly less time than the human experts, with the distribution of times displayed for both groups. The P-value of less than .001 indicates a significant difference in preparation time between the 2 methods.
