Bioengineering (Basel). 2025 Dec 10;12(12):1345.
doi: 10.3390/bioengineering12121345.

An External Validation Study on Two Pre-Trained Large Language Models for Multimodal Prognostication in Laryngeal and Hypopharyngeal Cancer: Integrating Clinical, Treatment, and Radiomic Data to Predict Survival Outcomes with Interpretable Reasoning

Wing-Keen Yap et al. Bioengineering (Basel).

Abstract

Background: Laryngeal and hypopharyngeal cancers (LHCs) exhibit heterogeneous outcomes after definitive radiotherapy (RT). Large language models (LLMs) may enhance prognostic stratification by integrating complex clinical and imaging data. This study validated two pre-trained LLMs, GPT-4o-2024-08-06 and Gemma-2-27b-it, for outcome prediction in LHC.

Methods: Ninety-two patients with non-metastatic LHC treated with definitive (chemo)radiotherapy at Linkou Chang Gung Memorial Hospital (2006-2013) were retrospectively analyzed. First-order and 3D radiomic features were extracted from intra- and peritumoral regions on pre- and mid-RT CT scans. LLMs were prompted with clinical variables, radiotherapy notes, and radiomic features to classify patients as high- or low-risk for death, recurrence, and distant metastasis. Model performance was assessed using sensitivity, specificity, AUC, Kaplan-Meier survival analysis, and McNemar tests.

Results: Integration of radiomic features significantly improved prognostic discrimination over clinical/RT plan data alone for both LLMs. For death prediction, pre-RT radiomics were the most predictive: GPT-4o achieved a peak AUC of 0.730 using intratumoral features, while Gemma-2-27b reached 0.736 using peritumoral features. For recurrence prediction, mid-RT peritumoral features yielded optimal performance (AUC = 0.703 for GPT-4o; AUC = 0.709 for Gemma-2-27b). Kaplan-Meier analyses confirmed statistically significant separation of risk groups: pre-RT intra- and peritumoral features for overall survival (p < 0.05 for both GPT-4o and Gemma-2-27b), and mid-RT peritumoral features for recurrence-free survival (p = 0.028 for GPT-4o; p = 0.017 for Gemma-2-27b). McNemar tests revealed no significant performance difference between the two LLMs when augmented with radiomics (all p > 0.05), indicating that the open-source model achieved accuracy comparable to that of its proprietary counterpart. Both models generated clinically coherent, patient-specific rationales explaining risk assignments, enhancing interpretability and clinical trust.

Conclusions: This external validation demonstrates that pre-trained LLMs can serve as accurate, interpretable, and multimodal prognostic engines for LHC. Pre-RT radiomic features are critical for predicting mortality and metastasis, while mid-RT peritumoral features uniquely inform recurrence risk. The comparable performance of the open-source Gemma-2-27b-it model suggests a scalable, cost-effective, and privacy-preserving pathway for integrating LLM-based tools into precision radiation oncology workflows to enhance risk stratification and therapeutic personalization.
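The evaluation metrics named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and toy labels are hypothetical, the AUC is computed under the assumption that each model emits a single binary high/low-risk label (in which case the AUC equals the mean of sensitivity and specificity), and the exact binomial form of the McNemar test is assumed because the abstract does not state which variant was used.

```python
from math import comb


def confusion_counts(y_true, y_pred):
    """Count TP, FN, TN, FP for binary labels (1 = event / high-risk)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) of a binary risk label."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), tn / (tn + fp)


def binary_auc(y_true, y_pred):
    """AUC of a single-threshold (binary) predictor: (sens + spec) / 2."""
    sens, spec = sensitivity_specificity(y_true, y_pred)
    return (sens + spec) / 2


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value comparing two paired classifiers.

    Only discordant pairs matter: b = A correct and B wrong,
    c = A wrong and B correct. Under H0, b ~ Binomial(b + c, 0.5).
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return 1.0  # the two models never disagree on correctness
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy data, not taken from the study.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
model_a = [1, 1, 0, 0, 0, 1, 0, 1]
model_b = [1, 0, 0, 0, 0, 0, 0, 1]

print(sensitivity_specificity(y_true, model_a))
print(binary_auc(y_true, model_b))
print(mcnemar_exact(y_true, model_a, model_b))
```

A McNemar p-value above 0.05, as reported for the radiomics-augmented models, means the two classifiers' error patterns on the same patients are not distinguishable at that threshold; it does not by itself show that they are equivalent.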

Keywords: hypopharyngeal cancer; large language models; laryngeal cancer; prognosis; radiomics; survival prediction.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Figure 1
The McNemar test assessed the significance of differences between paired models for overall survival prediction using AUROC scores. Cells with p < 0.05 are highlighted in red.
Figure 2
The McNemar test assessed the significance of differences between paired models for recurrence prediction using AUROC scores. Cells with p < 0.05 are highlighted in red.
Figure 3
The McNemar test assessed the significance of differences between paired models for distant metastasis prediction using AUROC scores. Cells with p < 0.05 are highlighted in red.
Figure 4
Kaplan–Meier survival curves for overall survival stratified by the GPT-4o model: (A) with pre-RT intratumoral features, (B) with pre-RT peritumoral features, (C) with mid-RT intratumoral features, and (D) with mid-RT peritumoral features.
Figure 5
Kaplan–Meier survival curves for recurrence stratified by the GPT-4o model: (A) with pre-RT intratumoral features, (B) with pre-RT peritumoral features, (C) with mid-RT intratumoral features, and (D) with mid-RT peritumoral features.
Figure 6
Kaplan–Meier survival curves for distant metastasis stratified by the GPT-4o model: (A) with pre-RT intratumoral features, (B) with pre-RT peritumoral features, (C) with mid-RT intratumoral features, and (D) with mid-RT peritumoral features.
Figure 7
Kaplan–Meier survival curves for overall survival stratified by the Gemma-2-27b model: (A) with pre-RT intratumoral features, (B) with pre-RT peritumoral features, (C) with mid-RT intratumoral features, and (D) with mid-RT peritumoral features.
Figure 8
Kaplan–Meier survival curves for recurrence stratified by the Gemma-2-27b model: (A) with pre-RT intratumoral features, (B) with pre-RT peritumoral features, (C) with mid-RT intratumoral features, and (D) with mid-RT peritumoral features.
Figure 9
Kaplan–Meier survival curves for distant metastasis stratified by the Gemma-2-27b model: (A) with pre-RT intratumoral features, (B) with pre-RT peritumoral features, (C) with mid-RT intratumoral features, and (D) with mid-RT peritumoral features.
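For readers unfamiliar with how the curves in Figures 4 to 9 are constructed, the following is a minimal sketch of the Kaplan-Meier product-limit estimator in plain Python. The function name `kaplan_meier` and the follow-up data are hypothetical and are not drawn from the study; in practice one such curve would be computed per LLM-assigned risk group and the groups compared with a log-rank test, which is omitted here.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate.

    times  : follow-up time for each patient (any consistent unit)
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Multiply by the conditional probability of surviving past t.
            surv *= 1 - n_events / n_at_risk
            curve.append((t, surv))
        # Both events and censored patients leave the risk set after t.
        n_at_risk -= n_leaving
    return curve


# Hypothetical follow-up times (months) and event indicators.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]

for t, s in kaplan_meier(times, events):
    print(f"t={t}: S(t)={s:.2f}")
```

Censored patients contribute to the risk set up to their last follow-up but never trigger a drop in the curve, which is why the estimate only steps down at observed event times.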
