An External Validation Study on Two Pre-Trained Large Language Models for Multimodal Prognostication in Laryngeal and Hypopharyngeal Cancer: Integrating Clinical, Treatment, and Radiomic Data to Predict Survival Outcomes with Interpretable Reasoning

Wing-Keen Yap¹, Shih-Chun Cheng², Chia-Hsin Lin¹,³, Ing-Tsung Hsiao⁴, Tsung-You Tsai⁵, Wing-Lake Yap⁶, Willy Po-Yuan Chen¹, Chien-Yu Lin¹, Shih-Ming Huang⁷

Affiliations

¹ Department of Radiation Oncology, Proton and Radiation Therapy Center, Linkou Chang Gung Memorial Hospital, College of Medicine, Chang Gung University, Kwei-Shan, Taoyuan 333, Taiwan.
² Department of Medical Imaging and Radiological Sciences, College of Medicine, Chang Gung University, Taoyuan 333, Taiwan.
³ UTHealth Graduate School of Biomedical Sciences, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA.
⁴ Department of Medical Imaging and Radiological Sciences, Healthy Aging Research Center, Chang Gung University, Taoyuan 333, Taiwan.
⁵ Department of Otolaryngology-Head and Neck Surgery, Linkou Chang Gung Memorial Hospital, College of Medicine, Chang Gung University, Kwei-Shan, Taoyuan 333, Taiwan.
⁶ Department of Post-Baccalaureate Medicine, Kaohsiung Medical University, Kaohsiung 807, Taiwan.
⁷ Department of Radiation Oncology, Keelung Chang Gung Memorial Hospital, Keelung 204, Taiwan.

PMID: 41463641
PMCID: PMC12729448
DOI: 10.3390/bioengineering12121345

Wing-Keen Yap et al. Bioengineering (Basel). 2025 Dec 10;12(12):1345. doi: 10.3390/bioengineering12121345.

Abstract

Background: Laryngeal and hypopharyngeal cancers (LHCs) exhibit heterogeneous outcomes after definitive radiotherapy (RT). Large language models (LLMs) may enhance prognostic stratification by integrating complex clinical and imaging data. This study validated two pre-trained LLMs, GPT-4o-2024-08-06 and Gemma-2-27b-it, for outcome prediction in LHC.

Methods: Ninety-two patients with non-metastatic LHC treated with definitive (chemo)radiotherapy at Linkou Chang Gung Memorial Hospital (2006–2013) were retrospectively analyzed. First-order and 3D radiomic features were extracted from intra- and peritumoral regions on pre- and mid-RT CT scans. LLMs were prompted with clinical variables, radiotherapy notes, and radiomic features to classify patients as high- or low-risk for death, recurrence, and distant metastasis. Model performance was assessed using sensitivity, specificity, AUC, Kaplan–Meier survival analysis, and McNemar tests.

Results: Integration of radiomic features significantly improved prognostic discrimination over clinical/RT plan data alone for both LLMs. For death prediction, pre-RT radiomics were the most predictive: GPT-4o achieved a peak AUC of 0.730 using intratumoral features, while Gemma-2-27b reached 0.736 using peritumoral features. For recurrence prediction, mid-RT peritumoral features yielded optimal performance (AUC = 0.703 for GPT-4o; AUC = 0.709 for Gemma-2-27b). Kaplan–Meier analyses showed statistically significant separation of risk groups with pre-RT intra- and peritumoral features for overall survival (p < 0.05 for both GPT-4o and Gemma-2-27b) and with mid-RT peritumoral features for recurrence-free survival (p = 0.028 for GPT-4o; p = 0.017 for Gemma-2-27b). McNemar tests revealed no significant performance difference between the two LLMs when augmented with radiomics (all p > 0.05), indicating that the open-source model achieved accuracy comparable to that of its proprietary counterpart. Both models generated clinically coherent, patient-specific rationales explaining risk assignments, enhancing interpretability and clinical trust.

Conclusions: This external validation demonstrates that pre-trained LLMs can serve as accurate, interpretable, and multimodal prognostic engines for LHC. Pre-RT radiomic features are critical for predicting mortality and metastasis, while mid-RT peritumoral features uniquely inform recurrence risk. The comparable performance of the open-source Gemma-2-27b-it model suggests a scalable, cost-effective, and privacy-preserving pathway for integrating LLM-based tools into precision radiation oncology workflows to enhance risk stratification and therapeutic personalization.
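
As an illustration of the performance metrics named in the Methods, the following is a minimal sketch of how sensitivity, specificity, and AUC could be computed from binary high-/low-risk labels. The abstract does not state how the AUC was derived; the sketch assumes the model output is a hard binary label, in which case the ROC curve has a single operating point and the AUC reduces to the mean of sensitivity and specificity. The function name `binary_metrics` and the example labels are hypothetical and are not data from the study.

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, and AUC for hard binary predictions.

    y_true: 1 if the event (e.g. death) occurred, else 0.
    y_pred: 1 if the patient was classified high-risk, else 0.
    Assumption: with a single operating point, AUC = (sens + spec) / 2.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "auc": (sensitivity + specificity) / 2,
    }


# Hypothetical labels for eight patients (not study data).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

If the models had instead returned continuous risk scores, a rank-based AUC would be the more usual choice.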

Keywords: hypopharyngeal cancer; large language models; laryngeal cancer; prognosis; radiomics; survival prediction.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

**Figure 1**
The McNemar test assessed the significance of differences between paired models for overall survival prediction using AUROC scores. Cells with p < 0.05 are highlighted in red.

**Figure 2**
The McNemar test assessed the significance of differences between paired models for recurrence prediction using AUROC scores. Cells with p < 0.05 are highlighted in red.

**Figure 3**
The McNemar test assessed the significance of differences between paired models for distant metastasis prediction using AUROC scores. Cells with p < 0.05 are highlighted in red.
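
The paired comparisons summarized in Figures 1 to 3 can be illustrated with a minimal sketch of an exact (binomial) McNemar test on paired per-patient correctness. Whether the study used the exact or the chi-square form is not stated, so the exact form here is an assumption; the function name `mcnemar_exact` and the example vectors are hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test for two models scored on the same patients.

    correct_a, correct_b: per-patient flags, 1 if the model's risk label
    matched the observed outcome, else 0.
    Returns (b, c, p), where b and c count the discordant pairs.
    """
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if not a and bb)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided p-value.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical correctness flags for ten patients (not study data).
model_a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
model_b = [1, 0, 1, 1, 1, 0, 0, 1, 1, 1]
print(mcnemar_exact(model_a, model_b))
```

Only the discordant pairs contribute to the test, so two models with the same errors on the same patients are indistinguishable regardless of their overall accuracy.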

**Figure 4**
Kaplan–Meier survival curves for overall survival stratified by the GPT-4o model: (A) with pre-RT intratumoral features, (B) with pre-RT peritumoral features, (C) with mid-RT intratumoral features, and (D) with mid-RT peritumoral features.

**Figure 5**
Kaplan–Meier survival curves for recurrence stratified by the GPT-4o model: (A) with pre-RT intratumoral features, (B) with pre-RT peritumoral features, (C) with mid-RT intratumoral features, and (D) with mid-RT peritumoral features.

**Figure 6**
Kaplan–Meier survival curves for distant metastasis stratified by the GPT-4o model: (A) with pre-RT intratumoral features, (B) with pre-RT peritumoral features, (C) with mid-RT intratumoral features, and (D) with mid-RT peritumoral features.

**Figure 7**
Kaplan–Meier survival curves for overall survival stratified by the Gemma-2-27b model: (A) with pre-RT intratumoral features, (B) with pre-RT peritumoral features, (C) with mid-RT intratumoral features, and (D) with mid-RT peritumoral features.

**Figure 8**
Kaplan–Meier survival curves for recurrence stratified by the Gemma-2-27b model: (A) with pre-RT intratumoral features, (B) with pre-RT peritumoral features, (C) with mid-RT intratumoral features, and (D) with mid-RT peritumoral features.

**Figure 9**
Kaplan–Meier survival curves for distant metastasis stratified by the Gemma-2-27b model: (A) with pre-RT intratumoral features, (B) with pre-RT peritumoral features, (C) with mid-RT intratumoral features, and (D) with mid-RT peritumoral features.
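
The curves in Figures 4 to 9 rest on the Kaplan–Meier product-limit estimator, which the following minimal sketch computes for one risk group. It covers only the survival curve; the test used to obtain the reported p-values is not named in the abstract and is not implemented here. The function name `kaplan_meier` and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate for one group.

    times:  follow-up time per patient.
    events: 1 if the event was observed at that time, 0 if censored.
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        # Events and censored patients both leave the risk set.
        n_at_risk -= n_removed
    return curve


# Hypothetical follow-up in months for five patients (not study data).
times = [5, 8, 8, 12, 20]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Running the estimator separately on the high-risk and low-risk groups gives the two curves that each panel compares.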
