iScience. 2025 Nov 26;28(12):114231. doi: 10.1016/j.isci.2025.114231. eCollection 2025 Dec 19.

A deep learning fusion network trained with medical records and laryngoscopic images in the early diagnosis of glottic carcinoma

Yi Shuai et al. iScience.

Abstract

Early diagnosis of glottic carcinoma is crucial for improving patients' therapeutic outcomes. This study aims to develop a deep learning fusion network, the vision large language model based multimodal fusion network (VLMN), that integrates structured medical records with laryngoscopic images to enable early and accurate diagnosis of glottic carcinoma. The model was trained and validated on data from a tertiary hospital in China, and external validation was subsequently conducted at two additional independent medical centers. Monomodal reference models were also developed for comparative analysis. To benchmark clinical utility, a human-machine adversarial cohort was constructed to enable direct performance comparisons between the model and human raters. Diagnostic accuracy was quantified using the area under the receiver operating characteristic curve (AUC). The VLMN outperformed the monomodal models and performed comparably to senior otolaryngologists. It therefore holds significant potential to reduce diagnostic delays and improve patient prognosis, particularly for junior otolaryngologists and in medically underserved areas.
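As a minimal illustration of the evaluation metric described above (not the authors' code), AUC can be computed from per-patient predicted probabilities with scikit-learn; the labels and probability arrays below are hypothetical placeholders:

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = glottic carcinoma) and model output probabilities
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
p_fusion = np.array([0.20, 0.90, 0.80, 0.30, 0.70, 0.10, 0.95, 0.60])
p_image = np.array([0.40, 0.70, 0.60, 0.50, 0.55, 0.30, 0.80, 0.45])

# Compare the multimodal model against a monomodal baseline, as in the study
print("fusion AUC:", roc_auc_score(y_true, p_fusion))
print("image-only AUC:", roc_auc_score(y_true, p_image))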

Keywords: Artificial intelligence; Oncology.


Conflict of interest statement

The authors XM.F., ZH.J., Y.S., Y.L., WB.L., and WQ.C. report a pending patent (No. 2025103725379).

Figures

Graphical abstract
Figure 1
Recruitment flowchart for patients in this study. FAHSYSU, the First Affiliated Hospital of Sun Yat-sen University; FPHFS, the First People’s Hospital of Foshan; FPHZQ, the First People’s Hospital of Zhaoqing. ᵃThe training cohort included advanced glottic carcinoma cases with T3 and T4 stages.
Figure 2
ROC curves of the text-based monomodal model, image-based monomodal model, VLMN model, and human raters in the evaluation cohorts. (A) ROC curves of the text-based monomodal model, image-based monomodal model, and VLMN model in the internal test cohorts. (B) ROC curves of the text-based monomodal model, image-based monomodal model, and VLMN model in the external test cohorts. (C) ROC curves of the text-based monomodal model, image-based monomodal model, VLMN model, and human raters in the human-machine adversarial cohort. (D) ROC curves of the monomodal models, VLMN model, human raters alone, and human raters with VLMN in the human-machine adversarial cohort. ROC, receiver operating characteristic; AUC, area under the ROC curve; VLMN, vision large language model based multimodal fusion network model.
Figure 3
Confusion matrices of the DL models and human raters in the evaluation cohorts. (A) Confusion matrices of the text-based monomodal model, image-based monomodal model, and VLMN model in the internal test cohorts. (B) Confusion matrices of the text-based monomodal model, image-based monomodal model, and VLMN model in the external test cohorts. (C) Confusion matrices of the VLMN model and human raters in the human-machine adversarial cohort. (D) Confusion matrices of the VLMN model and human raters with VLMN in the human-machine adversarial cohort. VLMN, vision large language model based multimodal fusion network model.
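A minimal sketch of how such confusion matrices are derived from thresholded probabilities (hypothetical arrays and an assumed 0.5 cutoff, not the authors' code):

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])       # hypothetical labels
p_fusion = np.array([0.20, 0.90, 0.80, 0.30, 0.70, 0.10, 0.95, 0.60])

# Rows: true class (benign, carcinoma); columns: predicted class
cm = confusion_matrix(y_true, (p_fusion >= 0.5).astype(int))
print(cm)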
Figure 4
DCA curves of the VLMN model, image-based monomodal model, and text-based monomodal model in the test cohorts. (A) DCA curves of the VLMN model, image-based monomodal model, and text-based monomodal model in the internal test cohorts. (B) DCA curves of the VLMN model, image-based monomodal model, and text-based monomodal model in the external test cohorts. DCA, decision curve analysis; VLMN, vision large language model based multimodal fusion network model.
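DCA plots net benefit against the threshold probability pt, where net benefit has the standard closed form TP/n - (FP/n) * pt/(1 - pt). A minimal sketch with hypothetical data (not the authors' code):

import numpy as np

def net_benefit(y_true, p_model, pt):
    # Net benefit at threshold probability pt: TP/n - (FP/n) * pt/(1 - pt)
    pred = p_model >= pt
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    n = len(y_true)
    return tp / n - (fp / n) * (pt / (1 - pt))

# Sweep thresholds to trace one DCA curve for one model
y = np.array([0, 1, 1, 0, 1, 0, 1, 1])
p = np.array([0.20, 0.90, 0.80, 0.30, 0.70, 0.10, 0.95, 0.60])
curve = [net_benefit(y, p, t) for t in np.arange(0.05, 0.95, 0.05)]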
Figure 5
Representative heat maps of the VLMN model’s analysis of laryngoscopic images and the sentence-level reports extracted from medical record text. The colors reflect how strongly the VLMN model attends to each element. For the laryngoscopic images, a redder color on the heat map indicates a greater contribution of that region’s features to the model’s prediction. The sentence-level reports are standardized reports extracted from medical record text using prompts based on the Llama large language model; a darker blue on the heat map indicates closer attention of the model. (A) Heat maps of the laryngoscopic image and the sentence-level report from a patient diagnosed with vocal cord dysplasia. (B) Heat maps of the laryngoscopic image and the sentence-level report from a patient diagnosed with T1-stage glottic carcinoma. (C) Heat maps of the laryngoscopic image and the sentence-level report from a patient diagnosed with T2-stage glottic carcinoma. VLMN, vision large language model based multimodal fusion network model.
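The caption does not state which saliency method produced the image heat maps; the sketch below shows one common way to obtain such a map from a ViT image branch (a Grad-CAM-style token relevance). The timm backbone, the random input tensor, and the choice of the carcinoma logit are all assumptions for illustration:

import timm
import torch

# Hypothetical stand-ins: an untrained ViT classifier and a dummy image
model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=2)
image = torch.randn(1, 3, 224, 224)

activations, gradients = {}, {}

def fwd_hook(module, inputs, output):
    activations["feat"] = output.detach()          # token features (B, N, D)

def bwd_hook(module, grad_input, grad_output):
    gradients["feat"] = grad_output[0].detach()    # gradients w.r.t. those features

# Hook the last transformer block
block = model.blocks[-1]
block.register_forward_hook(fwd_hook)
block.register_full_backward_hook(bwd_hook)

logits = model(image)
logits[0, 1].backward()                            # backprop the carcinoma logit

# Weight each token by its channel-averaged gradients, drop the CLS token,
# and reshape to the 14x14 patch grid of a 224-pixel, patch-16 ViT
w = gradients["feat"].mean(dim=1, keepdim=True)
cam = (w * activations["feat"]).sum(dim=-1)[:, 1:]
heatmap = cam.relu().reshape(-1, 14, 14)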
Figure 6
Overview of the study design. FAHSYSU, the First Affiliated Hospital of Sun Yat-sen University; FPHFS, the First People’s Hospital of Foshan; FPHZQ, the First People’s Hospital of Zhaoqing; LLaMA, Large Language Model Meta AI; VLMN, vision large language model based multimodal fusion network model; ROC, receiver operating characteristic.
Figure 7
Detailed workflow for the data processing and the development of the VLMN model. (A) The vision transformer (ViT) model for extracting features from the laryngoscopic images. (B) The module for generating sentence-level reports and extracting features from clinical reports. (C) The module for fusing the multimodal features. (D) The module for classifying the fused features.
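A minimal PyTorch sketch of the late-fusion pattern that Figure 7 describes; the feature dimensions, linear projections, and concatenation-based fusion are assumptions for illustration, not the authors' implementation:

import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, img_dim=768, txt_dim=4096, hidden=512, n_classes=2):
        super().__init__()
        # (A)/(B): features arrive pre-extracted from a ViT image encoder
        # and an LLM text encoder (e.g., Llama hidden states); dims assumed.
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        # (C): fuse the projected modalities by concatenation
        self.fuse = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        # (D): classify the fused representation
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, img_feat, txt_feat):
        z = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.head(self.fuse(z))

# Usage with dummy features for a batch of 4 cases
model = FusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 4096))  # shape (4, 2)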
