Improving the Precision of Deep-Learning-Based Head and Neck Target Auto-Segmentation by Leveraging Radiology Reports Using a Large Language Model

Libing Zhu¹, Jean-Claude M Rwigema¹, Xue Feng², Bilaal Ansari¹,³, Jingwei Duan⁴, Yi Rong¹, Quan Chen¹

Affiliations

¹ Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ 85058, USA.
² Carina Medical LLC., Lexington, KY 40513, USA.
³ Department of Physics, Lake Forest College, Lake Forest, IL 60045, USA.
⁴ Department of Radiation Oncology, The University of Alabama at Birmingham, Birmingham, AL 35233, USA.

PMID: 40563585
PMCID: PMC12191202
DOI: 10.3390/cancers17121935

Libing Zhu et al. Cancers (Basel). 2025 Jun 10;17(12):1935. doi: 10.3390/cancers17121935.

Abstract

Background/Objectives: Accurate delineation of primary tumors (GTVp) and metastatic lymph nodes (GTVn) in head and neck (HN) cancers is essential for effective radiation treatment planning, yet it remains a challenging and laborious task. This study aims to develop a deep-learning-based auto-segmentation (DLAS) model trained on external datasets, with false-positive elimination guided by clinical diagnosis reports.

Methods: The DLAS model was trained on a multi-institutional public dataset of 882 cases. Forty-four institutional cases were randomly selected as the external testing dataset. DLAS-generated GTVp and GTVn contours were validated against clinical diagnosis reports to identify false-positive and false-negative segmentation errors using two large language models: ChatGPT-4 and Llama-3. False positives were ruled out by matching the centroids of AI-generated contours with the slice locations or anatomical regions described in the reports. Performance was evaluated using the Dice similarity coefficient (DSC), the 95th percentile Hausdorff distance (HD95), and tumor detection precision.

Results: ChatGPT-4 outperformed Llama-3 in accurately extracting tumor locations from the diagnostic reports. False-positive contours were identified in 15 of the 44 cases. After the ruling-out process, the DSC_mean of the DLAS contours increased from 0.68 to 0.75 for GTVp and from 0.69 to 0.75 for GTVn. Notably, the average HD95 for GTVn decreased from 18.81 mm to 5.2 mm. After ruling out, the model achieved 100% precision for GTVp and GTVn when compared with physician-determined contours.

Conclusions: The false-positive ruling-out approach based on diagnostic reports effectively enhances the precision of DLAS in the HN region. The model accurately identifies the tumor location and detects all false-negative errors.
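
The centroid-matching step described in the Methods can be illustrated with a minimal sketch. It assumes each DLAS contour is available as a list of (x, y, z) voxel indices and that the location extracted from a report is either an inclusive axial slice range or an axis-aligned bounding box. The function names, data layout, and example values are hypothetical and are not taken from the paper.

```python
from statistics import mean

def centroid(voxels):
    """Return the (x, y, z) centroid of a contour given as voxel index tuples."""
    xs, ys, zs = zip(*voxels)
    return (mean(xs), mean(ys), mean(zs))

def in_slice_range(point, slice_range):
    """True if the point's axial (z) index lies inside an inclusive slice range."""
    lo, hi = slice_range
    return lo <= point[2] <= hi

def in_box(point, box):
    """True if the point lies inside an axis-aligned box (lower corner, upper corner)."""
    lower, upper = box
    return all(l <= p <= u for p, l, u in zip(point, lower, upper))

def rule_out_false_positives(contours, slice_ranges=(), boxes=()):
    """Split contour names into (kept, rejected) by centroid match.

    A contour is kept when its centroid falls in any report-derived
    slice range or bounding box; otherwise it is flagged as a
    candidate false positive. This rule is an assumption for illustration.
    """
    kept, rejected = [], []
    for name, voxels in contours.items():
        c = centroid(voxels)
        matched = (any(in_slice_range(c, r) for r in slice_ranges)
                   or any(in_box(c, b) for b in boxes))
        (kept if matched else rejected).append(name)
    return kept, rejected

# Hypothetical example: the report places the node on slices 38 to 45.
contours = {
    "node_A": [(10, 10, 40), (12, 10, 42), (11, 12, 41)],
    "node_B": [(50, 60, 90), (52, 60, 92)],
}
kept, rejected = rule_out_false_positives(contours, slice_ranges=[(38, 45)])
print(kept, rejected)
```

In this toy case only `node_A` has a centroid inside the reported slice range, so `node_B` is flagged for removal. A real pipeline would also need to map report slice numbers and anatomical regions into the image coordinate system, which this sketch does not attempt.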

Keywords: GTV; auto-segmentation; clinical diagnosis report; head and neck; large language model.

Conflict of interest statement

Quan Chen and Xue Feng are co-founders of Carina Medical LLC. Quan Chen received a National Institutes of Health Small Business Innovation Research subcontract from Carina Medical LLC. (NIH R44CA25844). All other authors have no conflicts of interest to declare.

Figures

**Figure 1**
Workflow for ruling out FP contours with detailed tumor location from the diagnosis report.

**Figure 2**
Workflow for ruling out FP contours with GTVn LN group description in diagnosis report.

**Figure 3**
Bounding box creation from the anatomic region in the diagnosis reports. (Blue cube indicates the bounding box generated from anatomic descriptions provided in diagnostic reports).

**Figure 4**
Auto-segmentation results of 44 cases for GTVp and GTVn: (a) DSC; (b) 95th percentile Hausdorff distance (HD95). The mean value is represented by Δ and outliers by ×; boxes span the 25th to 75th percentiles. Comparison of DLAS contours (GTVn) before and after the ruling-out process in terms of (c) DSC and (d) HD95 for false-positive cases (*: p-value < 0.05).

**Figure 5**
Example cases of false-positive errors for GTVp (a) and GTVn (b), and false-negative errors for GTVn with small SUV values (c–i). Golden contours indicate DLAS nodes; light blue contours indicate the RadOnc manual contours.
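
To make the DSC and precision comparisons in Figures 4 and 5 concrete, the following minimal sketch computes both metrics on toy data. It assumes masks are represented as sets of voxel index tuples; the function names and the example voxels are hypothetical, and the numbers it produces are not results from the paper.

```python
def dice(pred, truth):
    """Dice similarity coefficient between two collections of voxel indices."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # convention assumed here: two empty masks agree
    return 2 * len(pred & truth) / (len(pred) + len(truth))

def detection_precision(true_positives, false_positives):
    """Fraction of detected lesions that are real: TP / (TP + FP)."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0

# Toy ground truth: a three-voxel node.
truth = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}

# Before rule-out: the true node plus a separate false-positive blob.
pred_before = truth | {(9, 9, 9), (9, 9, 8), (9, 8, 9)}

# After rule-out: the false-positive blob has been removed.
pred_after = set(truth)

print(dice(pred_before, truth), dice(pred_after, truth))
print(detection_precision(1, 1), detection_precision(1, 0))
```

Removing the spurious blob raises the toy DSC because the predicted volume shrinks while the overlap stays the same, which is the same mechanism by which ruling out false-positive contours can improve DSC and precision.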
