BMC Med Imaging. 2025 Dec 23;26(1):50.
doi: 10.1186/s12880-025-02092-3.

LLM-powered TNM staging of neuroendocrine tumors from PET/CT reports

Markus Mergen et al. BMC Med Imaging.

Abstract

Purpose: Imaging reports are essential for the diagnostic evaluation, treatment planning, and follow-up of patients with neuroendocrine tumors (NETs) of the gastroenteropancreatic (GEP) system. The tumor-node-metastasis (TNM) classification is a common model for estimating the prognosis of tumor patients. However, the traditional free-text format of imaging reports varies in structure, detail, and clarity, leading to inconsistencies and potential omissions of critical information necessary for optimal patient management. Recent advances in large language models (LLMs) have created new opportunities for automating complex medical assessments, including the extraction of UICC and ENETS staging classifications from imaging reports. Such automation aims to improve standardization, enhance clarity, and ensure consistency, ultimately facilitating more effective multidisciplinary clinical decision-making. This study evaluates whether LLMs can infer the UICC and ENETS TNM stage of GEP-NETs from free-text PET/CT reports that contain descriptive findings only (no explicit TNM labels).

Methods: We evaluated several models, including ChatGPT-4o, DeepSeek V3, Claude 3.5 Sonnet, and Gemini 2.0 Flash, on a physician-generated fictitious dataset of 108 PET/CT reports with expert-annotated TNM classifications according to UICC and ENETS criteria. Model performance was assessed through F1-scores, comparing LLM-generated classifications against human expert benchmarks.
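Comparing LLM-generated classifications against expert annotations requires turning each model response into discrete T, N, and M labels. The abstract does not describe the prompt or the output format, so the following is only a minimal sketch under the assumption that the model answers in free text containing tokens such as "T2", "N1", and "M0". The function name `parse_tnm`, the accepted category ranges, and the example response are hypothetical and not taken from the study.

```python
import re

# Assumed label shapes (hypothetical): T0-T4/TX, N0-N3/NX, M0/M1/MX,
# each optionally followed by a subcategory letter such as "M1a".
_PATTERNS = {
    "T": re.compile(r"\bT([0-4X])[a-c]?\b"),
    "N": re.compile(r"\bN([0-3X])[a-c]?\b"),
    "M": re.compile(r"\bM([01X])[a-c]?\b"),
}


def parse_tnm(response: str) -> dict:
    """Return the first T, N, and M label found in a model response.

    A component that cannot be found is reported as None so that it can be
    counted as a miss instead of being silently dropped.
    """
    result = {}
    for component, pattern in _PATTERNS.items():
        match = pattern.search(response)
        result[component] = f"{component}{match.group(1)}" if match else None
    return result


# Hypothetical model response, not data from the study
labels = parse_tnm("The tumor is T3; nodes N0; M1a.")
```

This sketch is case-sensitive and expects the three labels to be separated by whitespace or punctuation; a compact form such as "T2N1M0" would need a different pattern.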

Results: Among the tested models, ChatGPT-4o demonstrated the highest accuracy, achieving micro-F1 scores of 0.79, 0.99, and 0.99 for T, N, and M, respectively, according to UICC, and 0.84, 1.00, and 0.99, respectively, according to ENETS. These results indicate that LLMs have the potential to assist in oncologic staging of NETs, especially by supporting non-specialists in clinical decision-making. However, before integration into routine practice, further prospective validation and rigorous evaluation in real-world settings are necessary.
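For readers unfamiliar with the metric, the micro-averaged F1 score reported above pools true positives, false positives, and false negatives across all classes before computing F1. A minimal sketch follows, assuming each report receives exactly one predicted label per component (T, N, or M); in that single-label setting micro-F1 equals plain accuracy. The function name `micro_f1` and the example labels are hypothetical and are not data from the study.

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label multiclass predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1  # correct label: true positive for that class
        else:
            fp[pred] += 1   # wrongly predicted class gains a false positive
            fn[truth] += 1  # missed class gains a false negative
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    denominator = 2 * total_tp + total_fp + total_fn
    return 2 * total_tp / denominator if denominator else 0.0


# Hypothetical expert and model labels, not data from the study
expert = ["T1", "T2", "T2", "T3", "T4"]
model = ["T1", "T2", "T3", "T3", "T4"]
score = micro_f1(expert, model)  # 4 of 5 labels agree
```

Because every misclassification adds one false positive and one false negative, the pooled precision and recall coincide, which is why the score here matches the fraction of correctly staged reports.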

Conclusion: This study underscores the promise of LLMs in oncologic workflows while highlighting the importance of robust benchmarking and clinical validation.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12880-025-02092-3.

Keywords: Clinical decision support; Large language models; Neuroendocrine tumors; PET/CT; TNM staging.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: This retrospective study was approved by the Ethics Committee of the Technical University of Munich (Approval ID: 2024-590-S-CB). The requirement for individual informed consent was waived by the ethics committee due to the retrospective design and use of fully anonymized clinical data. The study was conducted in accordance with the Declaration of Helsinki and institutional guidelines. Consent to publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
(A) Schematic representation of the workflow for classifying NET UICC and ENETS TNM stage from PET/CT reports. The process includes PET/CT report generation, LLM-based classification, and validation against ground truth. (B) Radar chart illustrating the hit rate for UICC TNM classification by ChatGPT-4o, DeepSeek V3, Claude 3.5 Sonnet, and Gemini 2.0 Flash. Key findings include ChatGPT-4o's high F1 score (0.79) for T classification. (C) Heatmap depicting the differences in performance for classifying T, N, and M stage across the models used.
Fig. 2
Confusion matrices for ChatGPT-4o for the key attributes of interest. (A) UICC T stage, (B) UICC N stage and (C) UICC M stage
Fig. 3
Radar chart illustrating the hit rate for ENETS TNM classification by ChatGPT-4o, DeepSeek V3, Claude 3.5 Sonnet, and Gemini 2.0 Flash
Fig. 4
Confusion matrices for ChatGPT-4o for the key attributes of interest. (A) ENETS T stage, (B) ENETS N stage, (C) ENETS M stage
