NPJ Digit Med. 2025 Mar 1;8(1):134.
doi: 10.1038/s41746-025-01528-y.

Developing a named entity framework for thyroid cancer staging and risk level classification using large language models

Matrix M H Fung et al. NPJ Digit Med.

Abstract

We developed a named entity (NE) framework for information extraction from semi-structured clinical notes retrieved from The Cancer Genome Atlas-Thyroid Cancer (TCGA-THCA) database and examined large language model (LLM) strategies for classifying the American Joint Committee on Cancer (AJCC) 8th edition stage and the American Thyroid Association (ATA) risk category of patients with well-differentiated thyroid cancer. The NE framework consisted of annotation guideline development, ground truth labelling, prompting approaches, and evaluation code. Four LLMs (Mistral-7B-Instruct, Llama-3.1-8B-Instruct, Gemma-2-9B-Instruct, and Qwen2.5-7B-Instruct) were run offline for information extraction, and their outputs were compared with expert-curated ground truth. Our framework was developed using 50 TCGA-THCA pathology notes; 289 TCGA-THCA notes and 35 pseudo-clinical cases were used for validation. An ensemble-like majority-vote strategy achieved satisfactory performance for AJCC and ATA classification in both the development and validation sets. Our framework and ensemble classifier optimised the efficiency and accuracy of classifying stage and risk category in thyroid cancer patients.
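The ensemble-like majority vote mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not describe how the authors implemented the vote, so everything here is an assumption for illustration: the function name `majority_vote`, the tie-handling rule (ties fall back to an "undetermined" label), and the example per-model outputs are all hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(labels: Sequence[Optional[str]],
                  fallback: str = "undetermined") -> str:
    """Return the most frequent non-missing label.

    Assumption: a tie for first place, or no usable votes at all,
    resolves to `fallback`. The source does not state a tie rule.
    """
    votes = Counter(label for label in labels if label is not None)
    if not votes:
        return fallback
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return fallback  # top two labels have equal support
    return ranked[0][0]


# Hypothetical AJCC stage outputs from the four models for one note.
model_outputs = {
    "Mistral-7B-Instruct": "Stage I",
    "Llama-3.1-8B-Instruct": "Stage I",
    "Gemma-2-9B-Instruct": "Stage II",
    "Qwen2.5-7B-Instruct": "Stage I",
}

ensemble_stage = majority_vote(list(model_outputs.values()))
print(ensemble_stage)
```

With an even number of voters, ties are possible, so an explicit fallback (for example, routing the case to manual review) is one reasonable design choice; other tie rules would work equally well in this sketch.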

Conflict of interest statement

Competing interests: Z.W. is contributing to npj Digital Medicine as an Associate Editor and Guest Editor for the Collection on Natural Language Processing in Clinical Medicine. Other authors declared no competing interests.

Figures

Fig. 1. Flowchart of patient selection process.
Flowchart depicting patient selection and the data sources used as the development and validation sets. Cancer stages and ATA risks of all TCGA-THCA patients and pseudo cases were verified by endocrine surgeons. One pseudo case of non-invasive follicular thyroid neoplasm with papillary-like nuclear features was not graded for AJCC stage or ATA risk.
Fig. 2. Flow of data extraction using LLMs and classifying ATA risk and AJCC staging from the LLM output.
Schematic diagram depicting the flow of data extraction using LLMs and the use of a self-developed Microsoft Excel template for data cleaning and classification.
Fig. 3. Heatmap of performance of Large Language Models on classification of ATA risks and AJCC staging in 50 TCGA pathology reports for NE framework development.
LLMs with various prompting strategies attained satisfactory performance in NE framework development. a Performance on ATA risk classification with F1-scores of 88.0–100.0%. b Performance on AJCC staging with F1-scores of 90.3–100.0%.
Fig. 4. Heatmap of performance of ensemble classifiers on classification of ATA risks and AJCC staging in the development and validation sets.
Ensemble classifiers attained satisfactory performance on the two datasets. a Performance on ATA risk classification with F1-scores of at least 88.5%. b Performance on AJCC staging with F1-scores of at least 90.4%.
Fig. 5. Heatmap of performance of Large Language Models on classification of ATA risks and AJCC staging in 289 TCGA pathology reports for validation.
LLMs with various prompting strategies attained satisfactory performance in 289 TCGA pathology reports for validation. a Performance on ATA risk classification with F1-scores of 88.5–96.5%. b Performance on AJCC staging with F1-scores of 94.2–99.7%.
Fig. 6. Heatmap of performance of Large Language Models on classification of ATA risks and AJCC staging in 35 pseudo cases for validation.
Performance in the 35 pseudo cases for validation varied across prompting approaches and between individual LLMs. a Performance on ATA risk classification. Mistral-7B-Instruct-v0.3 outperformed the other LLMs with an F1-score of 94.3%. b Performance on AJCC staging. Llama-3.1-8B-Instruct outperformed the other LLMs with an F1-score of 97.5%.
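The F1-scores reported in the figure captions summarise agreement between predicted and ground-truth classes. As a rough illustration only, the sketch below computes a macro-averaged F1 for a multi-class label such as ATA risk. The captions do not state which averaging scheme the authors used, so macro averaging is an assumption here, and the function names and toy labels are hypothetical.

```python
from typing import Sequence


def f1_for_label(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy ATA risk labels, invented for illustration; not data from the study.
truth = ["low", "low", "intermediate", "high"]
predicted = ["low", "intermediate", "intermediate", "high"]

print(f"macro F1 = {macro_f1(truth, predicted):.3f}")
```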
