JCO Clin Cancer Inform. 2025 Jul:9:e2400279.
doi: 10.1200/CCI-24-00279. Epub 2025 Jul 23.

Automatic Abstraction of Computed Tomography Imaging Indication Using Natural Language Processing for Evaluation of Surveillance Patterns in Long-Term Lung Cancer Survivors

Aparajita Khan et al. JCO Clin Cancer Inform. 2025 Jul.

Abstract

Purpose: Despite its routine use to monitor patients with lung cancer (LC), real-world evaluations of the impact of computed tomography (CT) surveillance on overall survival (OS) have been inconsistent. A major confounder is the absence of imaging indications because patients undergo CT scans for purposes beyond surveillance, like symptom evaluations (eg, cough) linked to poor survival. We propose a novel natural language processing model to predict CT imaging indications (surveillance v others).

Methods: We used electronic health records of 585 long-term LC survivors (≥5 years) at Stanford, followed for up to 22 years. Their 3,362 post-5-year CT reports (including 1,672 manually annotated) were used for modeling by integrating structured variables (eg, CT intervals) with key-phrase analysis of radiology reports. Naïve analysis compared OS in patients with CT for any indications (including symptoms) versus those without post-5-year CT, as in previous studies. Using model-predicted indications, we conducted exploratory analyses to compare OS between those with post-5-year surveillance CT and those without.
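The hybrid design described above, structured variables combined with key-phrase counts from the report text in a logistic regression, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the phrase list, the intercept, and the helper names (`SURVEILLANCE_PHRASES`, `count_key_phrases`, `predict_surveillance_probability`) are hypothetical, and the model is reduced to two features. Only the two odds ratios are taken from the reported results, and treating the key-phrase odds ratio as a per-occurrence effect is an assumption.

```python
import math
import re

# Hypothetical key phrases for illustration; the study's actual phrase list
# is described in its Data Supplement and is not reproduced here.
SURVEILLANCE_PHRASES = ["surveillance", "follow-up", "no evidence of recurrence"]

# Logistic regression coefficients are log odds ratios.
# The two ORs below are the ones reported in the abstract; the intercept is
# an assumed placeholder, and the real model includes more features.
COEF_LONG_INTERVAL = math.log(5.50)  # >=6 months since the previous CT
COEF_KEY_PHRASE = math.log(1.37)     # assumed: per key-phrase occurrence
INTERCEPT = -1.0                     # assumption, for illustration only


def count_key_phrases(report_text, phrases=SURVEILLANCE_PHRASES):
    """Count case-insensitive occurrences of each key phrase in a report."""
    text = report_text.lower()
    return sum(len(re.findall(re.escape(p), text)) for p in phrases)


def predict_surveillance_probability(report_text, months_since_previous_ct):
    """Combine one structured feature and one text feature in a logistic model."""
    long_interval = 1 if months_since_previous_ct >= 6 else 0
    n_phrases = count_key_phrases(report_text)
    logit = (INTERCEPT
             + COEF_LONG_INTERVAL * long_interval
             + COEF_KEY_PHRASE * n_phrases)
    return 1.0 / (1.0 + math.exp(-logit))
```

Under these assumed values, a report containing surveillance phrases and obtained 12 months after the previous CT receives a higher predicted probability of being a surveillance scan than a report with no such phrases obtained 2 months after the previous CT.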

Results: The model showed high discrimination (AUC, 0.86), with key predictors including a longer interval (≥6 months) from the previous CT (odds ratio [OR], 5.50; P < .001) and surveillance-related key phrases (OR, 1.37; P = .03). Propensity-adjusted survival analysis indicated better OS for patients with any post-5-year surveillance CT versus those without (adjusted hazard ratio, 0.60; P = .016). By contrast, no significant survival difference was found (P = .53) between patients with any CT versus those without post-5-year CT.
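For readers less familiar with the discrimination metric, the AUC can be read as the probability that a randomly chosen surveillance scan receives a higher model score than a randomly chosen scan done for another indication. The sketch below computes it by direct pairwise comparison. The function name `auc` and the toy scores are illustrative assumptions and are not study data.

```python
def auc(scores_pos, scores_neg):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    scores_pos: model scores for true surveillance scans.
    scores_neg: model scores for scans with other indications.
    Ties between a positive and a negative count as half a win.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy scores, invented for illustration only.
toy_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

In the toy data, eight of the nine positive-negative pairs are ranked correctly, so `toy_auc` is 8/9. A value of 0.5 would correspond to chance-level ranking and 1.0 to perfect separation.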

Conclusion: Our model abstracted CT indications from real-world data with high discrimination. Exploratory analyses revealed the obscured imaging-OS association when considering indications, highlighting the model's potential for future real-world studies.

Conflict of interest statement

The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to www.asco.org/rwc or ascopubs.org/cci/author-center.

Open Payments is a public database containing information reported by companies about payments made to US-licensed physicians.

Chloe Su

Employment: Genentech

Stock and Other Ownership Interests: 10× Genomics, Teladoc, Moderna Therapeutics

Consulting or Advisory Role: Revamp Medical

Allison W. Kurian

Other Relationship: Ambry Genetics, Color Genomics, GeneDx/BioReference, InVitae, Genentech, Myriad Genetics, Adela, Merck, Gilead Sciences, Foundation Medicine

Uncompensated Relationships: JScreen, Primum, Roon

Joel Neal

Stock and Other Ownership Interests: SecondLook Health

Honoraria: Research to Practice, Medscape, Projects in Knowledge, MJH Life Sciences, Medical Educator Consortium, PlatformQ/Medlive CME, IDEOlogy Health, Navya Network Inc

Consulting or Advisory Role: AstraZeneca, Genentech/Roche, Takeda, Lilly, Amgen, Iovance Biotherapeutics, Blueprint Medicines, Regeneron, Natera, Gilead Sciences, AbbVie, Summit Therapeutics, Novartis, Novocure, Janssen Oncology, Anheart Therapeutics, Bristol Myers Squibb, Nuvation Bio, Boehringer Ingelheim, Daiichi Sankyo, GlaxoSmithKline, Oxford BioTherapeutics, Taiho Pharmaceutical

Research Funding: Genentech/Roche (Inst), Merck (Inst), Novartis (Inst), Boehringer Ingelheim (Inst), Exelixis (Inst), Nektar (Inst), Takeda (Inst), Adaptimmune (Inst), GlaxoSmithKline (Inst), Janssen (Inst), Revolution Medicines, Nuvalent, Inc

Patents, Royalties, Other Intellectual Property: UpToDate—Royalties

Michael Gould

Consulting or Advisory Role: American Thoracic Society

Research Funding: National Cancer Institute (Inst), Patient-Centered Outcomes Research Institute (PCORI) (Inst)

Patents, Royalties, Other Intellectual Property: Royalties paid to me by UpToDate to co-author topics on lung cancer diagnosis and staging

Heather A. Wakelee

Honoraria: Chugai Pharma

Consulting or Advisory Role: Mirati Therapeutics, OncoC4, IO Biotech, BeiGene, GlaxoSmithKline

Research Funding: Genentech/Roche (Inst), Xcovery (Inst), Bristol Myers Squibb (Inst), Merck (Inst), Seagen (Inst), Helsinn Therapeutics (Inst), Bayer (Inst), AstraZeneca (Inst)

Travel, Accommodations, Expenses: Chugai Pharma

Uncompensated Relationships: Merck, Genentech/Roche, AstraZeneca

Leah M. Backhus

Employment: Stanford University, Department of Veteran Affairs

Consulting or Advisory Role: AstraZeneca, Genentech/Roche, Johnson & Johnson/MedTech

Speakers' Bureau: Johnson & Johnson, AstraZeneca

Research Funding: Department of Veteran Affairs, NIH

Expert Testimony: Kazan LLC

Other Relationship: Genentech

Curtis Langlotz

Leadership: BunkerHill Health, Sirona Medical

Stock and Other Ownership Interests: Whiterabbit.ai, Galileo CDS, Bunker Hill, Inc, Sirona Medical, Adra.ai, Cognita, TurboRadiology

Honoraria: McKinsey & Company, Andrew Huang Educational Foundation

Consulting or Advisory Role: Whiterabbit.ai, Galileo CDS, Bunker Hill, Inc, Sirona Medical, Adra.ai, Cognita, TurboRadiology

Research Funding: Philips Healthcare (Inst), GE Healthcare (Inst), Siemens Healthineers (Inst), Google (Inst), BunkerHill Health (Inst), Carestream (Inst), CARPL (Inst), Clairity, Inc (Inst), IBM (Inst), Lambda (Inst), Lunit (Inst), Microsoft (Inst), Stability.ai (Inst), Subtle Medical (Inst), Visiana (Inst), Whiterabbit.ai (Inst), VinBrain (Inst)

Patents, Royalties, Other Intellectual Property: Patent: GENERALIZABLE MACHINE LEARNING MEDICAL PROTOCOL RECOMMENDATION, Submitted with collaborators from GE Healthcare, Medical Autoencoders and Image Compression, Submitted with collaborators from Stanford

Travel, Accommodations, Expenses: Sectra, SingHealth, Andrew Huang Educational Foundation

Other Relationship: RSNA

No other potential conflicts of interest were reported.

Figures

FIG 1.
Cohort derivation and partitioning of CT reports for NLP-based modeling for imaging indication. (A) Cohort selection: Selection of the study cohort from patients with LC at SHC for CT indication modeling using NLP. (B) Radiology notes used for NLP model training and validation: Partitioning of CT radiology reports of patients with post–5-year CT into a curation set of manually annotated reports for NLP model development and a reserve set for predicting CT indications on unannotated reports. Footnote a: Noneligible imaging was defined as imaging that was performed before the initial diagnosis or that involved chest x-rays, PET-CTs, MRI scans, imaging that did not involve the lung (eg, CT neck), and imaging whose CT reports were absent from the EHR database. CT, computed tomography; EHR, electronic health record; LC, lung cancer; MRI, magnetic resonance imaging; NLP, natural language processing; PET, positron emission tomography; SHC, Stanford Health Care.
FIG 2.
Schematic diagram of proposed hybrid model. The development of the proposed hybrid NLP-based model for predicting CT imaging indications involved an intricate process of integrating and harmonizing data from structured EHRs and unstructured CT reports. Left half: The model extracts various features from structured EHRs (such as CT scan intervals, patient symptoms, and lung disease diagnoses, primarily using ICD9/10 diagnosis codes and CPT procedure codes; see Data Supplement, Method S5). Right half: Unstructured free-text CT radiology reports are processed using a six-step NLP pipeline (outlined in the Data Supplement, Method S4) to extract the occurrence frequency of key phrases related to various aspects of lung cancer, like surveillance, recurrence, metastasis, treatments, and medications, from the CT report. After extracting these two distinct sets of features, the model employs multivariate logistic regression to combine these features to predict/classify the indication of each CT imaging into either surveillance or other reasons (eg, symptoms or metastasis treatment). CPT, current procedural terminology; CT, computed tomography; EHR, electronic health record; ICD, International Classification of Diseases; NLP, natural language processing.
FIG 3.
Analysis of model performance and association with CT imaging indications. (A) The features included in the proposed hybrid NLP-based model, with the corresponding forest plot illustrating the association between each feature (both structured EHR and NLP features) and the probability that the given CT scan was performed because of surveillance (v others); this was estimated through multivariate logistic regression. Square symbols represent OR estimates, whereas error bars denote the 95% CIs. (B) The comparative performance of the proposed hybrid model against the models using solely structured EHRs or NLP feature subsets, evaluated on a hold-out test data set. Metrics assessed include the AUC, classification accuracy, sensitivity, and specificity (Data Supplement, Method S1). (C) ROC curves for the proposed hybrid model, structured EHR-only model, and NLP-only model, displaying their performance on test data. (D) Calibration plot for the proposed model, illustrating the agreement between model-predicted probabilities and observed probabilities in the data. The 45-degree dashed diagonal line represents perfect calibration, with the plotted line in red indicating the actual model performance. CT, computed tomography; EHR, electronic health record; LC, lung cancer; NLP, natural language processing; OR, odds ratio; ROC, receiver operating characteristic.
FIG 4.
Evaluating associations between OS and CT surveillance using predicted imaging indications and characterization of temporal patterns in CT surveillance. (A) Naïve survival analysis for OS and any CT scans without imaging indication shows the Kaplan-Meier plot comparing OS between those who received any CT scans (either surveillance or others; n = 585) beyond 5-year survival—thus, not considering imaging indications—and those who did not receive any CT scans (n = 128, in black) beyond 5-year survival whose follow-up duration was matched; the hazard ratio for the association between OS and CT scans was derived using multivariate Cox regression, adjusting for sex, race/ethnicity, initial cancer stage, and histology at diagnosis, with additional correction for selection bias through inverse probability treatment weighting to balance patient characteristics. (B) Survival analysis for OS and CT surveillance using model-predicted imaging indications: Kaplan-Meier plot for analogous analysis conducted in (A) but comparing OS stratified by (1) the patients who received at least one CT classified as surveillance based on the proposed hybrid model (n = 438, in orange) versus (2) those who did not receive any CT scans (n = 128, in black) beyond 5-year survival whose follow-up duration was matched; (C) visualization of temporal surveillance patterns beyond 5 years from the initial diagnosis: Heat map with rows representing patients, columns depicting quarterly time points for 10 years from the 5-year survival, and cell colors indicating model-predicted CT indications (orange: surveillance CT, blue: other reason CT, white: no CT). Hierarchical clustering of patients (rows) based on temporal CT indication profiles revealed two patient groups: one primarily composed of patients with other-reason CT scans (bottom half of heat map), while the other comprised patients with a mix of surveillance and other-reason scans (upper half of heat map). 
(D) Survival analysis for OS and annual CT surveillance based on temporal analysis shows the analogous analysis conducted in (A) but comparing OS between (1) those who received at least one annual surveillance CT scan (n = 332, in orange) beyond 5-year survival, identified through a moving window approach (see Data Supplement, Method S8) considering imaging indications, and (2) those who did not receive any CT scans (n = 128, in black) beyond 5-year survival. CT, computed tomography; HR_adjusted, adjusted hazard ratio obtained using multivariate Cox regression, adjusting for sex, race/ethnicity, and initial cancer histology at diagnosis, with additional correction for selection bias through inverse probability treatment weighting to balance patient characteristics; OS, overall survival.
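The inverse probability treatment weighting mentioned in the FIG 4 caption can be illustrated with the standard weight formula: a patient in the CT group is weighted by 1/e and a patient in the no-CT group by 1/(1 - e), where e is the estimated probability of being in the CT group given covariates. The sketch below is a generic illustration, not the authors' code. The function name `iptw_weights`, the `stabilized` option, and the example propensity scores are assumptions, and the source does not say whether stabilized weights were used.

```python
def iptw_weights(treated, propensity, stabilized=False):
    """Inverse probability of treatment weights.

    treated:    list of 0/1 flags (1 = received the exposure, e.g. post-5-year CT).
    propensity: estimated P(treated = 1 | covariates) for each patient.
    stabilized: if True, multiply by the marginal probability of the
                patient's own group to reduce weight variance.
    """
    if len(treated) != len(propensity):
        raise ValueError("treated and propensity must have the same length")
    p_treated = sum(treated) / len(treated)
    weights = []
    for t, ps in zip(treated, propensity):
        if not 0.0 < ps < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        if t == 1:
            w = 1.0 / ps
            if stabilized:
                w *= p_treated
        else:
            w = 1.0 / (1.0 - ps)
            if stabilized:
                w *= 1.0 - p_treated
        weights.append(w)
    return weights


# Hypothetical propensity scores, for illustration only.
example_weights = iptw_weights([1, 0, 1, 0], [0.8, 0.8, 0.25, 0.25])
```

Patients whose group membership was unlikely given their covariates (for example, an exposed patient with a propensity of 0.25) receive larger weights, which is how the weighted sample comes to have more balanced characteristics across groups. In an analysis of this kind, such weights would typically be passed to a weighted Cox regression.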
