Transl Lung Cancer Res. 2025 Apr 30;14(4):1118-1137.
doi: 10.21037/tlcr-24-875. Epub 2025 Apr 25.

Development and validation of machine learning models based on molecular features for estimating the probability of multiple primary lung carcinoma versus intrapulmonary metastasis in patients presenting multiple non-small cell lung cancers

Ning Liu et al. Transl Lung Cancer Res. 2025.

Abstract

Background: Classifying multiple non-small cell lung cancers (NSCLCs) as multiple primary lung cancers (MPLCs) or intrapulmonary metastases (IPMs) is critical but remains challenging. This study aimed to develop and validate machine learning (ML) models based on molecular features for estimating the probability of MPLC or IPM in patients presenting with multiple NSCLCs.

Methods: A total of 72 patients with multiple NSCLCs and 157 surgically resected tumor lesions, treated from January 2012 to January 2018 at two institutions, were included for model development and testing. Specifically, 46 patients with 103 tumors defined as definitive MPLC or IPM according to International Association for the Study of Lung Cancer (IASLC) criteria were used to develop the models. They were split into training and validation sets using stratified random sampling and five-fold cross-validation. The developed models were then tested in the other 26 patients, whose tumors could not be classified by traditional methods. Whole-exome sequencing (WES) was performed on all included tumor samples. Four molecular features were calculated to characterize tumor relatedness and served as model inputs: genetic divergence, shared mutation number, Pearson correlation coefficient, and early mutation number. Decision trees (DT), random forests (RF), and gradient boosting decision trees (GBDT) were employed, with performance assessed by area under the curve (AUC), accuracy, precision, recall, and F1 score in the validation set. Disease-free survival (DFS) was used to evaluate model performance in the test cohort. Clinical and genetic characteristics were then compared between the MPLC and IPM populations.
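As an illustration of how two of the four features could be derived for a tumor pair, the following minimal Python sketch computes a shared mutation count and a Pearson correlation coefficient. The abstract does not give the exact definitions (these are in the paper's Appendix 1), so the use of variant allele frequencies (VAFs), the treatment of absent mutations as 0.0, the function names, and the mutation data are all assumptions made for illustration only.

```python
from math import sqrt

def shared_mutation_count(tumor_a, tumor_b):
    """Count somatic mutations detected in both tumors of a pair."""
    return len(set(tumor_a) & set(tumor_b))

def pearson_vaf_correlation(tumor_a, tumor_b):
    """Pearson correlation of VAFs over the union of mutations.

    Assumption: a mutation absent from one tumor is given VAF 0.0.
    """
    keys = sorted(set(tumor_a) | set(tumor_b))
    xs = [tumor_a.get(k, 0.0) for k in keys]
    ys = [tumor_b.get(k, 0.0) for k in keys]
    n = len(keys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / sqrt(var_x * var_y)

# Hypothetical tumor pairs: mutation label -> VAF (not data from the study).
ipm_t1 = {"EGFR:L858R": 0.40, "TP53:R273H": 0.35, "RBM10:Q255*": 0.20}
ipm_t2 = {"EGFR:L858R": 0.38, "TP53:R273H": 0.30, "KEAP1:G333C": 0.10}
mplc_t1 = {"EGFR:L858R": 0.25, "STK11:Q37*": 0.15}
mplc_t2 = {"KRAS:G12C": 0.30, "TP53:R175H": 0.22}

for name, (a, b) in {"IPM-like": (ipm_t1, ipm_t2),
                     "MPLC-like": (mplc_t1, mplc_t2)}.items():
    print(name, shared_mutation_count(a, b),
          round(pearson_vaf_correlation(a, b), 2))
```

In this toy data the clonally related (IPM-like) pair shares mutations and shows a positive correlation, while the independent (MPLC-like) pair shares none and shows a negative one, matching the direction of the differences reported in the Results.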

Results: All four molecular features differed significantly between MPLC and IPM patients in the development cohort. Specifically, MPLC exhibited higher genetic divergence and lower shared mutation number, Pearson correlation coefficient, and early mutation number than IPM (P<0.001). The DT, RF, and GBDT models developed with these features achieved mean AUCs of 0.94 [standard deviation (SD) 0.09], 1.00 (SD 0.00), and 1.00 (SD 0.00) in the validation set, respectively. The three models consistently classified the undetermined multiple NSCLCs as MPLC (n=15) or IPM (n=11). Patients with MPLC identified by the ML models had significantly longer DFS than those with IPM [hazard ratio =0.21; 95% confidence interval (CI): 0.04-1.0; P=0.04]. MPLC patients had a relatively higher prevalence of a family history of cancer in first-degree relatives, and more than half of these patients reported a family history of lung cancer. EGFR remained the most commonly mutated driver gene in both the MPLC and IPM populations.
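To make the reported "mean AUC (SD)" concrete, here is a small Python sketch of how a per-fold AUC can be computed from predicted scores and then summarized across five folds. It uses the pairwise (rank-based) definition of AUC. The labels, scores, and fold values are hypothetical, and the use of the sample standard deviation is an assumption, since the abstract does not say which SD was reported.

```python
from statistics import mean, stdev

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def summarize_folds(fold_aucs):
    """Mean and sample SD of per-fold AUCs (sample SD is an assumption)."""
    return mean(fold_aucs), stdev(fold_aucs)

# Hypothetical validation folds: label 1 = MPLC, 0 = IPM (not study data).
perfect_fold = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
imperfect_fold = roc_auc([1, 1, 0, 0], [0.9, 0.3, 0.4, 0.1])

print(perfect_fold, imperfect_fold)
print(summarize_folds([1.0, 1.0, 1.0, 1.0, 1.0]))
```

Because AUC cannot exceed 1, a mean of 1.00 with an SD of 0.00 implies that, to the reported precision, every validation fold was separated perfectly.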

Conclusions: ML models based on molecular features effectively discriminate primary tumors from metastases in multiple NSCLCs, which improves the accuracy of multiple NSCLC diagnosis and assists clinical decision-making, particularly in challenging cases.

Keywords: Multiple primary lung cancer (MPLC); intrapulmonary metastases (IPMs); machine learning (ML); non-small cell lung cancer (NSCLC).

Conflict of interest statement

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-24-875/coif). F.X. and W.C. are currently employees of Genecast Biotechnology Co., Ltd. The other authors have no conflicts of interest to declare.

Figures

Figure 1
Participant flow diagram. IPM, intrapulmonary metastasis; ML, machine learning; MPLC, multiple primary lung cancer; NSCLC, non-small cell lung cancer; WES, whole-exome sequencing.
Figure 2
Graphical summary of the study design. (A) Collection of multiple lung tumors and their corresponding normal lung tissues, followed by WES. (B) Analysis of four molecular features based on WES data. (C) Establishment of ML predictive models using three ML algorithms based on the four molecular features in the model development cohort. Representative histology and histological subtype appearances of MPLC lesions are illustrated by hematoxylin-eosin staining (scale bar =50 μm), together with a representative CT scan of IPM tumors. Red circles indicate sites of tumors. (D) Application of the trained ML models to patients in the test cohort. Full details of the analyses are provided in the main text and Appendix 1. AAH, atypical adenomatous hyperplasia; ADC, adenocarcinoma; AIS, adenocarcinoma in situ; CT, computed tomography; DT, decision trees; GBDT, gradient boosting decision trees; GL, germ line; IAC, invasive adenocarcinoma; IPM, intrapulmonary metastasis; MIA, minimally invasive adenocarcinoma; ML, machine learning; MPLC, multiple primary lung cancer; RF, random forests; SI, Shannon Index; SCC, squamous cell carcinoma; T1, tumor 1; T2, tumor 2; WES, whole-exome sequencing.
Figure 3
Molecular features of MPLC and IPM patients in the development cohort. (A) Distribution of the minimum pairwise genetic divergence index (ΔSI) in MPLC and IPM patients. (B) Numbers of identified somatic mutations that were either shared or unique in tumor pairs. (C) Pearson correlation analysis of mutations in paired tumors. (D) Early mutation number based on the phylogenetic trees for tumor pairs. Statistical significance: ****, P<0.001. IPM, intrapulmonary metastasis; MPLC, multiple primary lung cancer; SI, Shannon Index.
Figure 4
ML model performance in the validation set. ROC curves of the (A) DT, (B) RF and (C) GBDT algorithms using five-fold cross-validation for classification of multiple NSCLCs in the validation set. AUC, area under the curve; DT, decision trees; GBDT, gradient boosting decision trees; ML, machine learning; NSCLCs, non-small cell lung cancers; RF, random forests; ROC, receiver operating characteristic.
Figure 5
ML model performance in the test cohort. (A) The minimum pairwise genetic divergence indices (ΔSI). (B) Numbers of identified somatic mutations that were either shared or unique in tumor pairs. (C) Pearson correlation analysis of mutations in paired tumors. (D) Early mutation number based on the phylogenetic trees for tumor pairs. (E) DFS curve for MPLC and IPM patients classified by the ML models. MPLC patients identified by the ML models showed an advantage in DFS (P<0.05). (F) Comparison of ACCP criteria assessment with clonality defined by the ML models. Chest CT scan and hematoxylin-eosin staining sections (scale bar =50 μm) of multiple primary tumors from case Pt84 (G) and intrapulmonary metastasis from case Pt75 (H). The red arrowheads indicate sites of tumors. (I) DFS curve for MPLC and IPM patients classified according to the ACCP guideline. No significant difference was found between the two groups. Statistical significance: ****, P<0.001. ACCP, American College of Chest Physicians; CI, confidence interval; CT, computed tomography; DFS, disease-free survival; HR, hazard ratio; IPM, intrapulmonary metastasis; LLL, left lower lobe; ML, machine learning; MPLC, multiple primary lung cancer; RLL, right lower lobe; RML, right middle lobe; SI, Shannon Index; T1, tumor 1; T2, tumor 2.
Figure 6
Somatic mutation analysis of multiple NSCLCs. (A) The genetic landscape of high-frequency molecular alterations detected in 157 samples. The frequency of each mutation is shown on the right. The types of alteration are represented by the colors indicated. (B) Bar plot of the prevalence of EGFR mutation (light blue) in MPLC at the patient level. The pie chart shows the mutation concordance and the number of patients with each mutation pattern. (C) Frequency distributions of EGFR mutation subtypes in MPLC samples. (D) TMB distribution in MPLC and IPM samples. (E) Line graph of the TMB change between paired tumors in MPLC and IPM patients. Statistical significance: **, P<0.01. IPM, intrapulmonary metastasis; Mb, megabase; MPLC, multiple primary lung cancer; Mut, mutation; NSCLCs, non-small cell lung cancers; Pts, patients; SNV, single nucleotide variant; T1, tumor 1; T2, tumor 2; T3, tumor 3; T4, tumor 4; T5, tumor 5; TMB, tumor mutation burden; WT, wild type.
Figure 7
Germline mutation analysis of multiple NSCLCs. (A) The P/LP germline mutation spectrum for MPLC and IPM patients in this study. (B) Bar plot indicating the prevalence of P/LP germline variants in MPLC and IPM patients. (C) Frequency of family cancer history in MPLC patients with or without P/LP variants. (D) The age of onset for MPLC patients with or without P/LP mutations. (E) Bar plots showing the frequency of P/LP germline variants in MPLC patients (yellow), females (pink) and males (gray) of different ages. Fam, family cancer history; IPM, intrapulmonary metastasis; MPLC, multiple primary lung cancer; NS, nonsignificant; NSCLCs, non-small cell lung cancers; P/LP, pathogenic or likely pathogenic; Pts, patients; SNV, single nucleotide variant.
