Multicenter Study
Nat Commun. 2025 Jan 2;16(1):84.
doi: 10.1038/s41467-024-55594-z.

Integrated multiomics signatures to optimize the accurate diagnosis of lung cancer

Mengmeng Zhao et al. Nat Commun.

Abstract

Diagnosing lung cancer from indeterminate pulmonary nodules (IPLs) remains challenging. In this multi-institutional study involving 2032 participants with IPLs, we integrate clinical and radiomic features with circulating cell-free DNA fragmentomic features in 5-methylcytosine (5mC)-enriched regions to establish a multiomics model (clinic-RadmC) for predicting the malignancy risk of IPLs. The clinic-RadmC yields an area under the curve (AUC) of 0.923 on the external test set, outperforming the single-omics models and the models that combine clinical features with only radiomic features or only fragmentomic features in 5mC-enriched regions (p < 0.050 for all). The superiority of the clinic-RadmC is maintained even after adjusting for clinic-radiological variables. Furthermore, the clinic-RadmC-guided strategy could reduce unnecessary invasive procedures for benign IPLs by 10.9% to 35% and avoid delayed treatment for lung cancer by 3.1% to 38.8%. In summary, our study indicates that the clinic-RadmC provides a more effective and noninvasive tool for optimizing lung cancer diagnoses, thus facilitating precision interventions.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Flowchart illustrating the participant selection scheme.
A Participants selected from Shanghai Pulmonary Hospital; B participants selected from 4 external hospitals.
Fig. 2. Performance comparison between fragmentomic models and functional analyses.
A Schematic of the process for determining the first 6-nucleotide sequence (i.e., a 6-mer end motif) on each 5’ fragment end of cfDNA relative to the hg19 reference genome; B Hierarchical clustering analyses of the selected 6-bp end motifs derived from 5mC-sequencing data; Receiver operating characteristic analyses of the epigenomic models on the validation set C, internal test set D, and external test set E; F Bar charts showing the TFs identified by the 6-mer end motifs selected from 5mC-sequencing data; G Bar charts showing the regulatory target genes of these identified TFs; H The top 15 most enriched GO terms based on the target genes. The areas under the receiver operating characteristic curves were compared via DeLong’s test. All statistical tests were two-sided, with p < 0.05 indicative of a statistically significant difference. 4bp-5mC, the model established by the 4-mer end motifs selected from 5mC-sequencing data; 6bp-5mC, the model established by the 6-mer end motifs selected from 5mC-sequencing data; 4bp-5hmC, the model established by the 4-mer end motifs selected from 5hmC-sequencing data; 6bp-5hmC, the model established by the 6-mer end motifs selected from 5hmC-sequencing data; 5mC, 5-methylcytosine; 5hmC, 5-hydroxymethylcytosine; ROCs, receiver operating characteristic curves; AUC, area under the ROC curve; CI, confidence interval; Sens, sensitivity; Spec, specificity; PPV, positive predictive value; NPV, negative predictive value; Accur, accuracy; TFs, transcription factors; GO, gene ontology. Source data are provided as a Source Data file.
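The end-motif step in panel A can be illustrated with a minimal Python sketch. It assumes the 5’-end sequence of each fragment has already been extracted from the hg19 reference at the aligned coordinates and is passed in as a plain string. The function name `end_motif_frequencies` and its inputs are hypothetical, not the authors' pipeline.

```python
from collections import Counter

def end_motif_frequencies(fragment_seqs, k=6):
    """Return the relative frequency of each k-mer found at fragment 5' ends.

    fragment_seqs: iterable of strings read 5'->3' (assumed to be taken
    from the reference genome at each fragment's aligned start).
    """
    counts = Counter()
    for seq in fragment_seqs:
        motif = seq[:k].upper()
        # Skip fragments that are too short or contain ambiguous bases.
        if len(motif) == k and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {motif: n / total for motif, n in counts.items()}

# Toy input: two fragments share the end motif CCCAAG.
freqs = end_motif_frequencies(["CCCAAGTT", "cccaagGG", "ACGTACGT", "NNNNNNAA", "ACG"])
```

With `k=6` there are 4^6 = 4096 possible motifs, so a real feature table would normally be filled out to that fixed set of columns before modeling.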
Fig. 3. Construction procedures, performance comparison and explanation analyses employed for the radiomic models.
Construction procedures utilized for the DL-radiomics model A and the C-radiomics model B; Receiver operating characteristic analyses of the DL-radiomics model and C-radiomics model on the validation set C, the internal test set D, and the external test set E; F The attribution score distribution of each deep learning-based radiomics feature on the training set; G The impact of the deep learning-based radiomics features included in the DL-radiomics model on the risk probabilities output for the training set; H Volcano plot for differentially expressed genes between the low-risk and high-risk subgroups classified by the DL-radiomics model; I GSEA conducted for upregulated genes; J GSEA conducted for downregulated genes; K ssGSEA conducted for the low-risk and high-risk subgroups classified by the DL-radiomics model. The center of the box denotes the 50th percentile, the bounds of the box contain the 25th to 75th percentiles, the whiskers mark the maximum and minimum values, and values beyond these upper and lower whiskers are considered outliers and are marked with dots. n = 116 biologically independent samples were analyzed. All statistical tests were two-sided, with p < 0.05 indicative of a statistically significant difference. DL-radiomics, the deep learning-based radiomics model; C-radiomics, the classic radiomics model; ROCs, receiver operating characteristic curves; AUC, area under the ROC curve; CI, confidence interval; DL, deep learning; KEGG, Kyoto Encyclopedia of Genes and Genomes; MDSC, myeloid-derived suppressor cell; GSEA, gene set enrichment analyses; ssGSEA, single-sample gene set enrichment analyses. Source data are provided as a Source Data file.
Fig. 4. Feature interaction and predictive performance analyses of the models.
Pearson correlation coefficient analyses for the features included in the multiomics model on validation set A, internal test set B, and external test set C. Receiver operating characteristic analyses and performance metrics for the models on validation set D, internal test set E, and external test set F. All statistical tests were two-sided, with p < 0.05 indicative of a statistically significant difference; *** denotes p < 0.001, ** denotes p < 0.01, and * denotes p < 0.05. Size, the radiological solid component size of pulmonary nodules; Radiomics, the deep learning-based radiomics model score; 6bp-5mC, the model score established by the 6-mer end motifs selected from 5mC-sequencing data; 6bp-5hmC, the model score established by the 6-mer end motifs selected from 5hmC-sequencing data; clinical, the model established by age and the radiological solid component size of the pulmonary nodule; clinic-Radiomics, the model established by combining clinical variables with the DL-radiomics model score; clinic-mC, the model established by combining clinical variables with the 6bp-5mC model score; clinic-RadmC, the model established by combining clinical variables and the DL-radiomics model score with the 6bp-5mC model score; clinic-Rad(h)mC, the model established by combining clinical variables, the DL-radiomics model score, and the 6bp-5mC model score with the 6bp-5hmC model score. ROCs, receiver operating characteristic curves; AUC, area under the ROC curve; Sens, sensitivity; Spec, specificity; PPV, positive predictive value; NPV, negative predictive value; Accur, accuracy. Source data are provided as a Source Data file.
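The performance metrics abbreviated in this caption follow the standard confusion-matrix definitions. The Python sketch below computes them from raw counts. The function name and the example counts are illustrative assumptions, not values from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute Sens, Spec, PPV, NPV and Accur from confusion-matrix counts.

    tp/fn: malignant cases predicted high-risk / low-risk.
    tn/fp: benign cases predicted low-risk / high-risk.
    """
    def ratio(num, den):
        # Return NaN rather than raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "Sens": ratio(tp, tp + fn),    # sensitivity
        "Spec": ratio(tn, tn + fp),    # specificity
        "PPV": ratio(tp, tp + fp),     # positive predictive value
        "NPV": ratio(tn, tn + fn),     # negative predictive value
        "Accur": ratio(tp + tn, tp + fp + tn + fn),  # accuracy
    }

# Made-up counts for illustration only.
metrics = diagnostic_metrics(tp=80, fp=10, tn=90, fn=20)
```

PPV and NPV depend on the prevalence of malignancy in the evaluated set, so they are not directly comparable across cohorts with different case mixes.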
Fig. 5. Explainability analysis results of the clinic-RadmC model.
SHAP analysis results ranking the impact of the continuous features included in the clinic-RadmC model on the risk probabilities output for the validation set A, internal test set B, and external test set C; two example participants with similar clinic-radiological characteristics in the low-risk D and high-risk E subgroups, respectively. Each participant is represented by a single dot on each feature row. The horizontal position of the dot is determined by the SHAP value of that feature, and dots are accumulated along each feature row to show density. DL-radiomics, deep learning-based radiomic model; 6bp-5mC, the model established by the 6-mer end motifs selected from 5mC-sequencing data; solid size, the radiological solid component size of pulmonary nodules; SHAP, Shapley additive explanations. Source data are provided as a Source Data file.
Fig. 6. Performance comparisons for the models after adjusting for clinical and radiological factors.
Predictive performance achieved by the models in subgroups stratified by age A, B, radiological image C–E, and nodule size F–H. Pure-GGO nodules comprised only GGO, part-solid nodules consisted of GGO and solid components, and pure-solid nodules had only solid components without GGO. Subcentimeter pulmonary nodules were defined as nodules with solid component size ≤ 10 mm, large nodules as those with 15 mm ≤ solid component size ≤ 30 mm, and pulmonary masses as those with solid component size > 30 mm. 6bp-5mC, the model established by the 6-mer end motifs selected from 5mC-sequencing data; DL-radiomics, the deep learning-based radiomic model score; clinic-mC, the model established by combining clinical variables with the 6bp-5mC model score; clinic-Radiomics, the model established by combining clinical variables with the DL-radiomics model score; clinic-Rad(h)mC, the model established by combining clinical variables, the DL-radiomics model score, and the 6bp-5mC model score with the 6bp-5hmC model score; clinic-RadmC, the model established by combining clinical variables and the DL-radiomics model score with the 6bp-5mC model score. AUCs, areas under the receiver operating characteristic curves; GGO, ground-glass opacity. Source data are provided as a Source Data file.
Fig. 7. Reclassification performance achieved by the multiomics models on the combined test set (n = 658).
A Confusion matrices illustrating the predicted outcomes generated by the multiomics model in comparison with the actual outcomes, as well as between the multiomics model and the clinical model B, the DL-radiomics model C, the 6bp-5mC model D, the clinic-Radiomics model E, and the clinic-mC model F, with emphasis placed on the patients ruled in and ruled out. The dotted lines demarcate the corresponding cutoff values of the different models. The numbers labeled with * refer to cancer cases misclassified as low-risk samples by the x-axis model but correctly reclassified as high-risk samples by the multiomics model on the y-axis, whereas the numbers labeled with # refer to benign cases misclassified as high-risk samples by the x-axis model but correctly reclassified as low-risk samples by the multiomics model on the y-axis. Source data are provided as a Source Data file.
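The two reclassification counts marked with * and # can be tallied as in the Python sketch below. It assumes each model outputs a risk score and that a score at or above the model's cutoff means high-risk; the caption does not state that convention. The function name, scores, and cutoffs are hypothetical.

```python
def reclassification_counts(y_true, ref_scores, multi_scores, ref_cutoff, multi_cutoff):
    """Count cases the multiomics model correctly reclassifies.

    y_true: 1 for lung cancer, 0 for benign.
    ref_scores: scores from the comparison (x-axis) model.
    multi_scores: scores from the multiomics (y-axis) model.
    Assumption: score >= cutoff is treated as high-risk.
    Returns (ruled_in, ruled_out), matching the * and # counts.
    """
    ruled_in = 0   # cancers: low-risk by reference, high-risk by multiomics
    ruled_out = 0  # benign: high-risk by reference, low-risk by multiomics
    for y, ref, multi in zip(y_true, ref_scores, multi_scores):
        ref_high = ref >= ref_cutoff
        multi_high = multi >= multi_cutoff
        if y == 1 and not ref_high and multi_high:
            ruled_in += 1
        elif y == 0 and ref_high and not multi_high:
            ruled_out += 1
    return ruled_in, ruled_out

# Toy data for illustration only.
counts = reclassification_counts(
    y_true=[1, 1, 0, 0, 1],
    ref_scores=[0.2, 0.8, 0.7, 0.1, 0.3],
    multi_scores=[0.9, 0.9, 0.2, 0.1, 0.1],
    ref_cutoff=0.5,
    multi_cutoff=0.5,
)
```

In this toy input the first case is a cancer ruled in and the third is a benign nodule ruled out. The fifth is a cancer that both models miss, so it adds to neither count.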
