Cell Rep Med. 2025 Feb 18;6(2):101924.
doi: 10.1016/j.xcrm.2024.101924. Epub 2025 Jan 22.

Multimodal integration using a machine learning approach facilitates risk stratification in HR+/HER2- breast cancer

Hang Zhang et al. Cell Rep Med. 2025.

Abstract

Hormone receptor-positive (HR+)/human epidermal growth factor receptor 2-negative (HER2-) breast cancer is the most common type of breast cancer, with continuous recurrence remaining an important clinical issue. Current relapse predictive models in HR+/HER2- breast cancer patients still have limitations. The integration of multidimensional data represents a promising alternative for predicting relapse. In this study, we leverage our multi-omics cohort comprising 579 HR+/HER2- breast cancer patients (200 patients with complete data across 7 modalities) and develop a machine-learning-based model, namely CIMPTGV, which integrates clinical information, immunohistochemistry, metabolomics, pathomics, transcriptomics, genomics, and copy number variations to predict recurrence risk of HR+/HER2- breast cancer. This model achieves concordance indices (C-indices) of 0.871 and 0.869 in the train and test sets, respectively. The risk population predicted by the CIMPTGV model encompasses those identified by single-modality models. Feature analysis reveals that synergistic and complementary effects exist in different modalities. Simultaneously, we develop a simplified model with a mean area under the curve (AUC) of 0.840, presenting a useful approach for clinical applications.
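The C-index reported above measures how often a model ranks patients in the correct order of relapse. A minimal sketch of Harrell's C-index for right-censored data follows; it is not the authors' implementation, it skips pairs with tied follow-up times as a simplification, and the toy data are invented for illustration only.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data.

    A pair is comparable when the patient with the shorter follow-up time
    actually had the event. The pair is concordant when that patient also
    has the higher predicted risk; tied risk scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are skipped
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier time is censored, so the true order is unknown
        comparable += 1
        if risk_scores[first] > risk_scores[second]:
            concordant += 1.0
        elif risk_scores[first] == risk_scores[second]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical, not from the study)
times = [5, 8, 12, 20, 25]        # follow-up time
events = [1, 1, 0, 1, 0]          # 1 = relapse observed, 0 = censored
risk = [0.9, 0.3, 0.4, 0.5, 0.1]  # predicted risk scores

print(concordance_index(times, events, risk))  # prints 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported 0.871 and 0.869 in context.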

Keywords: HR+/HER2− breast cancer; machine learning; multimodal integration; risk stratification.

Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

Figures

Graphical abstract
Figure 1
Cohort development and machine learning framework construction. (A) UpSet plot illustrating the inclusive intersection sizes for different combinations of modalities. The vertical bars in the upper plot represent the number of patients in each modality combination, indicated by the black circles in the plot below. The modalities included clinical information (N = 547), immunohistochemistry data (N = 510), transcriptomics data (N = 565), metabolomics data (N = 380), genomics data (N = 467), copy-number variation data (N = 429), and pathomics data (N = 418). The set sizes vary from 200 to 565. (B) The machine learning flowchart consists of data partitioning, feature extraction, dimension reduction, model training, and independent validation. The train and test sets varied according to the specific combination of modalities used. Abbreviations: IHC, immunohistochemistry; CNV, copy-number variation; CV, cross-validation.
Figure 2
Multimodal integration improves the prediction efficacy and benefits risk stratification (A and B) The C-indices of models combining different modalities were presented for both the train set (A, left) and the test set (B, left). Different model types with different numbers of modalities were distinguished by color. Comparisons of C-indices between models with different numbers of modalities (right) were shown. The interior boxes represented the 25th, 50th, and 75th percentiles. The bars indicated the C-indices of each model, and the lower and upper points of the error bars in the test set represented the 95% confidence intervals, derived from 1,000-fold bootstrapping. Dots indicated individual C-index, and the colors differentiated between model types (∗p < 0.05; ∗∗p < 0.01; ∗∗∗p < 0.001, p values were obtained from a Mann-Whitney U-test). (C) Kaplan-Meier survival analyses were conducted to compare RFS, OS, and DMFS between the high-risk and low-risk groups stratified by the CIMPTGV model in both the train and test sets. The p value was obtained from the log rank test. (D) Hazard ratio with 95% confidence intervals for each model in the test set calculated by using a univariate Cox regression analysis. Abbreviations: IHC, immunohistochemistry; CNV, copy-number variation; Patho, pathomics; C-index, concordance index; RFS, relapse-free survival; OS, overall survival; DMFS, distant metastasis-free survival.
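The Kaplan-Meier comparison in panel (C) rests on the product-limit estimator. The sketch below shows that estimator for one risk group; it is an illustrative stand-in rather than the authors' code, it omits the log rank test, and the follow-up times are made up.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    Returns a list of (time, survival_probability) pairs, one per distinct
    time at which at least one event was observed.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        n_leaving = sum(1 for ti in times if ti == t)  # events plus censored
        if n_events:
            survival *= (at_risk - n_events) / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving  # everyone observed at t leaves the risk set
    return curve


# Hypothetical high-risk group: relapse at 3, 5, and 9; censored at 7
high_risk_curve = kaplan_meier([3, 5, 7, 9], [1, 1, 0, 1])
for t, s in high_risk_curve:
    print(f"t={t}: S(t)={s:.2f}")
```

Running the same function on the low-risk group and plotting both step curves gives the kind of comparison shown in the figure; a log rank test would then be needed to attach a p value.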
Figure 3
Orthogonal data exist in multiple modalities, thus improving predictive efficacy (A) Pearson correlation coefficients of the prediction scores of each single-modality model in the test set. (B) A Venn diagram comparing the high-risk patients identified by each single-modality model in the test set. For each single-modality model, individual predictive scores were generated, and patients with predictive scores above the cutoff score were selected as risk populations. The number of common risk populations between different models was indicated in the figure. (C) Heatmap showing the scaled predicted risk score from the CIMPTGV model and other single-modality models in the test set, with RFS status annotated above. Abbreviations: IHC, immunohistochemistry; CNV, copy-number variation; RFS, relapse-free survival.
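The analyses in panels (A) and (B) can be sketched as follows: a Pearson correlation between the prediction scores of two single-modality models, and the overlap of the patients each model places above a cutoff. The patient IDs, scores, and the 0.5 cutoff are hypothetical; the source does not state the cutoff used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def high_risk_ids(scores, cutoff):
    """Patients whose predictive score is above the cutoff."""
    return {pid for pid, score in scores.items() if score > cutoff}


# Hypothetical prediction scores from two single-modality models
clinical = {"P1": 0.9, "P2": 0.2, "P3": 0.6, "P4": 0.1}
transcriptomic = {"P1": 0.3, "P2": 0.8, "P3": 0.7, "P4": 0.2}

ids = sorted(clinical)
r = pearson([clinical[p] for p in ids], [transcriptomic[p] for p in ids])
print(f"Pearson r = {r:.2f}")

cutoff = 0.5  # assumed cutoff, for illustration only
risk_clinical = high_risk_ids(clinical, cutoff)
risk_transcriptomic = high_risk_ids(transcriptomic, cutoff)
print("shared high-risk:", sorted(risk_clinical & risk_transcriptomic))
print("either model:", sorted(risk_clinical | risk_transcriptomic))
```

A weak correlation together with a small shared set, as in this toy example, is the pattern the caption describes as orthogonal information across modalities.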
Figure 4
Manifestation of features of each modality in the CIMPTGV model (A) Heatmap illustrating the Z scores of multimodal features used in the CIMPTGV model, with patients arranged according to the predicted risk scores generated by the model. Clinical, IHC, and selected genomics features were annotated above the heatmap, while metabolomics, pathomics, transcriptomics, and CNV were displayed within the heatmap. Somatic mutations of genes were displayed beneath the heatmap. (B) Comparisons of pN, pT, Ki-67 proliferation index, fatty acid metabolism pathway, MYC target pathway, and tumor cell number between risk groups (∗p < 0.05; ∗∗p < 0.01; ∗∗∗p < 0.001, p value was obtained from the chi-square test and the Mann-Whitney U-test). Abbreviations: BMI, body mass index; TMB, tumor mutational burden; LOH, loss of heterozygosity; HRD, homologous recombination deficiency; GLCM, gray-level co-occurrence matrix; ASM, angular second moment of the co-occurrence matrix; I, immune cell; T, tumor cell; S, stroma cell; MYC, Myc proto-oncogene; HR, high risk; LR, low risk.
Figure 5
Modality correlation analysis further supports the existence of complementary information. (A) Comparison of HRD scores between the high-risk and low-risk groups stratified by the CIMPTGV model. Dots depict individual HRD scores, and the colors indicate different groups (∗∗p < 0.01; the p value was obtained from a two-sided t test). (B) Pearson correlation (two-sided) between the 11q13.3 copy numbers and HRD score, with shaded regions representing 95% confidence intervals. The p value was calculated using a two-sided t test. (C) Comparison of somatic copy numbers of three individual genes located at 11q13.3 between the high- and low-HRD-score groups, stratified by using the median HRD score as cutoff (∗∗p < 0.01; the p value was obtained from a two-sided t test). (D) Pearson correlation (two-sided) between the somatic copy numbers and gene mRNA levels of three individual genes located at 11q13.3. Shaded regions represent 95% confidence intervals. The p value was calculated using a two-sided t test. (E and F) Comparison of the spatial betweenness of tumor cells, the number of tumor cells, and MITH scores between the high- and low-HRD-score groups, stratified by using the median HRD score as cutoff (∗p < 0.05; the p value was obtained from a two-sided t test). (G) Comparison of the abundance of nucleic acid metabolites in the high- and low-HRD groups (∗p < 0.05; the p value was obtained from a two-sided t test). Abbreviations: HRD, homologous recombination deficiency; FGF3, fibroblast growth factor 3; FGF4, fibroblast growth factor 4; CTTN, cortactin; MITH, morphological intratumor heterogeneity; HR, high risk; LR, low risk.
Figure 6
Simplified model construction promotes clinical application (A) The workflow for constructing the simplified CIMPTGV model. Initially, all features from clinical information, immunohistochemistry and pathomics dimensions were incorporated. High-importance features from these dimensions were selected, and the filtered feature matrix was input into the machine learning framework to generate the predictive score. (B) The scaled feature importance scores of features from different categories used in the simplified CIMPTGV model. Features were ranked by importance scores, with distinct feature categories represented by different colors. (C) The area under the curve (AUC) of the time-dependent receiver operating characteristic (ROC) curve was computed for five models: CT (blue), C (orange), T (green), CIMPTGV (red), and the simplified CIMPTGV (purple) model. The mean AUC for each model was presented, with distinct colors used to differentiate the models. (D) Kaplan-Meier survival analyses comparing RFS, OS, and DMFS between the high-risk and low-risk groups stratified by the S-CIMPTGV model in the test set. The p values were obtained from the log rank test. Abbreviations: BMI, body mass index; ER, estrogen receptor; PR, progesterone receptor; VEGF, vascular endothelial growth factor; SPTA1, spectrin alpha, erythrocytic 1; ABCA13, ATP binding cassette subfamily A member 13; AKT1, AKT serine/threonine kinase 1; S-CIMPTGV, simplified CIMPTGV; AUC, area under the curve; CT, model integrated clinical information and transcriptomics; HR, high risk; LR, low risk.
