Systematic assessment of prognostic molecular features across cancers

Balaji Santhanam¹,²,³,⁴, Panos Oikonomou¹,²,³,⁴, Saeed Tavazoie¹,²,³,⁴

Affiliations

¹ Department of Biological Sciences, Columbia University, New York, NY 10027, USA.
² Department of Systems Biology, Columbia University, New York, NY 10032, USA.
³ Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA.
⁴ Irving Institute for Cancer Dynamics, Columbia University, New York, NY 10032, USA.

PMID: 36950380
PMCID: PMC10025453
DOI: 10.1016/j.xgen.2023.100262

Balaji Santhanam et al. Cell Genom. 2023 Feb 2;3(3):100262. doi: 10.1016/j.xgen.2023.100262. eCollection 2023 Mar 8.

Abstract

Precision oncology promises accurate prediction of disease trajectories by utilizing molecular features of tumors. We present a systematic analysis of the prognostic potential of diverse molecular features across large cancer cohorts. We find that the mRNA expression of biologically coherent sets of genes (modules) is substantially more predictive of patient survival than single-locus genomic and transcriptomic aberrations. Extending our analysis beyond existing curated gene modules, we find a large novel class of highly prognostic DNA/RNA cis-regulatory modules associated with dynamic gene expression within cancers. Remarkably, in more than 82% of cancers, modules substantially improve survival stratification compared with conventional clinical factors and prominent genomic aberrations. The prognostic potential of cancer modules generalizes to external cohorts better than conventionally used single-gene features. Finally, a machine-learning framework demonstrates the combined predictive power of multiple modules, yielding prognostic models that perform substantially better than existing histopathological and clinical factors in common use.

Keywords: cancer genomics; cancer regulatory networks; precision oncology; prognostic cancer biomarkers.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Figure 1**
Expression changes are more prognostic than copy-number aberrations or gene mutations. (A) Proportion of genes prognostic based on their copy-number aberrations (purple), mutation statuses (black), and expression changes (yellow) in each cohort (y axis) for both overall survival and progression-free interval survival. (B) Comparisons between strengths of prognosis conveyed by gene expression (y axis) and genomic aberrations (x axis) across all cohorts are visualized as a heatmap scatterplot (density indicated). Within cohorts, absolute values of standardized significance (Wald statistic) of each stratification are scaled, and their distributions for prognosis utilizing gene expression (y axis) and genomic aberrations (mutation or copy number; x axis) are plotted. Only genes that are prognostic for at least one of the three features (mutation, copy number, or expression) are included. (C) Schematic for the quantification of the module perturbation score (MPS). Mutual information is used to quantify the degree to which module membership is informative of gene expression levels in patient samples, which is then signed by the Pearson correlation coefficient between them to yield the MPS (STAR Methods). (D) Kaplan-Meier (KM) plot shows that patients with pancreatic cancer with positive perturbation scores (red) of a module corresponding to genes up-regulated in cell lines that harbor mutations at the TP53 locus relative to cell lines that are wild type for TP53 (MSigDB, M2698; 198 genes) have worse overall survival (OVS) than patients with negative perturbation scores (blue). Statistics (p value and number of samples in the two groups) are indicated (STAR Methods). A KM plot of patients stratified based on TP53 mutation status in the pancreatic cancer cohort is also shown (black and gray lines). (E) The log₂ ratio of the absolute standardized significance of modules associated with cancer drivers and measurements on genes encoding these cancer drivers (in rows) is visualized in 19 cancers from TCGA. Standardized significance (Wald statistic) for individual genes was chosen to be the maximum from expression-, copy-number-, and mutation-based patient stratifications in each cohort. For the corresponding modules, the standardized significance scores were summarized using Stouffer’s method (STAR Methods). See also Figures S1–S8.
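
The score described in panel (C) can be illustrated with a minimal sketch. It assumes one score per patient sample, expression values discretized into equal-frequency bins, and mutual information measured in bits; the binning scheme, the bin count, and all function names are illustrative assumptions, not the procedure given in the STAR Methods.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # Sum of p(x,y) * log2( p(x,y) / (p(x) * p(y)) ) over observed pairs.
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def equal_frequency_bins(values, n_bins):
    """Assign each value a bin label 0..n_bins-1 by rank (assumed scheme)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def module_perturbation_score(expression, in_module, n_bins=3):
    """Mutual information between membership and binned expression,
    signed by the Pearson correlation between membership and expression."""
    mi = mutual_information(in_module, equal_frequency_bins(expression, n_bins))
    r = pearson(in_module, expression)
    return 0.0 if r == 0 else math.copysign(mi, r)

# Toy sample: z-scored expression of nine genes; the first three form the module.
expression = [2.1, 1.8, 1.5, 0.1, -0.2, 0.0, -1.4, -1.9, -2.2]
activated = module_perturbation_score(expression, [1, 1, 1, 0, 0, 0, 0, 0, 0])
repressed = module_perturbation_score(expression, [0, 0, 0, 0, 0, 0, 1, 1, 1])
```

In this toy sample the module made of the three most highly expressed genes receives a positive score, and the module made of the three least expressed genes receives a negative score of the same magnitude.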

**Figure 2**
Systematic discovery of prognostic cancer modules. (A) Schematic for discovering prognostic cancer modules (PCMs). For every module, z-scored transcriptome-wide data (heatmap) are systematically transformed into MPS across samples (heatmap; bottom). Patients with significant module activation (repression) have positive (negative) MPS values and correspond to transcriptomes in which genes in the module are activated (repressed); they are labeled MPS⁺ (MPS⁻). Patient samples are stratified into MPS⁺ and MPS⁻ groups to quantify survival differences. Modules whose perturbations resulted in patient stratification with significantly different survival trajectories are considered to be prognostic (STAR Methods). Confounding effects of conventional clinical and histopathological factors are controlled by jointly modeling effects of these covariates along with MPS on patient survival. The predictive performance of these PCMs is evaluated on independent external cohorts. (B) Patients with stomach cancer with significant module activation (MPS⁺; red) for genes harboring at least one instance of the binding site for POU1F1 (MSigDB, M15591; 233 genes) have worse OVS than patients with significant module repression (MPS⁻; blue). (C) Patients with melanoma with significant module activation (MPS⁺; red) for genes carrying at least one instance of RBM28 binding sites in their 3′ UTRs (CISBP-RNA; 1,595 genes) have worse progression-free interval survival (PFS) compared with samples with significant module repression (MPS⁻; blue). For KM plot comparisons, statistics (median survival times, log₂ hazard ratio, and p value) are indicated, and survival of the rest of the samples is shown in gray. (D) Standardized significance of MPS-based patient survival (Wald statistic) of regulator-based modules is shown for OVS (left panel) and PFS (right panel). Regulator-based modules recurrently prognostic in 3 or more cancers are grouped together, and each row corresponds to the exemplar module within a cluster (STAR Methods). For each module, patients in a cohort were stratified into MPS⁺ and MPS⁻ groups to quantify survival differences between the two groups. Positive (or negative) values indicate better (or worse) survival of patients in the MPS⁺ group. (E) The log₂ ratio of the absolute standardized significance of module perturbations associated with regulators and measurements on their corresponding single genomic loci (in rows) is visualized for OVS (left panel) and PFS (right panel). For single-locus measurements, standardized significance (Wald statistic) was chosen to be the maximum from expression-, copy-number-, or mutation-based patient stratifications in each cohort, and for their associated modules, standardized significance scores were summarized using Stouffer’s method (STAR Methods). See also Figures S13–S15.
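
The comparison in panel (E) combines module-level z scores with Stouffer's method and contrasts them with the strongest single-locus score. The sketch below assumes the unweighted form of Stouffer's method and takes the single-locus maximum over absolute values; both choices, the function names, and the toy numbers are assumptions for illustration only.

```python
import math

def stouffer_z(z_scores):
    """Unweighted Stouffer combination: sum of z scores divided by sqrt(k)."""
    if not z_scores:
        raise ValueError("at least one z score is required")
    return sum(z_scores) / math.sqrt(len(z_scores))

def log2_abs_ratio(module_z, locus_z):
    """log2 of |module significance| over |single-locus significance|."""
    return math.log2(abs(module_z) / abs(locus_z))

# Hypothetical Wald statistics for four modules linked to one regulator.
module_z = stouffer_z([2.0, 3.0, 1.0, 2.0])
# Hypothetical single-locus statistics: expression, copy number, mutation.
locus_z = max(abs(z) for z in [1.0, -0.5, 2.0])
ratio = log2_abs_ratio(module_z, locus_z)
```

A positive ratio means the module-level signal is stronger than the best single-locus signal; here the combined module score is 4.0 against a single-locus maximum of 2.0, giving a log₂ ratio of 1.0.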

**Figure 3**
Patient survival stratification based on *de novo*-discovered PCMs and conserved prognosis in independent cancer cohorts. (A) Patients with breast cancer with significant module activation (MPS⁺; red) for transcripts harboring at least one instance of the DNA motif HWRTNACGH (logo shown; 2,370 genes) within the first 1 kb of their promoters showed worse OVS than patients with significant module repression (MPS⁻; blue). (B) Patients with prostate cancer with significant module activation (MPS⁺; red) for transcripts harboring at least one instance of the DNA motif DTTTMCAM (logo shown; 3,779 genes) within the first 1 kb of their promoters showed better PFS compared with patients with significant module repression (MPS⁻; blue). (C) Patients with stomach cancer with significant module activation (MPS⁺; red) for transcripts harboring at least one instance of the linear RNA motif WSUUCAMR (logo shown; 1,872 genes) within the first 1 kb of their 3′ UTRs showed worse OVS compared with patients with significant module repression (MPS⁻; blue). (D) Patients with colon cancer with significant module activation (MPS⁺; red) for transcripts harboring at least one instance of the structural RNA motif (logo and putative secondary structure indicated; 399 genes) within the first 1 kb of their 3′ UTRs showed better OVS than patients with significant module repression (MPS⁻; blue). A select list of significant Gene Ontology terms enriched in each PCM is shown (bottom panel). For all KM plot comparisons, statistics (median survival times, log₂ hazard ratio, and p value) are indicated, and survival of the rest of the samples is shown in gray. For visualization, the time axis of KM curves is trimmed when the percentage of samples in MPS⁺ or MPS⁻ groups falls below 5%. (E) *De novo cis*-regulatory PCMs that are recurrently prognostic in 3 or more TCGA cohorts co-cluster based on the similarities in their module memberships (modified Jaccard score; STAR Methods). Heatmaps show module co-clustering probabilities with six broad clusters (color key indicated) revealed by consensus clustering. A selected set of significant Gene Ontology terms associated with genes that are common to at least 75% of the modules in each cluster, as well as prominent tumor suppressors (italicized) and oncogenes in this list, is tabulated. p values indicating over-representation of GO terms (hypergeometric test) are indicated. (F) Percentage of modules based on their perturbation scores (orange bars) or individual genes based on their expression (blue bars) that are consistently prognostic in tissue-matched independent cohorts (STAR Methods). p values indicate the significance of overlap for the modules (STAR Methods). (G) Distributions of area under the receiver operating characteristic curves (AUC) are shown for MPS (orange) and single genes (blue) to predict patient prognosis on tissue-matched independent cohorts (STAR Methods). The p values for comparisons between them (one-sided Mann-Whitney test; ∗∗∗p < 10⁻⁵) are indicated. See also Figures S16–S19.
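
The over-representation p values mentioned in panel (E) come from a hypergeometric test. A minimal sketch of the upper-tail calculation follows; the function name, argument names, and the toy counts are assumptions, and the gene background actually used is not stated in this caption.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, drawn, overlap):
    """P(X >= overlap) when `drawn` genes are sampled without replacement
    from `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    tail = sum(comb(annotated, k) * comb(population - annotated, drawn - k)
               for k in range(overlap, upper + 1))
    return tail / total

# Toy example: 10 genes in the background, 4 annotated with a term,
# 3 genes shared by a cluster of modules, 2 of which carry the term.
p_value = hypergeom_enrichment_p(population=10, annotated=4, drawn=3, overlap=2)
```

For these toy counts the tail probability is (36 + 4) / 120, or one third, so the overlap would not count as enriched.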

**Figure 4**
PCMs in clinically relevant patient sub-groups. (A) Female patients with glioblastoma multiforme (GBM) with significant module activation (MPS⁺; red) for genes involved in chromatin organization (272 genes) showed better OVS than patients with significant module repression (MPS⁻; blue). Also shown are survival curves for MPS⁺ and MPS⁻ male patients with GBM (dashed lines). Number of patients is indicated in parentheses. (B) Patients with stage II/IIA/IIB breast cancer with significant module activation (MPS⁺; red) for transcripts harboring at least one instance of the RNA motif URUAMGGD (logo shown; 1,082 genes) within the first 1 kb of their 3′ UTRs showed worse OVS than samples with significant module repression (MPS⁻; blue). (C) Volcano plot shows modules associated with Gene Ontology annotations and pathways (gray dots), regulator binding sites (green dots), and *de novo*-discovered DNA-motif-based modules (orange dots) that are clinically prognostic for PFS (hazard ratio: x axis; p value: y axis) in patients with KRAS-mutated lung adenocarcinoma. (D) Patients with head and neck cancer and sarcoma with mutated TP53 and significant module activation (MPS⁺; red) for genes annotated to be involved in mRNA processing (243 genes) have worse OVS than patients with significant module repression (MPS⁻; blue). For the KM plots, statistics of the comparison (median survival times, log₂ hazard ratio, and p value) are indicated. See also Figures S23 and S24.
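
The KM curves and median survival times reported in these panels rest on the Kaplan-Meier product-limit estimator. The sketch below is a generic textbook version, not the authors' code; the function names and the toy follow-up data are assumptions, and exact fractions are used only to keep the arithmetic free of rounding.

```python
from fractions import Fraction

def kaplan_meier(times, events):
    """Product-limit estimate. events[i] is 1 for an observed event and 0 for
    censoring. Returns [(time, survival)] at each distinct event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = Fraction(1)
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events:
            survival *= Fraction(at_risk - n_events, at_risk)
            curve.append((t, survival))
        at_risk -= n_removed
    return curve

def median_survival(curve):
    """Earliest time at which estimated survival is at or below 50%."""
    for t, s in curve:
        if s <= Fraction(1, 2):
            return t
    return None  # median not reached within follow-up

# Toy group of six patients (hypothetical months of follow-up).
curve = kaplan_meier([5, 8, 12, 12, 20, 25], [1, 1, 0, 1, 1, 0])
median = median_survival(curve)
```

Applying the same function separately to the MPS⁺ and MPS⁻ groups would give the two curves that each panel compares.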

**Figure 5**
Models based on PCMs are predictive of survival beyond conventionally used clinical factors. (A) Schematic for quantifying the combined predictive power of multiple PCMs and its relative strength compared with clinical factors in common use. (B and C) KM plots show patient stratification based on risk predictions (high risk: purple; low risk: green) from a random survival forest model trained on PCMs (see STAR Methods) in (B) head and neck cancer (OVS) and (C) low-grade glioma (PFS). Statistics of the survival comparisons and number of patients are indicated. (D) Comparisons of random survival forest performance for predicting PFS in models trained on perturbation scores of PCMs (orange) and standard clinical factors (blue) across TCGA cancer cohorts. Each cross corresponds to the median hazard ratio from 10 different instances of 10-fold cross-validated models, and p values for comparisons between hazard ratios from the two models are indicated for each cohort (one-sided Mann-Whitney test; ∗∗∗p < 10⁻⁵; ∗∗p < 10⁻⁴; ∗p < 5 × 10⁻³). (E) Comparisons of random survival forest performance for predicting OVS in models trained on perturbation scores of PCMs (orange) and prominent genomic aberrations (SNVs and CNAs) (green) across TCGA cancer cohorts. Each cross corresponds to the median hazard ratio from 10 different instances of 10-fold cross-validated models, and p values for comparisons between hazard ratios from the two models are indicated for each cohort (one-sided Mann-Whitney test; ∗∗∗p < 10⁻⁵; ∗∗p < 10⁻⁴; ∗p < 5 × 10⁻³). (F–H) KM plots show patient survival stratification based on risk predictions (low risk: solid lines; high risk: dotted lines) from one of the 10 instances of random survival forest models. Predictions from this model trained using conventional clinical features and perturbation scores of PCMs are in dark red, while predictions of the comparable model without PCMs are in gray. Survival curves and associated statistics (p value and hazard ratio indicated) are for the instance with the largest difference in hazard ratios between the two random survival forest models. Survival comparisons are made using the same number of patients in each risk group, and the total number of patients in each model (n) is indicated. (F) In patients with breast cancer (n = 500), KM plots show patient stratification for OVS using standard clinical factors augmented by perturbation scores of PCMs. (G) In patients with sarcoma (n = 100), KM plots show patient stratification for OVS using standard clinical features and SNVs augmented by perturbation scores of PCMs. (H) In patients with pancreatic cancer (n = 100), KM plots show patient stratification for PFS using standard clinical features and CNAs augmented by perturbation scores of PCMs. See also Figures S25, S28, and S30.
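
The evaluation scheme in panels (D) and (E), a median over 10 instances of 10-fold cross-validation, can be sketched as follows. Only the resampling skeleton is shown: `score_instance` is a hypothetical callback standing in for training a random survival forest on each set of folds and returning one hazard ratio per instance, and the seeding scheme is likewise an assumption.

```python
import random
from statistics import median

def repeated_k_fold(n_samples, n_repeats=10, n_folds=10):
    """Yield one list of folds per repeat; each fold is a list of sample
    indices, and the folds of a repeat partition range(n_samples)."""
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        random.Random(repeat).shuffle(order)  # a different shuffle per instance
        yield [order[f::n_folds] for f in range(n_folds)]

def median_over_instances(n_samples, score_instance, n_repeats=10, n_folds=10):
    """Score every cross-validated instance and report the median score."""
    scores = [score_instance(folds)
              for folds in repeated_k_fold(n_samples, n_repeats, n_folds)]
    return median(scores)

# Placeholder scorer (hypothetical): returns the size of the first fold
# where a real scorer would fit models and compute a hazard ratio.
summary = median_over_instances(25, lambda folds: float(len(folds[0])))
```

Reporting the median across instances reduces the influence of any single random partition of the cohort on the plotted value.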

Grants and funding

R01 CA257153/CA/NCI NIH HHS/United States
