Cancers (Basel). 2022 Apr 24;14(9):2111.
doi: 10.3390/cancers14092111.

A Novel Machine Learning 13-Gene Signature: Improving Risk Analysis and Survival Prediction for Clear Cell Renal Cell Carcinoma Patients

Patrick Terrematte et al. Cancers (Basel).

Abstract

Patients with clear cell renal cell carcinoma (ccRCC) have poor survival outcomes, especially when the disease has metastasized. It is of paramount importance to identify biomarkers in genomic data that could help predict the aggressiveness of ccRCC and its resistance to drugs. We therefore conducted a study to evaluate existing gene signatures and to propose a novel one with higher predictive power and better generalization than the previous signatures. Using the ccRCC cohorts of The Cancer Genome Atlas (TCGA-KIRC) and the International Cancer Genome Consortium (ICGC-RECA), we evaluated linear Cox regression survival models with 14 signatures and six feature selection methods, and performed functional and differential gene expression analyses. We established a 13-gene signature (AR, AL353637.1, DPP6, FOXJ1, GNB3, HHLA2, IL4, LIMCH1, LINC01732, OTX1, SAA1, SEMA3G, ZIC2) whose expression levels can predict distinct outcomes of patients with ccRCC, and we compared it with other signatures from the literature. The best-performing gene signature was obtained with the ensemble method Minimum Redundancy Maximum Relevance (mRMR). This signature has distinctive features compared with the others: it generalizes across cohorts and is functionally enriched in significant pathways (urothelial carcinoma, chronic kidney disease, transitional cell carcinoma, and nephrolithiasis). Of the 13 genes in our signature, eight are known to be correlated with ccRCC patient survival and four are immune-related. Our model achieved an area under the receiver operating characteristic curve (ROC AUC) of 0.82 and generalized well between the cohorts. Our findings revealed two clusters of genes, one with high expression (SAA1, OTX1, ZIC2, LINC01732, GNB3, and IL4) and one with low expression (AL353637.1, AR, HHLA2, LIMCH1, SEMA3G, DPP6, and FOXJ1), both of which are correlated with poor prognosis. This signature can potentially be used in clinical practice to support patient treatment care and follow-up.
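The abstract names mRMR as the feature selection method behind the signature but does not describe its mechanics. The following is a minimal sketch of the general greedy idea (relevance to the outcome minus mean redundancy with already selected features, the "difference" form of mRMR), assuming features have already been discretized. It is not the authors' implementation, which the abstract describes as an ensemble method; the names `mutual_information`, `mrmr_select`, `gene_a`, `gene_b`, and `gene_c` and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, target, k):
    """Greedily pick up to k feature names by relevance minus mean redundancy.

    features: dict mapping a feature name to a list of discretized values.
    target:   list of discrete outcome labels, one per sample.
    """
    relevance = {
        name: mutual_information(values, target)
        for name, values in features.items()
    }
    selected = []
    remaining = set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: gene_b duplicates gene_a, gene_c is weakly informative.
toy_target = [0, 0, 0, 0, 1, 1, 1, 1]
toy_features = {
    "gene_a": [0, 0, 0, 1, 1, 1, 1, 1],
    "gene_b": [0, 0, 0, 1, 1, 1, 1, 1],
    "gene_c": [0, 1, 0, 0, 0, 1, 1, 1],
}
print(mrmr_select(toy_features, toy_target, 2))
```

In this toy case the duplicate `gene_b` is penalized for its redundancy with `gene_a`, so the weaker but less redundant `gene_c` is chosen second. Real expression data would need a discretization step or a continuous mutual information estimator, neither of which is shown here.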

Keywords: clear cell renal cell carcinoma (ccRCC); feature selection; gene signature; kidney cancer; machine learning; mutual information; prognosis; survival analysis.

Conflict of interest statement

The authors declare that they have no competing interests, and that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Figures

Figure A1
Number of papers published on PubMed per year, from a query performed in January 2021. The gene signatures published from 2015 to 2020 (in green) were initially selected for comparison. After applying the exclusion criteria, we obtained the 14 gene signatures.
Figure A2
Scatter plots of median gene expression comparing TCGA-KIRC and ICGC-RECA. (a) Raw counts. (b) log2(count + 1) normalization. (c) Variance-stabilizing transformation with DESeq2. (d) Box-Cox transformation. (e) Scaling between zero and one (with Caret R package and ‘range’ method). (f) Scaling between zero and one (with BBmisc R package and ‘range’ method).
Figure A3
Variable ranking, based on mutual information, of the 10 most important genes of the mRMR 13-gene signature of ccRCC. These are the most representative genes with respect to AJCC staging in the TCGA dataset.
Figure A4
Collinearity analysis with variance inflation factors for the 13-gene signature of ccRCC. None of the genes had a variance inflation factor ≥ 5, indicating no collinearity or redundancy in the signature.
Figure A5
Correlation analysis between genes of the mRMR 13-gene signature of ccRCC. No strong correlation (≥ 0.70) was found between genes or with the clinical data of age, overall survival status, and AJCC staging.
Figure A6
Density plot of the distribution of overall patient survival in TCGA-KIRC and ICGC-RECA. The dotted lines indicate the means of the distributions, and the solid lines indicate the prediction times used for internal and external validation. We restricted the prediction for TCGA-KIRC to 10 years to exclude outliers in the long tail of the density plot of overall patient survival. For the ICGC-RECA dataset, we kept a 7-year prediction in order to include all samples and to limit the prediction time to the range of this dataset's distribution for external validation.
Figure A7
Circular diagram of the mRMR gene signature and the sources of its genes: differential expression analysis (DEA), expression quantitative trait loci (eQTLs) in kidney cortex from the GTEx portal, and gene signatures from the literature.
Figure A8
Forest plot for the Cox proportional hazards model displaying the significant genes (AL353637.1, DPP6, FOXJ1, HHLA2, and SAA1). The statistical significance between comparisons is given by * p-value < 0.05, ** p-value < 0.01, and *** p-value < 0.001.
Figure A9
Analysis performed using the UALCAN portal with ccRCC data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) [50], available at http://ualcan.path.uab.edu/ (accessed on 1 March 2022). Z-values represent standard deviations from the median across samples for the given cancer type of ccRCC. The statistical significance between comparisons is given by * p-value < 0.05, ** p-value < 0.01, and *** p-value < 0.001. Comparison of protein expression by cancer stage for (a) AR, (b) GNB3, (c) HHLA2, (d) LIMCH1, and (e) SAA1.
Figure A10
Heatmap with hierarchical clustering combining the RNA-seq expression of patients in TCGA-KIRC and ICGC-RECA. Columns are genes of the mRMR signature. Rows indicate the RNA-seq expression of 590 patients of TCGA-KIRC and ICGC-RECA. Data from patients whose distant metastasis could not be assessed (MX) were removed in order to clarify the clustering.
Figure 1
Flowchart of the current study to obtain a gene signature based on the mutual information method Minimum Redundancy Maximum Relevance (mRMR). Datasets are indicated by cylinders, white rectangles represent analysis steps, and blue rectangles indicate the resulting figures and tables. TCGA-KIRC and ICGC-RECA are datasets of ccRCC.
Figure 2
Genes selected through mRMR. (a) Venn diagram of prefiltered gene sets. The total of 3284 prefiltered genes is given by the sets of DEA between non-metastatic versus metastatic (156), normal tissues versus primary tumor (1775), genes from the literature (221), significant eQTL genes (1259), and 124 genes overlapping in two or three intersections of sets. (b) Volcano plot of DEA comparing normal tissues versus primary tumor samples of TCGA-KIRC. Green marks the downregulated genes of normal tissues versus primary tumors (DPP6 and FOXJ1), red marks the upregulated genes (HHLA2, LINC01732, SAA1, AL353637.1, and ZIC2), and gray marks the non-significant genes with low fold change. (c) Volcano plot of DEA comparing non-metastatic versus metastatic samples. Red marks the upregulated genes (OTX1 and ZIC2).
Figure 3
Benchmark with internal and external validation. (a) Comparison of 14 gene signatures from the literature and 6 feature selection methods on 8 models for survival risk, showing the predicted AUC of survival outcome for 10-year prediction. (b) Boxplots of the results of each gene signature and feature selection method for 7-year prediction.
Figure 4
Survival risk predictions with the mRMR signature and dimensionality reduction. (a) Survival curves predicted for three equal-size strata of risk groups of the TCGA-KIRC dataset: higher risk (red), moderate risk (orange), and lower risk (green). (b) Dimensionality reduction of the genes from the mRMR signature, using principal component analysis. (c) Survival curves predicted in the validation on the ICGC-RECA dataset. (d) Principal component analysis of the ICGC-RECA dataset with the genes of the mRMR signature.
Figure 5
Aalen’s additive Cox regression model for censored data of the mRMR signature and the clinical features age and metastasis. (a) Dot-and-whisker plots of the estimated coefficients (β) with their z-scores, 95% confidence intervals, and p-values. (b) Curves of each term for the censored data in relation to time (days).
Figure 6
Gene enrichment analysis. (a) Heatmap of enriched terms and relationships of genes, displaying the fold change from the differential analysis of normal tissues versus primary tumors of TCGA-KIRC samples. (b) Enrichment analysis of gene-disease associations (GDAs) from the expert-curated databases of DisGeNET (v7.0).
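Two of the simpler operations named in the captions above can be illustrated directly: the log2(count + 1) and zero-to-one range scaling of Figure A2, and the split into three equal-size risk strata of Figure 4. The sketch below is a plain Python illustration under assumptions, not the R code the authors used (the captions mention the Caret and BBmisc packages); the function names and the example values are hypothetical.

```python
import math


def log2_counts(counts):
    """log2(count + 1) transform, as named in Figure A2 (b)."""
    return [math.log2(c + 1) for c in counts]


def range_scale(values):
    """Rescale values linearly to [0, 1], as named in Figure A2 (e) and (f)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def risk_tertiles(risk_scores):
    """Label each score 'lower', 'moderate', or 'higher' by rank.

    Group sizes are equal when the number of scores is divisible by three
    and differ by at most one otherwise.
    """
    n = len(risk_scores)
    order = sorted(range(n), key=lambda i: risk_scores[i])
    names = ("lower", "moderate", "higher")
    labels = [None] * n
    for rank, i in enumerate(order):
        labels[i] = names[rank * 3 // n]
    return labels


# Hypothetical example values.
print(log2_counts([0, 1, 3]))
print(range_scale([2, 4, 6]))
print(risk_tertiles([0.9, 0.1, 0.5, 0.3, 0.7, 0.2]))
```

How the risk score itself is obtained (for example, from a fitted Cox model) is outside the scope of this sketch.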

References

    1. Hsieh J.J., Purdue M.P., Signoretti S., Swanton C., Albiges L., Schmidinger M., Heng D.Y., Larkin J., Ficarra V. Renal Cell Carcinoma. Nat. Rev. Dis. Primers. 2017;3:17009. doi: 10.1038/nrdp.2017.9.
    2. Chen L., Xiang Z., Chen X., Zhu X., Peng X. A Seven-Gene Signature Model Predicts Overall Survival in Kidney Renal Clear Cell Carcinoma. Hereditas. 2020;157:38. doi: 10.1186/s41065-020-00152-y.
    3. Cui H., Shan H., Miao M.Z., Jiang Z., Meng Y., Chen R., Zhang L., Liu Y. Identification of the Key Genes and Pathways Involved in the Tumorigenesis and Prognosis of Kidney Renal Clear Cell Carcinoma. Sci. Rep. 2020;10:1–10. doi: 10.1038/s41598-020-61162-4.
    4. American Cancer Society. Facts & Figures: 2020 Edition. 2020. Available online: https://www.cancer.org/research/cancer-facts-statistics/all-cancer-facts.... (accessed on 1 March 2022).
    5. Padala S.A., Barsouk A., Thandra K.C., Saginala K., Mohammed A., Vakiti A., Rawla P., Barsouk A. Epidemiology of Renal Cell Carcinoma. World J. Oncol. 2020;11:79–87. doi: 10.14740/wjon1279.