Int J Mol Sci. 2024 Oct 22;25(21):11356.
doi: 10.3390/ijms252111356.

Unmasking Neuroendocrine Prostate Cancer with a Machine Learning-Driven Seven-Gene Stemness Signature That Predicts Progression

Agustina Sabater et al. Int J Mol Sci.

Abstract

Prostate cancer (PCa) poses a significant global health challenge, particularly due to its progression into aggressive forms like neuroendocrine prostate cancer (NEPC). This study developed and validated a stemness-associated gene signature using advanced machine learning techniques, including Random Forest and Lasso regression, applied to large-scale transcriptomic datasets. The resulting seven-gene signature (KMT5C, DPP4, TYMS, CDC25B, IRF5, MEN1, and DNMT3B) was validated across independent cohorts and patient-derived xenograft (PDX) models. This signature demonstrated strong prognostic value for progression-free, disease-free, relapse-free, metastasis-free, and overall survival. Importantly, the signature not only identified specific NEPC subtypes, such as large-cell neuroendocrine carcinoma, which is associated with very poor outcomes, but also predicted a poor prognosis for PCa cases that exhibit this molecular signature, even when they were not histopathologically classified as NEPC. This dual prognostic and classifier capability makes the seven-gene signature a robust tool for personalized medicine, providing a valuable resource for predicting disease progression and guiding treatment strategies in PCa management.

Keywords: gene signature; large cell neuroendocrine carcinoma; machine learning; neuroendocrine transdifferentiation; prognosis; prostate cancer; stemness.

Conflict of interest statement

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Figures

Figure 1
Stemness-associated gene expression changes in PCa patient samples using multiple public datasets. (A) Schematic representation of gene selection, transcriptomics, and survival analyses to define potential prognostic biomarkers. (B) (i) Volcano plots showing the results of the differential expression analyses of all available genes within the included transcriptomics datasets. Red = significantly upregulated stemness-associated gene. Blue = significantly downregulated stemness-associated gene. Dark gray = stemness-associated gene without significant dysregulation. Light gray = other genes available in the dataset. (ii) Summary heatmap of the transcriptomics analyses performed in multiple publicly available datasets (n = 1259). Genes of interest and the results of the differential expression analysis for each dataset are displayed. Each row represents the results of a specific comparison. Annotation depicts the absolute number (#) of comparisons in which each gene is upregulated (red) or downregulated (blue). Red = significantly upregulated gene. Blue = significantly downregulated gene. White = no significant change. Gray = not available. Datasets: GSE35988 (n = 122); GSE3933 (n = 103); GSE46602 (n = 50); GSE6956 (n = 87); GSE70768 (n = 179); TCGA-PRAD (n = 548); GSE21034 (n = 150). Statistical significance was set at adjusted p-value < 0.05.
Figure 2
Uni- and multivariable survival analyses. (A) (i) Examples of Kaplan–Meier (KM) curves depicting the association of each gene to the risk of event (purple = high expression of a gene; green = low expression of a gene). HR: Hazard Ratio; Cox p: p-value from the Cox proportional hazards model. Log-rank p: p-value of the log-rank test. (ii) Summary heatmap of the univariable survival analyses performed on multiple datasets. (B) (i) Examples of forest plots depicting the association of each gene to the risk of event adjusted for all available covariates using the TCGA-PRAD dataset. (ii) Summary heatmap of the multivariable survival analyses performed on multiple datasets. Red boxes indicate that high gene expression is associated with a higher risk of an event (HR > 1 and Cox p < 0.05), blue boxes indicate that high gene expression is associated with a lower risk of survival-related events (HR < 1 and Cox p < 0.05), and white boxes indicate that there are no significant associations between gene expression and risk of an event. Gray = gene not available. Patients were stratified by the optimal cutoff for each gene, calculated using the Cutoff Finder tool. All comparisons consider low-expression patients as the reference group. Annotation depicts the absolute number (#) of comparisons in which high expression of each gene is associated with high (red) or low (blue) risk. OS: overall survival; DSS: disease-specific survival; PFS: progression-free survival; RFS: relapse-free survival; MFS: metastasis-free survival. Datasets: TCGA-PRAD (n = 497 PFS, n = 337 DFS); GSE70768 (n = 111 RFS); GSE70769 (n = 92 RFS); GSE116918 (n = 248 RFS and MFS); GSE16560 (n = 281 OS). Statistical significance was set at Cox p < 0.05. ** Cox p < 0.01; *** Cox p < 0.001.
Figure 3
Random Forest machine learning algorithm for the selection of prognostic candidates. (A) Heatmap summarizing the relative importance of the variables (genes) for all training datasets. The relative importance was converted into percentiles, where 1 represents maximum relative importance (red) and 0 indicates minimum relative importance (blue). Gray = gene not available in the dataset. The 15 top-ranked genes (purple box) were selected as candidates for our stemness-associated risk signature. (B) (i) Example of Kaplan–Meier (KM) curve using the TCGA-PRAD dataset depicting the association of the seven-gene score to the risk of progression (purple = high seven-gene score; green = low seven-gene score). The coefficients for each gene were calculated by Lasso regression using TCGA-PRAD data, and the seven-gene score was constructed as follows: 0.284 × KMT5C + 0.272 × MEN1 + 0.218 × TYMS + 0.090 × IRF5 + 0.083 × DNMT3B + 0.048 × CDC25B − 0.060 × DPP4. Patients were stratified by the median of the score. HR: Hazard Ratio; p-value: p-value from the Cox proportional hazards model. Log-rank p: p-value of the Log-rank test. (ii) Summary forest plot displaying the survival analysis of the association of the seven-gene signature with the risk of disease-progression events in the training datasets. Patients’ survival was analyzed by either stratification by the median of the seven-gene score (circles) or taking the seven-gene score as a continuous variable (squares). Red corresponds to statistically significant associations (Cox p < 0.05) and gray corresponds to non-significant associations. On the right, heatmap depicting the concordance index value for each of the analyses. The concordance index is a performance measure of the signature within each dataset. Cox p: p-value of the Cox regression coefficient. HR = Hazard Ratio. (95% CI) = 95% Confidence Interval.
PFS: progression-free survival; DFS: disease-free survival; RFS: relapse-free survival; OS: overall survival; MFS: metastasis-free survival. Statistical significance was set at Cox p < 0.05 (red). * Cox p < 0.05; ** Cox p < 0.01; *** Cox p < 0.001; **** Cox p < 0.0001.
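The score formula and median split described in this caption can be illustrated with a minimal Python sketch. The coefficients are the ones reported above; the function names, the toy cohort, the expression scale, and the choice to place patients exactly at the median in the low group are assumptions made for illustration only and are not taken from the paper.

```python
from statistics import median

# Lasso coefficients as reported in the caption (fitted on TCGA-PRAD).
COEFFICIENTS = {
    "KMT5C": 0.284,
    "MEN1": 0.272,
    "TYMS": 0.218,
    "IRF5": 0.090,
    "DNMT3B": 0.083,
    "CDC25B": 0.048,
    "DPP4": -0.060,
}


def seven_gene_score(expression):
    """Weighted sum of the seven genes' expression values for one patient."""
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


def stratify_by_median(scores):
    """Label each patient 'high' or 'low' relative to the cohort median score.

    Assumption: a score exactly equal to the median is labeled 'low'.
    """
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical toy cohort: every gene has the same expression value per patient.
cohort = {
    "P1": {gene: 1.0 for gene in COEFFICIENTS},
    "P2": {gene: 2.0 for gene in COEFFICIENTS},
    "P3": {gene: 3.0 for gene in COEFFICIENTS},
    "P4": {gene: 4.0 for gene in COEFFICIENTS},
}
scores = {pid: seven_gene_score(expr) for pid, expr in cohort.items()}
groups = stratify_by_median(scores)
```

Because DPP4 is the only gene with a negative coefficient, higher DPP4 expression lowers the score while higher expression of the other six genes raises it.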
Figure 4
Gene signature’s performance across external validation datasets. (A) (i) Kaplan–Meier curves depicting the association of the seven-gene score to the risk of disease-progression events included in the validation datasets. (ii) Kaplan–Meier curve depicting the association of the seven-gene score to the risk of death of metastatic PCa patients from the SU2C dataset. The coefficients for each gene were calculated by Lasso regression using TCGA-PRAD data, and the seven-gene score was calculated as follows: 0.284 × KMT5C + 0.272 × MEN1 + 0.218 × TYMS + 0.090 × IRF5 + 0.083 × DNMT3B + 0.048 × CDC25B − 0.060 × DPP4. Patients were stratified by the median of the score. HR: Hazard Ratio; Cox p: p-value from the Cox proportional hazards model. Log-rank p: p-value of the Log-rank test. (iii) Summary forest plot displaying the survival analysis of the association of the seven-gene signature with the risk of disease-progression events in the validation datasets. Patients’ survival was analyzed by either stratification by the median of the seven-gene score (circles) or taking the seven-gene score as a continuous variable (squares). On the right, heatmap depicting the concordance index value for each of the analyses. The concordance index is a performance measure of the signature within each dataset. RFS: relapse-free survival; OS: overall survival. (B) Forest plots depicting the association of each gene to the risk of event adjusted for all available covariates within each validation dataset. Red corresponds to statistically significant associations (Cox p < 0.05) and gray corresponds to non-significant associations. Cox p: p-value of the Cox regression coefficient. HR = Hazard Ratio; [95% CI] = 95% Confidence Interval. Datasets: GSE54460 (n = 106); GSE94767 (n = 233); DKFZ (n = 81); SU2C-PCF (n = 81). Statistical significance was set at Cox p < 0.05 (red). * Cox p < 0.05; ** Cox p < 0.01; *** Cox p < 0.001; **** Cox p < 0.0001.
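To make the concordance index mentioned in this caption concrete, the following is a rough Python sketch of Harrell's formulation. The caption does not state which estimator the authors used, so Harrell's C, the handling of ties, and the toy follow-up data are all assumptions for illustration.

```python
def concordance_index(times, events, scores):
    """Harrell's C: fraction of comparable patient pairs ranked correctly.

    A pair is comparable when the patient with the shorter follow-up time
    had an event. It is concordant when that patient also has the higher
    risk score. Tied scores count as half; pairs with tied times are
    skipped in this simplified sketch.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Hypothetical toy data: follow-up time, event indicator (1 = event), risk score.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [3.0, 2.5, 1.0, 0.5])
reversed_ranking = concordance_index(times, events, [0.5, 1.0, 2.5, 3.0])
uninformative = concordance_index(times, events, [1.0, 1.0, 1.0, 1.0])
```

A value of 1 means the score orders every comparable pair correctly, 0.5 is what an uninformative score yields, and values below 0.5 indicate a ranking that runs against the observed outcomes.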
Figure 5
Transcriptomic analysis of the MDA PCa PDX series. (A) Schematic representation of the MDA PCa PDX series establishment and transcriptome analysis (n = 44) (created with BioRender.com). (B) (i) Heatmap depicting unsupervised clustering analysis of RNA-seq data from the 44 MDA PCa PDX series considering the expression of the seven-gene signature (KMT5C, MEN1, TYMS, IRF5, DNMT3B, CDC25B, and DPP4). Red, white, and blue represent greater, intermediate, and lower gene expression levels. (ii) Violin plot showing the seven-gene score levels in no-NEPC and NEPC samples from the MDA PCa PDX series. (iii) Violin plots showing the expression levels (FPKM) of the genes included in the seven-gene score in no-NEPC and NEPC samples from the MDA PCa PDX series. (C) (i) PCA biplot considering the expression of the seven-gene signature using the MDA PCa PDX data assessed by RNA-seq. Each point represents one PDX. Samples are colored according to the histopathological classification: adenocarcinoma (red), sarcomatoid (beige), and neuroendocrine (purple). (ii) Bar plot showing the contribution (%) of each gene in the signature to the variance in the PC1 from the PCA. The red dashed line depicts the expected average contribution if all genes weighed the same (value = 14.29%). (D) ROC curve showing the performance of the seven-gene score in classifying MDA PCa PDX series as NEPC. 95% CI = 95% Confidence Interval. Statistical significance was calculated using the Mann–Whitney test and was set at p < 0.05.
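The 14.29% reference line in panel C(ii) is simply 100% divided by the seven genes. The short Python sketch below shows that arithmetic together with one common convention for per-gene contributions, namely each gene's squared PC1 loading as a share of the total. The caption does not state which formula the authors used, so this convention and the toy loading values are assumptions.

```python
GENES = ["KMT5C", "MEN1", "TYMS", "IRF5", "DNMT3B", "CDC25B", "DPP4"]


def contributions_percent(loadings):
    """Percent contribution of each gene to a principal component.

    Assumed convention: squared loading divided by the sum of squared
    loadings, times 100, so the contributions always add up to 100.
    """
    total = sum(w * w for w in loadings.values())
    return {gene: 100.0 * w * w / total for gene, w in loadings.items()}


# Expected contribution if all seven genes weighed the same.
expected_uniform = 100.0 / len(GENES)

# Hypothetical PC1 loadings, for illustration only.
toy_loadings = dict(zip(GENES, [0.6, 0.5, 0.4, 0.3, 0.2, 0.2, -0.2]))
contrib = contributions_percent(toy_loadings)

# Genes above the dashed line contribute more than the uniform expectation.
above_line = [gene for gene, c in contrib.items() if c > expected_uniform]
```

With the toy loadings, only the genes with the largest absolute loadings exceed the uniform expectation; the sign of a loading does not affect its contribution.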
Figure 6
Clinical validation in NEPC samples. (A) (i) Heatmap depicting an unsupervised clustering analysis of RNA-seq data from the MDA PCa PDX series considering the expression of the 70-gene signature proposed by Beltran et al. [12]. (ii) Heatmap depicting an unsupervised clustering analysis of RNA-seq data from the MDA PCa PDX series considering the expression of the 70-gene signature proposed by Beltran et al. plus the 7 genes (KMT5C, MEN1, TYMS, IRF5, DNMT3B, CDC25B, and DPP4) from the risk score model proposed in our work. (B) (i) Heatmap depicting an unsupervised clustering analysis of RNA-seq data from human patients in the Beltran et al. dataset (n = 49) [12] considering the expression of the seven-gene signature. Red, white, and blue represent greater, intermediate, and lower gene expression levels. Expression values are presented as z-scores. (ii) Violin plot showing seven-gene score levels in Castration-Resistant Prostate Cancer-Adenocarcinoma (CRPC-Adeno) and CRPC-Neuroendocrine (NE) samples from the Beltran et al. dataset. (C) (i) Violin plot showing risk score levels in samples from the Beltran et al. dataset according to the histological classification: prostate adenocarcinoma without NE differentiation, prostate adenocarcinoma with NE differentiation >20%, small-cell carcinoma, large-cell NE carcinoma, and mixed small-cell carcinoma–adenocarcinoma. (ii) ROC curve showing the performance of the seven-gene score in classifying PCa patient samples from the Beltran et al. dataset as Large-Cell NEPC. 95% CI = 95% Confidence Interval. Statistical significance was calculated using the Mann–Whitney test or ANOVA followed by Tukey’s test and was set at p < 0.05.
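As a rough illustration of how a continuous score is evaluated as a classifier in a ROC analysis, the Python sketch below computes the area under the ROC curve directly from its rank interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, which equals the Mann–Whitney U statistic divided by the number of positive-negative pairs. The function name and the toy scores and labels are hypothetical, and no confidence interval is computed here.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the positive class (for example, large-cell NEPC),
    0 otherwise. Tied scores between a positive and a negative count as half.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy example: five samples, two of them positive.
toy_scores = [0.2, 0.4, 0.6, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 1]
auc = roc_auc(toy_scores, toy_labels)
```

In the toy data, five of the six positive-negative pairs are ordered correctly, so the AUC is 5/6.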

References

    Bray F., Laversanne M., Sung H., Ferlay J., Siegel R.L., Soerjomataram I., Jemal A. Global Cancer Statistics 2022: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J. Clin. 2024;74:229–263. Available online: https://acsjournals.onlinelibrary.wiley.com/doi/10.3322/caac.21834 (accessed on 18 July 2024).
    Beltran H., Rickman D.S., Park K., Chae S.S., Sboner A., MacDonald T.Y., Wang Y., Sheikh K.L., Terry S., Tagawa S.T., et al. Molecular Characterization of Neuroendocrine Prostate Cancer and Identification of New Drug Targets. Cancer Discov. 2011;1:487–495. doi: 10.1158/2159-8290.CD-11-0130.
    Robinson D., Van Allen E.M., Wu Y.-M., Schultz N., Lonigro R.J., Mosquera J.-M., Montgomery B., Taplin M.-E., Pritchard C.C., Attard G., et al. Integrative Clinical Genomics of Advanced Prostate Cancer. Cell. 2015;161:1215–1228. doi: 10.1016/j.cell.2015.05.001.
    Liu C., Kelnar K., Liu B., Chen X., Calhoun-Davis T., Li H., Patrawala L., Yan H., Jeter C., Honorio S., et al. The microRNA miR-34a Inhibits Prostate Cancer Stem Cells and Metastasis by Directly Repressing CD44. Nat. Med. 2011;17:211–215. doi: 10.1038/nm.2284.
    Al Salhi Y., Sequi M.B., Valenzi F.M., Fuschi A., Martoccia A., Suraci P.P., Carbone A., Tema G., Lombardo R., Cicione A., et al. Cancer Stem Cells and Prostate Cancer: A Narrative Review. Int. J. Mol. Sci. 2023;24:7746. doi: 10.3390/ijms24097746.
