Brief Bioinform. 2024 Nov 22;26(1):bbae665.
doi: 10.1093/bib/bbae665.

Bayesian unsupervised clustering identifies clinically relevant osteosarcoma subtypes

Sergio Llaneza-Lago et al. Brief Bioinform. 2024.

Abstract

Identification of cancer subtypes is a critical step for developing precision medicine. Most cancer subtyping is based on the analysis of RNA sequencing (RNA-seq) data from patient cohorts using unsupervised machine learning methods such as hierarchical cluster analysis, but these computational approaches disregard the heterogeneous composition of individual cancer samples. Here, we used a more sophisticated unsupervised Bayesian model termed latent process decomposition (LPD), which handles individual cancer sample heterogeneity and deconvolutes the structure of transcriptome data to provide clinically relevant information. The work was performed on the pediatric tumor osteosarcoma, which is a prototypical model for a rare and heterogeneous cancer. The LPD model detected three osteosarcoma subtypes. The subtype with the poorest prognosis was validated using independent patient datasets. This new stratification framework will be important for more accurate diagnostic labeling, expediting precision medicine, and improving clinical trial success. Our results emphasize the importance of using more sophisticated machine learning approaches (and of teaching deep learning and artificial intelligence) for RNA-seq data analysis, which may assist drug targeting and clinical management.

Keywords: RNA-seq; heterogeneity; latent process decomposition; osteosarcoma; precision medicine.

Figures

Figure 1
Latent process decomposition model optimization, subtype assignment and clinical outcome. (a) Hyperparameter optimization for the TARGET dataset. LPD assesses the explanatory power of different combinations of sigma values (process spread) and the number of processes. The optimal combination is determined as the point of maximum log-likelihood before the onset of overfitting, visually identified as a plateau in the curves. For the TARGET dataset, the optimal parameters were three processes and a sigma value of −0.0001. (b) Sample assignment to subtypes. The bar plot illustrates sample assignment to the three identified subtypes based on their degree of membership (gamma value). Higher gamma values indicate stronger membership in a specific subtype, reflecting the extent to which each subtype captures sample-specific transcriptomic variability. (c) Kaplan–Meier curves illustrate the survival probability over time for each subtype. Pairwise comparisons between subtypes are shown with log-rank P-values and sample sizes provided for each comparison.
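As a rough illustration of panel (b), the sketch below assigns each sample to the process with the largest gamma value. This is an assumption about how memberships might be turned into hard labels, not a description of the authors' pipeline; the function name `assign_subtypes` and the gamma values are hypothetical.

```python
def assign_subtypes(gamma):
    """Assign each sample to the process with the largest gamma value.

    gamma: dict mapping sample ID -> list of membership values, one per process.
    Returns a dict mapping sample ID -> (1-based process index, normalized membership).
    Assumption: a sample is labeled by its single strongest process.
    """
    assignments = {}
    for sample, values in gamma.items():
        total = sum(values)
        if total <= 0:
            raise ValueError(f"gamma values for {sample} must sum to a positive number")
        # Index of the process with the highest membership for this sample.
        best = max(range(len(values)), key=lambda k: values[k])
        assignments[sample] = (best + 1, values[best] / total)
    return assignments


# Hypothetical gamma values for three samples and three processes.
example_gamma = {
    "sample_A": [0.7, 0.2, 0.1],
    "sample_B": [0.1, 0.3, 0.6],
    "sample_C": [0.4, 0.5, 0.1],
}
assignments = assign_subtypes(example_gamma)
for sample, (process, membership) in assignments.items():
    print(f"{sample}: LPD-{process} (membership {membership:.2f})")
```

Keeping the normalized membership alongside the label preserves the information that some samples, such as the hypothetical `sample_C`, are split across processes rather than belonging cleanly to one.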
Figure 2
Correlation of gene expression profiles between poor prognosis TARGET LPD-1 and corresponding subtypes. Scatter plots comparing the expression levels of the top 500 most variable transcripts across the entire TARGET dataset between TARGET LPD-1 and the corresponding most similar subtypes from the GREEN (GREEN LPD-1), PERRY (PERRY LPD-2), and SCOTT (SCOTT LPD-1) datasets. Trend lines and Pearson correlation coefficients (r) with corresponding P-values are displayed for each comparison.
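The Pearson correlation coefficient reported in these scatter plots can be computed as in the minimal sketch below. The expression vectors are hypothetical placeholders, and the P-value calculation shown in the figure is omitted here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical mean expression values for the same transcripts in two subtypes.
subtype_a = [1.0, 2.0, 3.0, 4.0]
subtype_b = [2.0, 4.0, 6.0, 8.0]
r_same = pearson_r(subtype_a, subtype_b)
r_opposite = pearson_r(subtype_a, list(reversed(subtype_b)))
print(r_same, r_opposite)
```

In practice each vector would hold one value per transcript (500 in the figure), ordered identically in both subtypes so that the pairs line up.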
Figure 3
Overlap of differentially expressed (DE) transcripts. Venn diagram illustrating the overlap of DE transcripts between TARGET LPD-1 and the most closely correlated subtypes from the GREEN, PERRY, and SCOTT datasets. The diagram quantifies the number of DE transcripts in each dataset and identifies eight transcripts shared across all four poor-prognosis subtypes.
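The central region of such a Venn diagram is a set intersection, which can be sketched as follows. The transcript identifiers are invented for illustration and do not correspond to the eight shared transcripts in the figure.

```python
def shared_transcripts(de_by_dataset):
    """Return the transcripts present in the DE list of every dataset.

    de_by_dataset: dict mapping dataset name -> iterable of transcript IDs.
    """
    if not de_by_dataset:
        return set()
    sets = [set(ids) for ids in de_by_dataset.values()]
    return set.intersection(*sets)


# Hypothetical DE transcript lists; the IDs are placeholders, not real results.
de_lists = {
    "TARGET": ["T1", "T2", "T3", "T4"],
    "GREEN": ["T2", "T3", "T5"],
    "PERRY": ["T2", "T3", "T4", "T6"],
    "SCOTT": ["T3", "T2", "T7"],
}
shared = shared_transcripts(de_lists)
print(sorted(shared))
```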
Figure 4
Comparative evaluation of traditional clustering methods. (a) Silhouette analysis to determine the optimal number of clusters for hierarchical and k-means clustering in the TARGET dataset. Three clusters were identified as optimal, with six clusters showing similar performance. (b) Kaplan–Meier survival curves comparing patient survival based on hierarchical and k-means clustering groups using both three and six clusters as suggested by the silhouette analysis. Log-rank test was used to assess statistical significance.
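The silhouette analysis in panel (a) scores how well each sample sits within its assigned cluster compared with the nearest other cluster. The sketch below computes the mean silhouette score with Euclidean distance on toy two-dimensional points; the distance metric, the data, and the convention of scoring singleton clusters as 0 are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Mean silhouette score over all points for a given cluster labeling."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette requires at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: scored as 0 by convention
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])  # labels follow the true groups
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

To choose the number of clusters, the same score would be computed for the labelings produced at each candidate cluster count, with higher mean scores indicating more compact and better-separated clusters.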
