Nat Commun. 2024 Nov 29;15(1):10390.
doi: 10.1038/s41467-024-54771-4.

Synthetic augmentation of cancer cell line multi-omic datasets using unsupervised deep learning

Zhaoxiang Cai et al. Nat Commun. 2024.

Erratum in

Abstract

Integrating diverse types of biological data is essential for a holistic understanding of cancer biology, yet it remains challenging due to data heterogeneity, complexity, and sparsity. Addressing this, our study introduces an unsupervised deep learning model, MOSA (Multi-Omic Synthetic Augmentation), specifically designed to integrate and augment the Cancer Dependency Map (DepMap). Harnessing orthogonal multi-omic information, this model successfully generates molecular and phenotypic profiles, resulting in an increase of 32.7% in the number of multi-omic profiles and thereby generating a complete DepMap for 1523 cancer cell lines. The synthetically enhanced data increases statistical power, uncovering less studied mechanisms associated with drug resistance, and refines the identification of genetic associations and clustering of cancer cell lines. By applying SHapley Additive exPlanations (SHAP) for model interpretation, MOSA reveals multi-omic features essential for cell clustering and biomarker identification related to drug and gene dependencies. This understanding is crucial for developing much-needed effective strategies to prioritize cancer targets.

Conflict of interest statement

Competing interests: AstraZeneca, GlaxoSmithKline, and Astex Pharmaceuticals have awarded M.J.G. research grants and M.J.G. is founder and advisor at Mosaic Therapeutics. All other authors declare no competing interests.

Figures

Fig. 1. Cancer multi-omics integration with MOSA.
a Cancer cell line multi-omic datasets across the 1523 cancer cell lines. Purple represents measured screens, while orange represents gaps, i.e., missing screens, which were synthetically generated with MOSA. b Schematic of the autoencoder, MOSA, where encoders are represented at the top and decoders at the bottom. For simplicity, the integration of only two datasets is represented. Highlighted designs of MOSA are illustrated on the right. Created in BioRender. Cai, Z. (2023) BioRender.com/m96b457. c Dimensionality reduction visualized using Uniform Manifold Approximation and Projection (UMAP) representation of the trained MOSA joint latent space, where each dot represents a cancer cell line colored according to its tissue of origin.
Fig. 2. MOSA reconstruction of drug response and CRISPR-Cas9 datasets.
a MOSA reconstruction quality measured using 10-fold cross-validation. After all test folds are reconstructed, they are concatenated and the reconstruction quality score is calculated as the Pearson’s r between the reconstructed and the measured values. Features ranked by their reconstruction quality are shown for the drug response (left) and the CRISPR-Cas9 (right) datasets. Duplicated drug names represent replicated screens for the same drug. Representative examples of strongly selective CRISPR-Cas9 and drug responses are labeled. b MOSA’s partial dataset augmentation (missing value imputation) of drug IC50s compared to recent independent drug response screens. c–e Similar to b, using MOFA, MOVE, and mean-imputed values, respectively.
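The reconstruction quality score described in panel a can be sketched as follows. This is only an illustration of the stated procedure (concatenate the held-out folds, then compute Pearson's r per feature), not the authors' code. The function names `pearson_r` and `reconstruction_quality`, the list-of-rows data layout, and the use of `None` to mark unmeasured values are all assumptions made for this example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    if n < 2:
        return float("nan")  # correlation is undefined for fewer than 2 points
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant vector has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def reconstruction_quality(folds):
    """Score each feature by Pearson's r over all concatenated test folds.

    `folds` is a list of (measured, reconstructed) pairs, one per held-out
    fold. Each element is a list of rows (cell lines), each row a list of
    feature values. Unmeasured values are marked with None (an assumption
    of this sketch) and are skipped.
    """
    # Concatenate the held-out rows of every fold before scoring.
    measured = [row for m, _ in folds for row in m]
    reconstructed = [row for _, r in folds for row in r]
    n_features = len(measured[0])
    scores = []
    for j in range(n_features):
        pairs = [
            (m_row[j], r_row[j])
            for m_row, r_row in zip(measured, reconstructed)
            if m_row[j] is not None
        ]
        scores.append(pearson_r([p[0] for p in pairs], [p[1] for p in pairs]))
    return scores


# Toy example: two folds, two features (hypothetical values).
fold_1 = ([[1.0, 2.0], [2.0, 1.0]], [[1.1, -2.0], [2.1, -1.0]])
fold_2 = ([[3.0, 0.0], [4.0, None]], [[3.1, 0.0], [4.1, 5.0]])
scores = reconstruction_quality([fold_1, fold_2])
```

Features can then be ranked by these scores, as in the panel. Scoring after concatenation, rather than averaging per-fold correlations, gives each feature a single score computed over all held-out cell lines.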
Fig. 3. Multi-omics benchmark of MOSA.
a Distribution of per-cell-line proteomics correlations with an independent dataset (CCLE), grouped by whether the cancer cell line had proteomic data for the model training (orange, n = 291) versus cell lines without any proteomics prior (light blue, n = 78). b Distribution of cancer cell line correlations (Pearson’s r) between an independent drug response dataset (CTD2) and the MOSA reconstructed dataset, grouped by whether the cancer cell line had prior availability of drug response in the datasets for the model training (orange, n = 571) versus cell lines without drug response data (light blue, n = 239). c One-sided log-ratio test p-value of genetic associations with CRISPR-Cas9 gene essentiality with the original dataset (x-axis) and the augmented MOSA dataset (y-axis). False discovery rate (FDR) correction is applied using the Benjamini-Hochberg method to adjust for multiple comparisons. d Fisher skew test per gene across the original CRISPR-Cas9 dataset (x-axis) and the MOSA augmented dataset (y-axis). Dot size represents the number of cell lines that have the gene as essential (scaled log2 fold-change < −0.5) in the original dataset. e Correlation between BRAF and MAPK1 CRISPR-Cas9 gene essentialities using both previously measured (Observed) and synthetically reconstructed (Reconstructed) values. Gene essentiality scores are represented using copy-number corrected log2 fold-changes scaled by the median of common essential (score = −1) and non-essential (score = 0) genes. Gene essentialities are also grouped according to the presence or absence of a BRAF mutation, mostly V600E gain-of-function mutations. f CRISPR-Cas9 gene essentiality association with FLI1-EWSR1 fusion. Confidence intervals of 95% are displayed for the regression lines in panels d, e, and f. Box-and-whisker plots show 1.5× interquartile ranges; centers indicate medians in panels e and f.
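Two of the computations named in this caption are standard enough to sketch: the essentiality scaling (median of common essential genes mapped to −1, median of non-essential genes mapped to 0) and the Benjamini-Hochberg adjustment. The sketch below is a minimal illustration, not the authors' pipeline. The function names, the dict-based layout, the toy gene labels, and the assumption that scaling is applied per cell line are all introduced here for the example.

```python
from statistics import median


def scale_essentiality(fold_changes, essential_genes, nonessential_genes):
    """Linearly rescale log2 fold-changes for one cell line.

    After scaling, the median of the common essential genes is -1 and the
    median of the non-essential genes is 0. `fold_changes` maps gene name to
    log2 fold-change (assumed layout for this sketch).
    """
    med_ess = median(fold_changes[g] for g in essential_genes if g in fold_changes)
    med_non = median(fold_changes[g] for g in nonessential_genes if g in fold_changes)
    span = med_non - med_ess
    if span == 0:
        raise ValueError("essential and non-essential medians coincide")
    return {g: (v - med_non) / span for g, v in fold_changes.items()}


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example with hypothetical gene labels and values.
raw = {"E1": -2.0, "E2": -4.0, "N1": 0.0, "N2": 1.0, "G": -3.0}
scaled = scale_essentiality(raw, ["E1", "E2"], ["N1", "N2"])
# Genes called essential under the caption's threshold (scaled score < -0.5).
essential = sorted(g for g, s in scaled.items() if s < -0.5)
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

With this scaling, the −0.5 threshold in panel d sits halfway between the typical non-essential and the typical common essential gene, which makes scores comparable across cell lines.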
Fig. 4. SHapley Additive exPlanations (SHAP) model explanation of MOSA.
a Top features from each omic layer that contribute the most to the multi-omic latent space. b Top drugs that have the highest feature importance from metabolite 1-methylnicotinamide. c Top features that contribute the most to the reconstruction of the drug response of Daraprim (Pyrimethamine).
