Nat Commun. 2024 Nov 29;15(1):10390.

doi: 10.1038/s41467-024-54771-4.

Synthetic augmentation of cancer cell line multi-omic datasets using unsupervised deep learning

Zhaoxiang Cai^#¹, Sofia Apolinário^#²,³, Ana R Baião²,³, Clare Pacini⁴, Miguel D Sousa²,³, Susana Vinga²,³, Roger R Reddel¹, Phillip J Robinson¹, Mathew J Garnett⁴, Qing Zhong⁵, Emanuel Gonçalves⁶,⁷

Affiliations

¹ ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW, Australia.
² INESC-ID, 1000-029, Lisboa, Portugal.
³ Instituto Superior Técnico (IST), Universidade de Lisboa, 1049-001, Lisboa, Portugal.
⁴ Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, CB10 1SA, UK.
⁵ ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW, Australia. qzhong@cmri.org.au.
⁶ INESC-ID, 1000-029, Lisboa, Portugal. emanuel.v.goncalves@tecnico.ulisboa.pt.
⁷ Instituto Superior Técnico (IST), Universidade de Lisboa, 1049-001, Lisboa, Portugal. emanuel.v.goncalves@tecnico.ulisboa.pt.

^# Contributed equally.

PMID: 39614072
PMCID: PMC11607321
DOI: 10.1038/s41467-024-54771-4

Zhaoxiang Cai et al. Nat Commun. 2024.

Erratum in

Author Correction: Synthetic augmentation of cancer cell line multi-omic datasets using unsupervised deep learning.
Cai Z, Apolinário S, Baião AR, Pacini C, Sousa MD, Vinga S, Reddel RR, Robinson PJ, Garnett MJ, Zhong Q, Gonçalves E. Nat Commun. 2025 Feb 4;16(1):1352. doi: 10.1038/s41467-025-56686-0. PMID: 39905123.

Abstract

Integrating diverse types of biological data is essential for a holistic understanding of cancer biology, yet it remains challenging due to data heterogeneity, complexity, and sparsity. Addressing this, our study introduces an unsupervised deep learning model, MOSA (Multi-Omic Synthetic Augmentation), specifically designed to integrate and augment the Cancer Dependency Map (DepMap). Harnessing orthogonal multi-omic information, this model successfully generates molecular and phenotypic profiles, resulting in an increase of 32.7% in the number of multi-omic profiles and thereby generating a complete DepMap for 1523 cancer cell lines. The synthetically enhanced data increases statistical power, uncovering less studied mechanisms associated with drug resistance, and refines the identification of genetic associations and clustering of cancer cell lines. By applying SHapley Additive exPlanations (SHAP) for model interpretation, MOSA reveals multi-omic features essential for cell clustering and biomarker identification related to drug and gene dependencies. This understanding is crucial for developing much-needed effective strategies to prioritize cancer targets.

Conflict of interest statement

Competing interests: AstraZeneca, GlaxoSmithKline, and Astex Pharmaceuticals have awarded M.J.G. research grants and M.J.G. is founder and advisor at Mosaic Therapeutics. All other authors declare no competing interests.

Figures

**Fig. 1. Cancer multi-omics integration with MOSA.**
a Cancer cell line multi-omic datasets across the 1523 cancer cell lines. Purple represents measured screens, while orange represents gaps, i.e., missing screens, which were synthetically generated with MOSA. b Schematic of the autoencoder, MOSA, where encoders are represented at the top and decoders at the bottom. For simplicity, the integration of only two datasets is represented. Highlighted designs of MOSA are illustrated on the right. Created in BioRender. Cai, Z. (2023) BioRender.com/m96b457. c Dimensionality reduction visualized using Uniform Manifold Approximation and Projection (UMAP) representation of the trained MOSA joint latent space, where each dot represents a cancer cell line colored according to its tissue of origin.

**Fig. 2. MOSA reconstruction of drug response and CRISPR-Cas9 datasets.**
a MOSA reconstruction quality measured using 10-fold cross-validation. After all test folds are reconstructed, they are concatenated and the reconstruction quality score is calculated as the Pearson’s r between the reconstructed and the measured values. Features ranked by their reconstruction quality are shown for the drug response (left) and the CRISPR-Cas9 (right) datasets. Duplicated drug names represent replicated screens for the same drug. Representative examples of strongly selective CRISPR-Cas9 and drug responses are labeled. b MOSA’s partial dataset augmentation (missing value imputation) of drug IC50s compared to recent independent drug response screens. c–e Similar to b, using MOFA, MOVE, and mean-imputed values, respectively.
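The scoring procedure in panel a (reconstruct each held-out fold, concatenate the folds, then compute Pearson's r per feature) can be sketched as follows. This is a minimal illustration and not the authors' code: `fit_and_reconstruct` is a hypothetical callable standing in for training a model on the training rows and reconstructing the held-out rows, and the toy data and noise model are invented for the example.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def reconstruction_quality(measured, fit_and_reconstruct, n_folds=10, seed=0):
    """Return one Pearson's r per feature (column) of `measured`.

    measured: list of rows (cell lines), each a list of feature values.
    fit_and_reconstruct(train_rows, test_rows): hypothetical model hook that
    returns reconstructed versions of test_rows.
    """
    n_samples, n_features = len(measured), len(measured[0])
    reconstructed = [None] * n_samples
    for fold in kfold_indices(n_samples, n_folds, seed):
        held_out = set(fold)
        train = [measured[i] for i in range(n_samples) if i not in held_out]
        test = [measured[i] for i in fold]
        # Write each fold's reconstructions back into sample order, which
        # amounts to concatenating all test folds before scoring.
        for i, row in zip(fold, fit_and_reconstruct(train, test)):
            reconstructed[i] = row
    return [
        pearson_r([row[j] for row in measured], [row[j] for row in reconstructed])
        for j in range(n_features)
    ]


# Toy data: 50 "cell lines" x 2 "features" (invented for illustration).
rng = random.Random(1)
measured = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]


def noisy_stand_in(train, test):
    """Stand-in for a trained model: returns the held-out rows plus small noise."""
    noise = random.Random(2)
    return [[value + noise.gauss(0, 0.1) for value in row] for row in test]


quality = reconstruction_quality(measured, noisy_stand_in)
ranked = sorted(range(len(quality)), key=lambda j: quality[j], reverse=True)
```

Scoring the concatenated folds, rather than averaging per-fold correlations, yields a single r per feature computed over every cell line; `ranked` then orders features by that score, as in the ranked plots of panel a.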

**Fig. 3. Multi-omics benchmark of MOSA.**
a Distribution of proteomics cancer cell line correlations with an independent dataset (CCLE), grouped by whether the cancer cell line had proteomic data for the model training (orange, n = 291) versus cell lines without any prior proteomics data (light blue, n = 78). b Distribution of cancer cell line correlations (Pearson’s r) between an independent drug response dataset (CTD2) and the MOSA reconstructed dataset, grouped by whether the cancer cell line had prior availability of drug response in the datasets for the model training (orange, n = 571) versus cell lines without drug response data (light blue, n = 239). c One-sided log-ratio test p-value of genetic associations with CRISPR-Cas9 gene essentiality with the original dataset (x-axis) and the augmented MOSA dataset (y-axis). False discovery rate (FDR) correction is applied using the Benjamini-Hochberg method to adjust for multiple comparisons. d Fisher skew test per gene across the original CRISPR-Cas9 dataset (x-axis) and the MOSA augmented dataset (y-axis). Dot size represents the number of cell lines that have the gene as essential (scaled log2 fold-change < −0.5) in the original dataset. e Correlation between *BRAF* and *MAPK1* CRISPR-Cas9 gene essentialities using both previously measured (Observed) and synthetically reconstructed (Reconstructed) values. Gene essentiality scores are represented using copy-number corrected log2 fold-changes scaled by the median of common essential (score = −1) and non-essential (score = 0) genes. Gene essentialities are also grouped according to the presence or absence of a *BRAF* mutation, mostly V600E gain-of-function mutations. f CRISPR-Cas9 gene essentiality association with *FLI1*-*EWSR1* fusion. Confidence intervals of 95% are displayed for the regression lines in panels d, e, and f. In panels e and f, box-and-whisker plots show 1.5× interquartile ranges, with centers indicating medians.
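Two computations named in this caption lend themselves to a short sketch: scaling log2 fold-changes so that the median non-essential gene scores 0 and the median common-essential gene scores −1, and the Benjamini-Hochberg adjustment of p-values. The caption does not give the exact scaling formula, so the linear form below is an assumption; the gene names, values, and function names are placeholders invented for illustration.

```python
from statistics import median


def scale_essentiality(lfc_by_gene, essential_genes, nonessential_genes):
    """Linearly rescale log2 fold-changes for one cell line.

    Assumed form: (lfc - median_nonessential) / (median_nonessential - median_essential),
    which maps the non-essential median to 0 and the essential median to -1.
    """
    med_ess = median(lfc_by_gene[g] for g in essential_genes if g in lfc_by_gene)
    med_non = median(lfc_by_gene[g] for g in nonessential_genes if g in lfc_by_gene)
    span = med_non - med_ess
    if span <= 0:
        raise ValueError("essential genes must be more depleted than non-essential genes")
    return {gene: (lfc - med_non) / span for gene, lfc in lfc_by_gene.items()}


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Placeholder gene sets and fold-changes (not real data).
lfc = {"ESS_A": -2.0, "ESS_B": -3.0, "ESS_C": -4.0,
       "NON_A": 0.5, "NON_B": 0.0, "NON_C": -0.5,
       "GENE_X": -2.1}
scaled = scale_essentiality(lfc, ["ESS_A", "ESS_B", "ESS_C"], ["NON_A", "NON_B", "NON_C"])

# Genes below the -0.5 threshold used for dot sizing in panel d.
essential_calls = sorted(g for g, s in scaled.items() if s < -0.5)

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

With this scaling, a threshold such as −0.5 means the same thing in every cell line (halfway between the typical non-essential and common-essential gene), which is what makes per-gene counts of essential cell lines comparable across screens.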

**Fig. 4. SHapley Additive exPlanations (SHAP) model explanation of MOSA.**
a Top features from each omic layer that contribute the most to the multi-omic latent space. b Top drugs that have the highest feature importance from metabolite 1-methylnicotinamide. c Top features that contribute the most to the reconstruction of the drug response of Daraprim (Pyrimethamine).

Grants and funding

WT_/Wellcome Trust/United Kingdom
