Multi-modal meta-analysis of cancer cell line omics profiles identifies ECHDC1 as a novel breast tumor suppressor
- PMID: 33750001
- PMCID: PMC7983037
- DOI: 10.15252/msb.20209526
Abstract
Molecular and functional profiling of cancer cell lines is subject to laboratory-specific experimental practices and data analysis protocols. The current challenge therefore is how to make an integrated use of the omics profiles of cancer cell lines for reliable biological discoveries. Here, we carried out a systematic analysis of nine types of data modalities using meta-analysis of 53 omics studies across 12 research laboratories for 2,018 cell lines. To account for a relatively low consistency observed for certain data modalities, we developed a robust data integration approach that identifies reproducible signals shared among multiple data modalities and studies. We demonstrated the power of the integrative analyses by identifying a novel driver gene, ECHDC1, with tumor suppressive role validated both in breast cancer cells and patient tumors. The multi-modal meta-analysis approach also identified synthetic lethal partners of cancer drivers, including a co-dependency of PTEN deficient endometrial cancer cells on RNA helicases.
Keywords: cancer driver; data integration; multi-omics data; reproducibility; synthetic lethality.
© 2021 The Authors. Published under the terms of the CC BY 4.0 license.
Conflict of interest statement
The authors declare that they have no conflict of interest.
Figures

Overview of datasets, research sites, and molecular modalities that were analyzed in the study.
The number of cell lines having data for the 9 types of modalities that were analyzed in the study.
The number of cell lines for which data were available for each of the modality types.
Correlation of the different types of data modalities of cancer cell lines profiled at multiple research sites. Spearman’s correlation was calculated between identical cell lines for the set of genes shared between any two datasets. Gray distributions show, for comparison, the correlation of non‐identical cell lines between datasets from different research sites. N_g and N_c indicate the median [range] of the number of genes and cell lines, respectively, across the pairwise comparisons made between datasets from different research sites. More details on the breakdown of N_c and N_g by data modality and research site are available in Appendix Figs S5B and S6, respectively, and Appendix Fig S7C shows the correlation P‐values adjusted for the sample size (N_g). For the point mutation view, only those genes having mutations with an associated functional consequence were considered in the Matthews correlation analysis. Only those datasets for which the mutation profiles were obtained using whole‐exome sequencing were considered in this study. Horizontal lines mark the median value. Abbreviations: Target addiction score (TAS), Drug Sensitivity Score (DSS), gene dependency (FUNC), protein phosphorylation (PHOS), protein expression (PEXP), gene expression (GEXP), copy number variation (CNV), point mutation (MUT) and methylation (METH) profiles.
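The per-cell-line comparison in this legend can be illustrated with a minimal sketch: Spearman's correlation of one cell line's profile from two sites, restricted to the genes both sites measured. The function names and the example values are hypothetical and are not taken from the study's code or data.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_shared_genes(profile_a, profile_b):
    """Spearman correlation of one cell line across two datasets, on shared genes.

    Returns (rho, number of shared genes).
    """
    shared = sorted(set(profile_a) & set(profile_b))
    x = [profile_a[g] for g in shared]
    y = [profile_b[g] for g in shared]
    return pearson(average_ranks(x), average_ranks(y)), len(shared)

# Hypothetical expression values for the same cell line at two sites
site_a = {"TP53": 2.1, "KRAS": 5.0, "ESR1": 0.3, "MYC": 7.2}
site_b = {"TP53": 1.8, "KRAS": 4.1, "ESR1": 0.9, "PTEN": 3.3}
rho, n_genes = spearman_shared_genes(site_a, site_b)
```

Spearman's correlation is computed here as the Pearson correlation of average ranks, which handles ties without a separate correction term.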

Correlation matrix plot of average Spearman’s correlation of gene dependency profiles of cancer cell lines calculated based on genome‐wide RNAi screens and CRISPR screens. Number of overlapping cell lines between any two datasets used for estimating the average correlation ranges between 2 and 284, with a mean of 46.4. The empty cells indicate that no identical cell lines were profiled between the two datasets.
Distribution of Spearman’s correlation of gene dependency profiles between different study sites. Triangles represent mean correlation values. Numbers below the labels represent the number of overlapping cell lines based on which the distributions were drawn.
Average Spearman’s correlation of MS‐based proteomic profiles between different study sites generated using different peptide labeling procedures. The empty cells indicate that no identical cell lines were profiled between the two datasets. Number of overlapping cell lines between any two datasets used for estimating the average correlation ranges between 3 and 27, with a mean of 7.8.
Distribution of Spearman’s correlation of MS‐based proteomic profiles between different study sites. Numbers below the labels represent the number of overlapping cell lines based on which the distributions were drawn. Triangles correspond to the median value.
Coefficient of variation (CV) of proteins detected and quantified in the UW TNBC study (non‐TMT‐labeled) vs. the MGHCC BREAST study (TMT‐labeled). Both studies had a maximal overlap of breast cancer cell lines for a robust estimation of CV. Housekeeping genes are highlighted as red dots. Spearman’s correlation (r_CV) was calculated to estimate the agreement in the CV estimates of the common set of proteins between the two studies.

CLIP performs a meta‐analysis of datasets from multiple sites for each data modality type: Target addiction score (TAS), Gene dependency (FUNC), protein phosphorylation (PHOS), protein expression (PEXP), gene expression (GEXP), copy number variation (CNV), point mutation (MUT) and methylation (METH) profiles.
For each modality type, CLIP iterates over datasets available from multiple sites and quantifies the cancer context specificity (CCS) property for every gene G in cell line j.
For all unique cell lines, the CCS property is quantified for each gene G in a dataset D. For continuous modalities (METH, GEXP, PEXP, PHOS, FUNC, TAS), we defined the Outlier Evidence Score (OES_{G,D,j}), calculated by normalizing the observed value by the mean in the dataset for each gene (X_i). SD is defined as the standard deviation. For binary modalities (CNV‐GAIN, CNV‐LOSS and MUT), we defined the Proportion Score (PS_{G,D,j}) for each gene G in cell line j, calculated as the frequency of the alteration (F_{D,j}) normalized by the total samples in each dataset (N_{D,j}).
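The two scores described above can be sketched as follows. The legend does not give the exact formulas, so this sketch assumes a z-score form for the Outlier Evidence Score, (observed value minus the gene's mean in the dataset) divided by SD, and a simple altered-over-total fraction for the Proportion Score. The use of the sample standard deviation, the function names, and the example values are likewise assumptions.

```python
from statistics import mean, stdev

def outlier_evidence_scores(values_by_cell_line):
    """OES for one gene in one dataset, assumed form: (x - mean) / SD.

    `values_by_cell_line` maps cell line name -> continuous measurement.
    Uses the sample standard deviation (an assumption).
    """
    values = list(values_by_cell_line.values())
    mu, sd = mean(values), stdev(values)
    return {cl: (x - mu) / sd for cl, x in values_by_cell_line.items()}

def proportion_score(altered_by_cell_line):
    """PS for one gene in one dataset, assumed form: altered samples / total samples.

    `altered_by_cell_line` maps cell line name -> True if the alteration is present.
    """
    return sum(altered_by_cell_line.values()) / len(altered_by_cell_line)

# Hypothetical expression of one gene across four cell lines in one dataset
gexp = {"MCF7": 1.0, "T47D": 2.0, "BT474": 3.0, "SKBR3": 10.0}
oes = outlier_evidence_scores(gexp)

# Hypothetical mutation calls for one gene in the same four cell lines
mut = {"MCF7": True, "T47D": False, "BT474": False, "SKBR3": True}
ps = proportion_score(mut)
```

Under these assumptions, the cell line with the most extreme value (SKBR3 in the toy data) receives the largest score, which is the behavior an outlier-based context-specificity measure needs.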
For a given cell line j, OES_{G,D} scores across the available datasets are integrated using the Rank Product analysis to find statistically consistent genes that are at the top of the ranked list of genes (CCS_UP) or at the bottom (CCS_DOWN).
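A minimal sketch of a rank product integration of per-dataset scores for one cell line is given below. It computes the geometric mean of each gene's rank across datasets and an approximate P-value from a simplified null in which each rank is drawn uniformly at random. This is a generic illustration of the rank product idea and not necessarily the exact procedure used by CLIP; all names and values are hypothetical.

```python
import math
import random

def ranks_descending(scores):
    """Rank genes so that the highest score gets rank 1."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {gene: r for r, gene in enumerate(ordered, start=1)}

def rank_product(datasets):
    """Geometric mean of each gene's rank across datasets (genes present in all)."""
    genes = set.intersection(*(set(d) for d in datasets))
    per_dataset = [ranks_descending({g: d[g] for g in genes}) for d in datasets]
    k = len(datasets)
    return {g: math.prod(r[g] for r in per_dataset) ** (1 / k) for g in genes}

def rank_product_pvalues(datasets, n_perm=1000, seed=0):
    """Approximate P-values: fraction of null rank products <= the observed one.

    Simplified null (an assumption): ranks are independent and uniform on 1..n.
    """
    rng = random.Random(seed)
    observed = rank_product(datasets)
    n, k = len(observed), len(datasets)
    null = [
        math.prod(rng.randint(1, n) for _ in range(k)) ** (1 / k)
        for _ in range(n_perm)
    ]
    return {
        g: (1 + sum(v <= rp for v in null)) / (1 + n_perm)
        for g, rp in observed.items()
    }

# Hypothetical outlier scores for one cell line in three datasets
d1 = {"ERBB2": 3.2, "ESR1": 0.1, "TP53": -0.5, "MYC": 1.0}
d2 = {"ERBB2": 2.8, "ESR1": -0.2, "TP53": 0.3, "MYC": 0.9}
d3 = {"ERBB2": 4.1, "ESR1": 0.4, "TP53": -1.1, "MYC": 0.2}
rp = rank_product([d1, d2, d3])
pvals = rank_product_pvalues([d1, d2, d3])
```

Genes consistently at the bottom of the lists can be found with the same functions by negating the scores before ranking.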
Finally, CLIP produces a profile of all the genes that are identified as CCS. In total, 13 different modality features were assessed by the CLIP framework, provided there are data available for a cell line for all the molecular datatypes. All genes identified as a CCS gene in any modality are highlighted, light orange for up‐regulation and light blue for down‐regulation. Genes that have CCS evidence across two or more modality types are considered in our analyses as robust Cancer Context‐Specific (rCCS) genes, highlighted as light green.
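The rule for robust genes stated above (CCS evidence in two or more modality types) can be sketched as a simple count over modalities. The input layout, the function name, and the example gene sets are hypothetical.

```python
def robust_ccs_genes(ccs_by_modality, min_modalities=2):
    """Return genes flagged as CCS in at least `min_modalities` modality types.

    `ccs_by_modality` maps modality type (e.g. "GEXP") -> set of CCS genes
    for a single cell line. The result maps gene -> sorted supporting modalities.
    """
    support = {}
    for modality, genes in ccs_by_modality.items():
        for gene in genes:
            support.setdefault(gene, set()).add(modality)
    return {
        gene: sorted(mods)
        for gene, mods in support.items()
        if len(mods) >= min_modalities
    }

# Hypothetical CCS calls for one cell line
ccs_calls = {
    "GEXP": {"ERBB2", "ESR1"},
    "CNV": {"ERBB2"},
    "PEXP": {"ERBB2", "MYC"},
}
rccs = robust_ccs_genes(ccs_calls)
```

In this toy input only ERBB2 has support from more than one modality, so it is the only gene returned.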
A schematic of the CLIP signature of a hypothetical gene, which summarizes its CCS evidence in a selected subset of cell lines, defined as a group based on any relevant criteria (the example shows all HER2+ breast cancer cell lines). The y‐axis shows the ratio of the number of cell lines in which the gene is identified as a CCS gene to the total number of cell lines in the particular subset.

Subset of cell line‐specific drivers that were identified as rCCS genes in this study. Highlighted entries indicate that the gene was identified as a rCCS gene in that modality.
Proportion of the rCCS genes identified by CLIP and supported by the various data modalities, relative to the average number of genes profiled for each modality in all cancer cell lines (n = 1,047). Boxes represent the interquartile range, the notch in each box represents the median value and whiskers represent the range of the values.
Proportion of ER+ and ER− breast cancer cell lines that have ESR1 as a rCCS gene. P‐value was calculated with the Fisher’s exact test.
The data modalities that supported the rCCS status of ESR1 and the proportion of cell lines having that evidence in the ER+ cell lines (n = 20).
Proportion of HER2+ and HER2− breast cancer cell lines that have ERBB2 as a rCCS gene. P‐value was calculated with the Fisher’s exact test.
The data modalities that supported the rCCS status of ERBB2 and the proportion of cell lines having that evidence in the HER2+ cell lines (n = 17).
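The Fisher's exact test used for the ER and HER2 comparisons above operates on a 2x2 table (for example, subtype status by rCCS status). A minimal two-sided implementation using only the standard library is sketched below. The function name is hypothetical, and the tables in the example are toy inputs, not counts from the study.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    # Small relative tolerance guards against floating-point ties
    total = sum(p for p in probs if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Toy tables: a perfectly separated table and a perfectly balanced one
p_separated = fisher_exact_two_sided(3, 0, 0, 3)
p_balanced = fisher_exact_two_sided(1, 1, 1, 1)
```

Rows would correspond to the two subtype groups and columns to whether the gene is an rCCS gene in each cell line.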
Benchmarking the performance of CLIP to identify well‐known breast cancer driver genes. True positive (TP) fraction of unique cancer driver genes (n = 201) for the three defined breast cancer subtypes as identified by CLIP and alternative approaches based on differential analysis in each specific modality alone, and using the latent factor‐based Multi‐Omics Factor Analysis (MOFA+) methods for data integration.

The CLIP signature of ECHDC1 suggests that it was hypermethylated and down‐expressed in all the breast cancer cell lines (n = 24) in which it was identified as an rCCS gene.
Breast cancer‐specific survival (BCSS) based on gene expression and methylation levels of ECHDC1 in breast cancer patient tumors in the combined Metabric and Oslo datasets (n = 3,885). Patients in the low GEXP class have lower BCSS than those in the non‐low GEXP group. Numbers above the x‐axis line indicate the number of patients in each group, defined by the color code, at each time point. P‐value from an age‐adjusted Cox proportional hazards model.
Benign breast epithelial MCF10A and breast carcinoma BT‐474 cells were embedded in 3D collagen as single cells or as spheroids, respectively, and the growth was followed for 5 days. Light micrographs show filamentous actin (phalloidin) and nuclei (Hoechst) in representative cell colonies. Quantitative assessment of the nuclei counts per colony shows the induced proliferation in MCF10A cells after ECHDC1 sgRNA knockout. At 72 h, MCF10A mock vs. ECHDC1_sgRNA_1 and ECHDC1_sgRNA_2 P < 0.05; at 96 h mock vs. ECHDC1_sgRNA_1, ECHDC1_sgRNA_2 and ECHDC1_sgRNA_3 P < 0.001; at 120 h mock vs. ECHDC1_sgRNA_1, ECHDC1_sgRNA_2, and ECHDC1_sgRNA_3 P < 0.0001. Nuclei count relative to mock 0 h. Error bars indicate mean ± SEM; n ≥ 10 colonies. Statistical significance was assessed with one‐way ANOVA with Tukey’s multiple comparison test. Scale bar 50 µm.
Metabolic pathway of propanoate metabolism.
Measured metabolite levels of intermediates in propanoate metabolism in select breast cancer cell lines with or without the ECHDC1 rCCS status (n = 7 in both groups). Boxes represent the interquartile range, whiskers represent the range of the values and the solid line within the box corresponds to the median value. Outlier points indicate values not included between the whiskers. Statistical significance was assessed with the Wilcoxon test.

Proportion of KRAS‐mutated (Mut) and KRAS wild‐type (WT) cancer cell lines with KRAS identified as a rCCS gene. P‐value was calculated with Fisher’s exact test.
The modalities that support the rCCS status of KRAS and the proportion of cell lines having that evidence in the KRAS‐mutated cell lines (n = 155).
Proportion of PIK3CA‐mutated (Mut) and PIK3CA wild‐type (WT) cancer cell lines with PIK3CA identified as a rCCS gene. P‐value was calculated with Fisher’s exact test.
The modalities that support the rCCS status of PIK3CA and the proportion of cell lines having that evidence in the PIK3CA‐mutated cell lines (n = 140).
Systematic identification of cancer driver genes specific to epithelial cancer cell lines (n = 737) under multiple settings of the CLIP run. rCCS genes identified by CLIP are enriched for known cancer drivers compared to non‐driver genes, even after excluding the FUNC data modality from the CLIP approach. Boxes represent the interquartile range, whiskers represent the range of the values and the solid line within the box corresponds to the median value. Statistical significance was assessed with the Wilcoxon test.
Proportion of PTEN‐mutated (Mut) and PTEN wild‐type (WT) cancer cell lines in which DDX27 was identified as a rCCS gene. P‐value was calculated with Fisher’s exact test.
The modalities that supported the rCCS status of DDX27 and the proportion of cell lines having that evidence in the PTEN‐mutated cell lines (n = 77).
Survival analysis based on mRNA expression levels of DDX27 in patients with endometrial cancer in the TCGA dataset. Expression levels were divided into 2 classes, high (n = 203) and low (n = 322), based on mean expression level of DDX27 (logFPKM = 18.27). Patients in the high class showed lower survival probability than those in the low class (P = 4.2 × 10^−4; log‐rank test).
mRNA expression levels of DDX27 in PTEN‐mutated (Mut, n = 302) and PTEN wild‐type (WT, n = 224) endometrial patient tumors in the TCGA dataset. Triangles correspond to the median value. P‐value was calculated with Wilcoxon test.