Bioinformatics. 2022 Sep 15;38(18):4360-4368.
doi: 10.1093/bioinformatics/btac523.

Overcoming selection bias in synthetic lethality prediction

Colm Seale et al. Bioinformatics.

Abstract

Motivation: Synthetic lethality (SL) between two genes occurs when simultaneous loss of function leads to cell death. This holds great promise for developing anti-cancer therapeutics that target synthetic lethal pairs of endogenously disrupted genes. Identifying novel SL relationships through exhaustive experimental screens is challenging, due to the vast number of candidate pairs. Computational SL prediction is therefore sought to identify promising SL gene pairs for further experimentation. However, current SL prediction methods lack consideration for generalizability in the presence of selection bias in SL data.

Results: We show that SL data exhibit considerable gene selection bias. Our experiments designed to assess the robustness of SL prediction reveal that models driven by the topology of known SL interactions (e.g. graph, matrix factorization) are especially sensitive to selection bias. We introduce selection bias-resilient synthetic lethality (SBSL) prediction using regularized logistic regression or random forests. Each gene pair is described by 27 molecular features derived from cancer cell line, cancer patient tissue and healthy donor tissue samples. SBSL models are built and tested using approximately 8000 experimentally derived SL pairs across breast, colon, lung and ovarian cancers. Compared to other SL prediction methods, SBSL showed higher predictive performance, better generalizability and robustness to selection bias. Gene dependency, quantifying the essentiality of a gene for cell survival, contributed most to SBSL predictions. Random forests were superior to linear models in the absence of dependency features, highlighting the relevance of mutual exclusivity of somatic mutations, co-expression in healthy tissue and differential expression in tumour samples.

Availability and implementation: https://github.com/joanagoncalveslab/sbsl.

Supplementary information: Supplementary data are available at Bioinformatics online.
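
The gene selection bias described in the Results can be made concrete by counting how many labelled pairs each gene takes part in, which is the quantity shown in the barplot of Fig. 1. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names `gene_pair_counts` and `top_gene_share` and the toy gene identifiers are hypothetical, and the data are made up for illustration.

```python
from collections import Counter


def gene_pair_counts(pairs):
    """Count how many labelled pairs each gene takes part in."""
    counts = Counter()
    for gene_a, gene_b in pairs:
        counts[gene_a] += 1
        counts[gene_b] += 1
    return counts


def top_gene_share(pairs, k=1):
    """Fraction of pairs involving at least one of the k most frequent genes."""
    counts = gene_pair_counts(pairs)
    top = {gene for gene, _ in counts.most_common(k)}
    covered = sum(1 for a, b in pairs if a in top or b in top)
    return covered / len(pairs)


# Toy, made-up labelled pairs (placeholder gene names, not real SL data).
toy_pairs = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G1", "G5"), ("G6", "G7")]

print(gene_pair_counts(toy_pairs)["G1"])  # number of pairs involving G1
print(top_gene_share(toy_pairs, k=1))     # share of pairs covered by the top gene
```

A share close to 1 for a small `k` would indicate that a few genes dominate the labelled pairs, which is the kind of skew that topology-driven models could exploit.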

Figures

Fig. 1.
Structure of SL labels. Adjacency plot showing OV gene pairs. Elements along horizontal and vertical axes represent unique genes. Each coloured cell denotes a negative (red) or positive (blue) SL pair. White cells denote pairs with no label. Rows are ordered according to hierarchical clustering with complete linkage and Euclidean distance. Columns follow the ordering of rows. The barplot to the right shows the number of pairs each gene is involved in. The group of eight genes at the bottom of the plot (highlighted in red) consists mostly of tyrosine kinases. (A color version of this figure appears in the online version of this article.)
Fig. 2.
Cross-SL gold standard performances. AUROC values averaged over 10 runs for: (left) BRCA models trained on ISLE and tested on DiscoverSL; (right) LUAD models trained on DiscoverSL and tested on ISLE.
Fig. 3.
Performances of gene-holdout experiments, where bias is controlled by ensuring that none, one or both genes of pairs in the test set are excluded from the train set. Shown are AUROC values for each gene-holdout experiment per cancer type (10 runs). For ‘None’, we only guarantee that train and test sets are disjoint in terms of gene pairs, not individual genes; for ‘Single’, only one gene from a gene pair in the test set can be present in the train set; for ‘Double’, neither gene of a pair in the test set appears in the train set. The results for ‘None’ correspond to those also reported in Table 2. Note: There was insufficient data to conduct the OV ‘Double’ experiment.
Fig. 4.
Cross-cancer and LOCO performances. Average AUROC for L0L2 and MUVR models over 10 runs. Cross-cancer: Vertical and horizontal axes denote the cancer types used to train and test, respectively. LOCO: Horizontal axis denotes the cancer type held out for testing; models are trained on balanced data from all other cancers.
Fig. 5.
Performance of SBSL models with and without gene dependency-based features (AUROC over 10 runs), labelled ‘Full Feature Set’ and ‘No Dep Features’, respectively.
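
The ‘Single’ and ‘Double’ gene-holdout settings described in the caption of Fig. 3 can be illustrated with a small splitting routine. This is a sketch of one possible construction, assumed for illustration and not taken from the paper or its repository: a set of genes is held out, training pairs contain no held-out gene, ‘Single’ test pairs contain exactly one held-out gene, and ‘Double’ test pairs contain two. The name `gene_holdout_split` and the toy gene identifiers are hypothetical. The ‘None’ setting, a split that is disjoint only at the pair level, is not covered here.

```python
def gene_holdout_split(pairs, held_out_genes, mode):
    """Split gene pairs into train and test sets for a gene-holdout experiment.

    Assumed construction (illustrative only):
      - train: pairs in which neither gene is held out
      - test, mode="single": pairs with exactly one held-out gene
      - test, mode="double": pairs in which both genes are held out
    Pairs that fit neither set for the chosen mode are discarded.
    """
    if mode not in ("single", "double"):
        raise ValueError("mode must be 'single' or 'double'")
    held = set(held_out_genes)
    wanted = 1 if mode == "single" else 2
    train, test = [], []
    for pair in pairs:
        n_held = sum(gene in held for gene in pair)
        if n_held == 0:
            train.append(pair)
        elif n_held == wanted:
            test.append(pair)
    return train, test


# Toy, made-up pairs with placeholder gene names.
toy_pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
held_out = {"E", "F"}

train_single, test_single = gene_holdout_split(toy_pairs, held_out, "single")
train_double, test_double = gene_holdout_split(toy_pairs, held_out, "double")
print(test_single)  # pairs with exactly one held-out gene
print(test_double)  # pairs with both genes held out
```

Under this construction no held-out gene ever appears in the training pairs, so a model cannot score a ‘Double’ test pair by memorising how often either of its genes occurred with a positive label.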
