Mol Syst Biol. 2018 Dec 20;14(12):e8430.

doi: 10.15252/msb.20188430.

A resource of variant effect predictions of single nucleotide variants in model organisms

Omar Wagih¹, Marco Galardini¹, Bede P Busby¹ ², Danish Memon¹, Athanasios Typas², Pedro Beltrao³

Affiliations

¹ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, UK.
² European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany.
³ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, UK pbeltrao@ebi.ac.uk.

PMID: 30573687
PMCID: PMC6301329
DOI: 10.15252/msb.20188430

Omar Wagih et al. Mol Syst Biol. 2018.

Abstract

The effect of single nucleotide variants (SNVs) in coding and noncoding regions is of great interest in genetics. Although many computational methods aim to elucidate the effects of SNVs on cellular mechanisms, it is not straightforward to comprehensively cover different molecular effects. To address this, we compiled and benchmarked sequence and structure-based variant effect predictors and we computed the impact of nearly all possible amino acid and nucleotide variants in the reference genomes of Homo sapiens, Saccharomyces cerevisiae and Escherichia coli. Studied mechanisms include protein stability, interaction interfaces, post-translational modifications and transcription factor binding sites. We apply this resource to the study of natural and disease coding variants. We also show how variant effects can be aggregated to generate protein complex burden scores that uncover protein complex to phenotype associations based on a set of newly generated growth profiles of 93 sequenced S. cerevisiae strains in 43 conditions. This resource is available through mutfunc (www.mutfunc.com), a tool by which users can query precomputed predictions by providing amino acid or nucleotide-level variants.

Keywords: burden score; genetic variants; genotype‐to‐phenotype; model organisms; resource.

Figures

**Figure 1. Population‐level sequence constraint in genome functional elements**
The level of sequence constraint was estimated using a ratio of the counts of genome variants across individuals of yeast and human compared with a random control region for different functional elements.
Regions buried within a protein structure with a low RSA typically exhibit higher evolutionary constraint.
Similarly, regions buried within interaction interfaces exhibit a high ∆RSA and demonstrate stronger sequence constraints.
Sequence constraint on PTMs, where numbers reflect the number of PTM sites for each modification.
PTMs with a higher number of neighbouring PTMs show stronger constraint.
Variability in constraint among binding sites for TFs with at least 40 sites.
TFBSs that coexist with other binding sites are under stronger constraint.
Position‐specific constraint shows that positions of higher relevance for binding in TFs with at least 20 sites are under stronger constraint. Notches represent the 95% CI in the median, box limits the IQR and upper whiskers the 75th percentile. The horizontal line represents the null expectation of no difference between observed and expected, same as in all other panels of this figure.
Four examples where the bar plots reflect the position‐specific constraint in (blue) and around (grey) the binding site, along with sequence logos for the binding specificities.
Data information: (A, B, F) P‐values represent a one‐sided Wilcoxon test. (A, B, C, D, F) Error bars represent the standard deviation. One hundred random samples were used. (G) P‐value shown is computed using a one‐sided Kolmogorov–Smirnov test.
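
The constraint ratio described in this caption can be illustrated with a minimal sketch. It assumes a single linear genome coordinate, control regions drawn uniformly at random with the same lengths as the functional elements, and the ratio taken as observed over control counts; the function names and the sampling scheme are illustrative assumptions, not the authors' implementation.

```python
import random
import statistics
from bisect import bisect_left


def count_variants(sorted_positions, start, end):
    """Count variant positions p with start <= p < end."""
    return bisect_left(sorted_positions, end) - bisect_left(sorted_positions, start)


def constraint_ratio(variant_positions, elements, genome_length,
                     n_samples=100, seed=0):
    """Mean and standard deviation of the observed/control variant-count ratio.

    elements: list of half-open (start, end) intervals for one class of
    functional element. Each random sample redraws one control region per
    element with the same length (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    positions = sorted(variant_positions)
    observed = sum(count_variants(positions, s, e) for s, e in elements)

    ratios = []
    for _ in range(n_samples):
        control = 0
        for s, e in elements:
            length = e - s
            rs = rng.randrange(0, genome_length - length + 1)
            control += count_variants(positions, rs, rs + length)
        if control > 0:  # skip samples where the ratio is undefined
            ratios.append(observed / control)

    if not ratios:
        raise ValueError("no control sample contained any variants")
    sd = statistics.stdev(ratios) if len(ratios) > 1 else 0.0
    return statistics.mean(ratios), sd
```

Under this convention a ratio below 1 means fewer variants in the element than in matched control regions, which is the direction the caption describes as stronger constraint.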

**Figure 2. The mutfunc resource and benchmarking of underlying variant effect predictors**
A
The mutfunc interface provides an intuitive, user‐friendly way by which users can query the resource using DNA or protein substitutions provided in plain text format or the variant call format (VCF). The impact of variants across different mechanisms is provided with information on impact strength in downloadable format and/or protein structural views.
B
The fraction of variants predicted to affect conserved or structurally important residues for essential and nonessential genes. For yeast SIFT, the numbers of essential/non‐essential genes are 3,967 and 906, respectively. For yeast FoldX, the numbers are 925 and 281. For human SIFT, the numbers are 15,542 and 1,575. For human FoldX, the numbers are 3,702 and 499.
C
Mean SIFT scores and predicted ∆∆G values for human and yeast variants within different MAF bins. Error bars represent the standard error, and P‐values are calculated based on a one‐sided Wilcoxon test.
D
Pathogenic and benign variants were obtained for human (from ClinVar) and yeast (curated) as described in the Materials and Methods section. These were used to benchmark the capacity of different predictors to discriminate between known pathogenic and benign variants.
E, F
The proportion of pathogenic versus benign variants that do or do not disrupt different functional annotations (SLiMs, PTMs or stop gains/losses) in human (E) and yeast (F). Number of replicates is 100 (i.e. random samples).
Data information: (B, E, F) P‐values represent a one‐sided Wilcoxon test. (E, F) Error bars for random samples represent the standard deviation.

**Figure EV1. Impact of structural models on structure‐based variant effect prediction**
We obtained from the ProTherm database variants with experimentally determined impact on stability and classified them as destabilizing if ΔΔG > 2 and not destabilizing otherwise. FoldX‐based predictions were tested on their capacity to discriminate between these two classes of variants using different types of experimental models and regions within homology models with different predicted quality.
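
As a rough illustration of this benchmark, the sketch below labels variants with the ΔΔG > 2 rule from the caption and scores a predictor by the area under the ROC curve. The use of AUC as the discrimination measure, the sign convention (higher predicted ΔΔG means more destabilizing) and the function names are assumptions made for this example; the caption does not specify them.

```python
def label_destabilizing(experimental_ddg, threshold=2.0):
    """Label each variant True (destabilizing) if its measured ddG exceeds the threshold."""
    return [ddg > threshold for ddg in experimental_ddg]


def roc_auc(labels, scores):
    """Probability that a random positive scores higher than a random negative.

    Ties count as half, which makes this equivalent to the area under the ROC curve.
    """
    pos = [s for lab, s in zip(labels, scores) if lab]
    neg = [s for lab, s in zip(labels, scores) if not lab]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical values, for illustration only.
experimental = [0.5, 3.0, 2.5, 1.0]
predicted = [0.2, 2.8, 1.5, 0.9]
auc = roc_auc(label_destabilizing(experimental), predicted)
```

Running the same comparison separately for each type of structural model would give one AUC per model type, which is the kind of comparison the caption describes.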

**Figure 3. Analysis of variants of uncertain clinical significance using mutfunc**
A–C
Three examples of interaction interfaces containing variants predicted to impact binding stability. Subunits of the interaction complex are coloured in dark grey and white, and respective interface residues in dark green and green.
D, E
Two examples of variants predicted to impact protein stability. Pathogenic variants are labelled “P” in red, and VUSs “U” in blue.

**Figure 4. Phenotypic screening of 166 yeast strains**
Concordance between replicate s‐score measurements.
Heatmap of s‐scores showing hierarchical clustering of both strains and conditions reveals clusters of phenotypically similar strains and conditions.
Comparison of pairwise genotype and phenotype distances between 93 sequenced strains shows little observable correlation.

**Figure 5. Gene and protein complex‐level aggregation of variant effects for phenotype association analysis**
Diagram demonstrating the aggregation of variant impact. Each variant is first assigned a probability of deleteriousness; these probabilities are then aggregated at the gene level using the maximum impact.
The probability of deleteriousness for FoldX and SIFT was computed by assessing the proportion of deleterious variants in gold‐standard data for FoldX and SIFT. A logistic regression model (red line) is fit to compute subsequent probabilities. Protein complex‐level burden scores were taken to be the maximal burden for any complex member.
Using the gene and complex burden scores for each strain, gene/complex‐phenotype associations were carried out.
Volcano plot with gene–complex associations highlighting the effect size and P‐value of selected examples.
S‐score growth distributions for strains having a low (P_AF < 90, red) or high (P_AF > 90, blue) burden score for three selected complexes. The protein subunits of each complex are shown, with affected subunits in blue and the number of strains in which the subunit is predicted to be impaired in parentheses. Subunits in red are not predicted to be impaired in any strain.
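
The aggregation scheme in this caption can be sketched as follows: a logistic function maps a predictor score to a probability of deleteriousness, the gene burden is the maximum over that gene's variants, and the complex burden is the maximum over the complex members. The coefficients, gene names and complex names below are hypothetical placeholders, and treating genes without variants as having zero burden is an assumption of this sketch.

```python
import math
from collections import defaultdict


def p_deleterious(score, intercept, slope):
    """Logistic mapping from a predictor score to a probability of deleteriousness.

    intercept and slope would come from a fit to gold-standard data;
    the values used in any example here are placeholders.
    """
    return 1.0 / (1.0 + math.exp(-(intercept + slope * score)))


def gene_burden(variant_probs):
    """Maximum probability of deleteriousness per gene.

    variant_probs: iterable of (gene, probability) pairs for one strain.
    """
    burden = defaultdict(float)
    for gene, prob in variant_probs:
        burden[gene] = max(burden[gene], prob)
    return dict(burden)


def complex_burden(gene_burdens, complexes):
    """Maximum gene burden over the members of each complex.

    Genes with no variants in the strain are assumed to have burden 0.
    """
    return {
        name: max((gene_burdens.get(g, 0.0) for g in members), default=0.0)
        for name, members in complexes.items()
    }


# Hypothetical strain with three variants in two genes.
variants = [("GENE_A", 0.2), ("GENE_A", 0.9), ("GENE_B", 0.4)]
genes = gene_burden(variants)
complexes = complex_burden(genes, {"CPX1": ["GENE_A", "GENE_B", "GENE_C"],
                                   "CPX2": ["GENE_C"]})
```

Taking the maximum means a single strongly deleterious variant is enough to flag a gene, and a single impaired subunit is enough to flag a complex.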

**Figure EV2. Gene and protein complex‐level phenotype association analysis shows significant but modest enrichment in prior knowledge of gene KO growth phenotypes**
The fraction of gene–phenotype associations that is validated by chemical genetic information derived from gene‐deletion experiments. Shaded area reports the interquartile range over 100 iterations. The significance of the observed overlap was tested using permutation testing.
Associations between protein complexes and conditions were benchmarked by calculating the enrichment of previously known gene‐condition associations from gene‐deletion studies. Shaded area reports the interquartile range over 100 iterations. An enrichment was observed at some cut‐offs for the gene‐deletion condition‐dependent essentiality, but it was better than random expectation only at stringent cut‐offs.

Comment in

From prioritisation to understanding: mechanistic predictions of variant effects.
Slodkowicz G, Babu MM. Mol Syst Biol. 2018 Dec 20;14(12):e8741. doi: 10.15252/msb.20188741. PMID: 30573689. Free PMC article.

Grants and funding

MC_U105185859/MRC_/Medical Research Council/United Kingdom
