BMC Genomics. 2015;16 Suppl 8(Suppl 8):S1.
doi: 10.1186/1471-2164-16-S8-S1. Epub 2015 Jun 18.

Better prediction of functional effects for sequence variants

Maximilian Hecht et al. BMC Genomics. 2015.

Abstract

Elucidating the effects of naturally occurring genetic variation is one of the major challenges for personalized health and personalized medicine. Here, we introduce SNAP2, a novel neural network based classifier that improves over the state-of-the-art in distinguishing between effect and neutral variants. Our method's improved performance results from screening many potentially relevant protein features and from refining our development data sets. Cross-validated on >100k experimentally annotated variants, SNAP2 significantly outperformed other methods, attaining a two-state accuracy (effect/neutral) of 83%. SNAP2 also outperformed combinations of other methods. Performance increased for human variants but much more so for other organisms. Our method's carefully calibrated reliability index informs selection of variants for experimental follow up, with the most strongly predicted half of all effect variants predicted at over 96% accuracy. As expected, the evolutionary information from automatically generated multiple sequence alignments gave the strongest signal for the prediction. However, we also optimized our new method to perform surprisingly well even without alignments. This feature reduces prediction runtime by over two orders of magnitude, enables cross-genome comparisons, and renders our new method as the best solution for the 10-20% of sequence orphans. SNAP2 is available at: https://rostlab.org/services/snap2web.

Figures

Figure 1
SNAP2 performs best for the ALL data set. This figure shows performance estimates for the ALL data set. Our new method SNAP2 (dark blue, AUC = 0.905) outperforms its predecessor SNAP (light blue, AUC = 0.880), PolyPhen-2 (orange, AUC = 0.853) and SIFT (green, AUC = 0.838) over the entire spectrum of the Receiver Operating Characteristic (ROC) curve. Curves are significantly different from each other at a significance level of P < 10^-4 as measured by the DeLong method [59]. All SNAP2 results were computed on the test sets not used in training after a rigorous split into training, cross-training and testing. Results for PolyPhen-2 and our original SNAP included some of those proteins in their training, suggesting over-estimated performance.
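As a reminder of what the AUC values in this caption measure, the following minimal sketch computes the area under the ROC curve as the probability that a randomly chosen effect variant receives a higher score than a randomly chosen neutral one (ties count half). The function name and the toy scores are illustrative assumptions; they are not taken from the paper or from SNAP2.

```python
def roc_auc(effect_scores, neutral_scores):
    """Area under the ROC curve via pairwise comparison.

    Assumes higher scores mean "more effect-like". Ties count as half a win.
    Quadratic in the number of variants, so suitable only for small examples.
    """
    wins = 0.0
    for e in effect_scores:
        for n in neutral_scores:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(effect_scores) * len(neutral_scores))


# Toy scores (hypothetical, for illustration only).
toy_effect = [0.9, 0.6, 0.2]
toy_neutral = [0.7, 0.1, -0.4]
toy_auc = roc_auc(toy_effect, toy_neutral)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.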
Figure 2
Naïve combination is not better than individual methods for PMD_HUMAN data. This figure shows accuracy-coverage curves for the PMD_HUMAN data. The x-axes indicate coverage (also referred to as 'recall'; Eqn. 1.2), i.e. the percentage of observed neutral (a) and of observed effect (b) variants that are correctly predicted at the given threshold. The y-axes indicate accuracy (also referred to as 'precision'; Eqn. 1.2), i.e. the percentage of neutral (a) and effect (b) variants among all variants predicted in either class at the given threshold. Arrows mark the performance at the default thresholds for our new method SNAP2 (dark blue), for SIFT (green), and for PolyPhen-2 (orange). A brown triangle/arrow marks the performance of a (non-optimized) method that combines PolyPhen-2 and SIFT. This combination did not perform better than SNAP2 alone (brown triangle vs. blue SNAP2 curves).
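The per-class accuracy (precision) and coverage (recall) described in this caption can be sketched as follows. This is a minimal illustration, assuming that scores above the threshold are called "effect" and all others "neutral"; the function name and toy data are hypothetical and not part of SNAP2.

```python
def accuracy_coverage(scores, observed, threshold, target="effect"):
    """Per-class accuracy (precision) and coverage (recall) at one threshold.

    scores:   predicted scores; higher means more effect-like (assumption).
    observed: observed classes, each "effect" or "neutral".
    target:   the class for which accuracy and coverage are reported.
    """
    predicted = ["effect" if s > threshold else "neutral" for s in scores]
    correct = sum(p == target and o == target for p, o in zip(predicted, observed))
    n_predicted = sum(p == target for p in predicted)
    n_observed = sum(o == target for o in observed)
    accuracy = correct / n_predicted if n_predicted else float("nan")
    coverage = correct / n_observed if n_observed else float("nan")
    return accuracy, coverage


# Toy data (hypothetical, for illustration only).
toy_scores = [0.9, 0.4, -0.2, -0.7, 0.1]
toy_observed = ["effect", "effect", "neutral", "neutral", "neutral"]
effect_acc, effect_cov = accuracy_coverage(toy_scores, toy_observed, 0.0, "effect")
neutral_acc, neutral_cov = accuracy_coverage(toy_scores, toy_observed, 0.0, "neutral")
```

Sweeping the threshold over the score range and plotting accuracy against coverage for each class would give curves of the kind the caption describes.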
Figure 3
SNAP2 and PolyPhen-2 are best for difficult human variants. Bars mark the two-state accuracy (Q2; Eqn. 4) at the default thresholds for SNAP2 (dark blue), SNAP (light blue), SIFT (green), and PolyPhen-2 (orange). Random prediction performance, assuming a 60:40 effect:neutral background, is given in pink. Analysis is based on 3,963 'difficult' cases (2,589 effect; 1,374 neutral) from the PMD_HUMAN set. Difficult cases were defined as variants on which the predictions of the above methods disagreed, i.e. cases where not all methods (excluding random) gave the same prediction.
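The selection of 'difficult' cases and the two-state accuracy can be sketched as below. The sketch assumes that Q2 is the fraction of variants whose class (effect or neutral) is predicted correctly; the method labels "A", "B", "C" and the toy variants are hypothetical placeholders, not data from the paper.

```python
def is_difficult(calls):
    """True if the methods do not all give the same prediction."""
    return len(set(calls)) > 1


def q2(predicted, observed):
    """Two-state accuracy: fraction of correctly predicted variants (assumed definition)."""
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Toy variants (hypothetical): observed class plus one call per method.
variants = [
    {"observed": "effect",  "calls": {"A": "effect",  "B": "effect",  "C": "effect"}},
    {"observed": "effect",  "calls": {"A": "effect",  "B": "neutral", "C": "effect"}},
    {"observed": "neutral", "calls": {"A": "effect",  "B": "neutral", "C": "neutral"}},
    {"observed": "neutral", "calls": {"A": "neutral", "B": "neutral", "C": "neutral"}},
]

# Keep only variants on which the methods disagree.
difficult = [v for v in variants if is_difficult(v["calls"].values())]
observed = [v["observed"] for v in difficult]

# Q2 per method, restricted to the difficult subset.
q2_by_method = {
    name: q2([v["calls"][name] for v in difficult], observed)
    for name in ("A", "B", "C")
}
```

Restricting the evaluation to cases of disagreement removes the variants that every method gets right or wrong together, which is why differences between methods appear larger on this subset.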
Figure 4
SNAP2 threshold and reliability. The reliability index provides a means of focusing on the most accurate predictions. Panel (a) shows SNAP2 performance on the balanced PMD/EC data set over the entire spectrum of accuracy (solid lines) and coverage (dotted lines) for both effect (red) and neutral (green) variants depending on the chosen threshold (x-axis). The default threshold was set to -0.05, where neutral and effect predictions performed alike (black arrow). By moving the decision threshold, users can optimize predictive behavior towards their research needs: predictions at higher absolute scores (e.g. TP>0.5 or TN<-0.5) are much more likely to be correct, but they are not available for all variants. Panel (b) directly relates the reliability index (RI) to the performance on our data. Shown is the cumulative percentage of predictions (x-axis) against accuracy (solid lines) and coverage (dotted lines) above a given reliability index (RI; Methods). Accuracy and coverage are shown separately for neutral (green) and effect (red) predictions. Each marker depicts a reliability threshold ranging from 0 (rightmost marker, low reliability) to 9 (leftmost marker, high reliability). Labels for RI >= 2, 4, and 6 are skipped for simplicity. For instance, 58% of all predictions in our cross-validation were made at reliability levels of 7 or higher (gray arrows). At this reliability, 95% of all effect predictions and 90% of all neutral predictions were correct.
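The trade-off described in panel (a) can be illustrated with a small sketch. It uses the default threshold of -0.05 from the caption and assumes that scores above the threshold are called "effect"; how a score exactly at the threshold is handled, the function names, and the toy scores are assumptions made for illustration. The mapping from score to reliability index is defined in the paper's Methods and is not reproduced here, so the sketch filters on absolute score instead.

```python
DEFAULT_THRESHOLD = -0.05  # default decision threshold stated in the caption


def classify(score, threshold=DEFAULT_THRESHOLD):
    """Two-state call; treating score == threshold as neutral is an assumption."""
    return "effect" if score > threshold else "neutral"


def confident_subset(scores, min_abs_score=0.5):
    """Keep only predictions whose absolute score reaches the cutoff."""
    return [(s, classify(s)) for s in scores if abs(s) >= min_abs_score]


# Toy scores (hypothetical, for illustration only).
toy_scores = [0.8, 0.1, -0.03, -0.6, -0.9]
all_calls = [classify(s) for s in toy_scores]
kept = confident_subset(toy_scores, min_abs_score=0.5)
fraction_kept = len(kept) / len(toy_scores)
```

Raising the cutoff keeps fewer variants, which mirrors the caption's point that the most reliable predictions are not available for all variants.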

References

    1. Zuckerkandl E, Pauling L. Molecules as documents of evolutionary history. Journal of Theoretical Biology. 1965;8:357–366. doi: 10.1016/0022-5193(65)90083-4.
    2. Schwarz JM, Rodelsperger C, Schuelke M, Seelow D. MutationTaster evaluates disease-causing potential of sequence alterations. Nat Methods. 2010;7(8):575–576. doi: 10.1038/nmeth0810-575.
    3. Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly. 2012;6(2):80–92. doi: 10.4161/fly.19695.
    4. McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics. 2010;26(16):2069–2070. doi: 10.1093/bioinformatics/btq330.
    5. Schaefer C, Rost B. Predict impact of single amino acid change upon protein structure. BMC Genomics. 2012;13(Suppl 4):S4. doi: 10.1186/1471-2164-13-S4-S4.