PLoS One. 2018 Mar 14;13(3):e0192829.
doi: 10.1371/journal.pone.0192829. eCollection 2018.

Generalising better: Applying deep learning to integrate deleteriousness prediction scores for whole-exome SNV studies

Ilia Korvigo et al. PLoS One. 2018.

Abstract

Many automatic classifiers have been introduced to aid inference of the phenotypical effects of uncategorised nonsynonymous single nucleotide variations (nsSNVs) in theoretical and medical applications. Lately, several meta-estimators have been proposed that combine different predictors, such as PolyPhen and SIFT, to integrate more information in a single score. Although many advances have been made in feature design and in the machine learning algorithms used, the shortage of high-quality reference data, along with the bias towards intensively studied in vitro models, calls for improved generalisation ability in order to further increase classification accuracy and handle records with insufficient data. Since a meta-estimator essentially combines different scoring systems with highly complicated nonlinear relationships, we investigated how deep learning (supervised and unsupervised), which is particularly efficient at discovering hierarchies of features, can improve classification performance. While it is commonly believed that deep learning should be reserved for high-dimensional input spaces, and that other models (logistic regression, support vector machines, Bayesian classifiers, etc.) should be used for simpler inputs, we believe that the ability of neural networks to discover intricate structure in highly heterogeneous datasets can aid a meta-estimator. We compare the performance of our models with that of various popular predictors, many of which are recommended by the American College of Medical Genetics and Genomics (ACMG), as well as with available deep learning-based predictors. Thanks to hardware acceleration, we were able to use a computationally expensive genetic algorithm to stochastically optimise hyper-parameters over many generations. Overfitting was limited by noise injection and dropout, which restrict co-adaptation of hidden units.
Although we stress that this work was not conceived as a tool comparison, but rather as an exploration of the possibilities of applying deep learning to ensemble scores, our results show that even relatively simple modern neural networks can significantly improve both prediction accuracy and coverage. We provide open access to our best model via the website: http://score.generesearch.ru/services/badmut/.
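To make the hyper-parameter search mentioned in the abstract more concrete, the following is a minimal sketch of a genetic algorithm with elitism, crossover and mutation. It is not the authors' implementation: the search space (`SPACE`), the stand-in objective `toy_fitness`, and all numeric settings are assumptions chosen for illustration. In a real run the fitness of an individual would be the validation score of a network trained with those hyper-parameters.

```python
import random

# Hypothetical search space: (low, high) bounds for each hyper-parameter.
SPACE = {
    "learning_rate": (1e-4, 1e-1),
    "dropout": (0.0, 0.8),
    "hidden_units": (8.0, 256.0),  # kept continuous here; a real run would round
}

def toy_fitness(params):
    """Stand-in for a validation score (assumption, not from the paper)."""
    return -((params["dropout"] - 0.3) ** 2) - ((params["hidden_units"] - 64) / 256) ** 2

def random_individual(rng):
    """Sample every hyper-parameter uniformly within its bounds."""
    return {name: rng.uniform(low, high) for name, (low, high) in SPACE.items()}

def crossover(a, b, rng):
    """Uniform crossover: each gene comes from either parent with equal chance."""
    return {name: (a[name] if rng.random() < 0.5 else b[name]) for name in SPACE}

def mutate(params, rng, rate=0.3):
    """Perturb some genes with Gaussian noise and clip them to the bounds."""
    child = dict(params)
    for name, (low, high) in SPACE.items():
        if rng.random() < rate:
            step = rng.gauss(0.0, 0.1 * (high - low))
            child[name] = min(high, max(low, child[name] + step))
    return child

def evolve(fitness, generations=30, population_size=20, elite=2, seed=0):
    """Return the best individual and the best fitness seen in each generation."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: population_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - elite)
        ]
        population = population[:elite] + children  # elites survive unchanged
    return max(population, key=fitness), history

best, history = evolve(toy_fitness)
```

Because the top individuals are copied unchanged into the next generation, the best fitness recorded in `history` never decreases. The cost lies entirely in the fitness evaluations, which is why the abstract ties this kind of search to hardware acceleration.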


Conflict of interest statement

Competing Interests: Ilia Korvigo was previously employed by iBinon Inc and worked on this project before leaving the company. The commercial affiliation of Mr. Andrey Afanasyev and the previous affiliation of Ilia Korvigo do not alter our adherence to PLOS ONE policies on sharing data and materials. There are no patents, products in development or marketed products to declare.

Figures

Fig 1. Prediction inconsistency.
A heatmap of Spearman correlation between rank-transformed output values of different deleteriousness scoring systems. 1000F—allele frequency according to the 1000 Genomes project. Greater absolute correlation means greater consistency.
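The quantity behind such a heatmap can be sketched as follows: each score vector is rank-transformed (ties receive the average rank), and the Pearson correlation of the ranks gives the Spearman coefficient. This is a minimal standard-library sketch, not the authors' code. The helper names and the score names `score_a`, `score_b` and `score_c` are hypothetical, and the values are invented purely for illustration.

```python
import math

def average_ranks(values):
    """Rank-transform a list (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return float("nan")
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))

def correlation_matrix(scores):
    """Pairwise Spearman correlation for a dict mapping score name to values."""
    names = list(scores)
    return {(a, b): spearman(scores[a], scores[b]) for a in names for b in names}

# Invented values for three hypothetical scoring systems over five variants.
scores = {
    "score_a": [0.10, 0.40, 0.35, 0.80, 0.95],
    "score_b": [0.90, 0.50, 0.60, 0.30, 0.05],  # oriented the opposite way
    "score_c": [0.20, 0.20, 0.70, 0.60, 0.90],
}
matrix = correlation_matrix(scores)
```

Scoring systems can be oriented in opposite directions, so a strongly negative coefficient indicates agreement just as a strongly positive one does. This is why the caption speaks of absolute correlation.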
Fig 2. Network types.
Schematic representation of basic deep learning models used in this study. (a) A multilayer perceptron (MLP). (b) A shallow denoising autoencoder (dAE). (c) Connecting dAEs into a stacked denoising autoencoder (sdAE); notice that each individual dAE learns to reconstruct the latent representation from the previous one (data stream is represented by arrows). Colours encode layer functions (combinations are possible): blue—input, light-red—latent, dark-red—dropout (noise), purple—output, hollow—discarded.
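The noise layers in the schematic can be illustrated with a short sketch. A denoising autoencoder receives a corrupted copy of its input and is trained to reconstruct the clean version, while dropout randomly silences hidden units during training. The code below assumes masking noise for the corruption and the common "inverted" dropout scaling. Both choices, along with the function names and rates, are assumptions for illustration and are not taken from the paper.

```python
import random

def masking_noise(x, corruption=0.3, rng=random):
    """Zero each input feature independently with probability `corruption`."""
    return [0.0 if rng.random() < corruption else v for v in x]

def inverted_dropout(h, rate=0.5, rng=random, training=True):
    """Drop each activation with probability `rate` and rescale the survivors.

    Rescaling by 1 / (1 - rate) keeps the expected activation unchanged,
    so no adjustment is needed at prediction time (training=False).
    """
    if not training or rate == 0.0:
        return list(h)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in h]

rng = random.Random(0)
clean = [0.2, 0.9, 0.5, 0.7, 0.1]       # invented feature vector
corrupted = masking_noise(clean, corruption=0.3, rng=rng)
# A dAE would be trained to map `corrupted` back to `clean`;
# in a stacked dAE, the next dAE repeats this on the learned latent codes.
hidden = inverted_dropout([0.4, 0.6, 0.1, 0.8], rate=0.5, rng=rng)
```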
Fig 3. Nonlinearities.
The sigmoid (a) and hyperbolic tangent (b) iteratively applied 3 times. Observe how repeated application of the sigmoid function quickly makes the gradient vanish completely.
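The effect can be checked numerically with the chain rule: the derivative of a function applied three times is the product of its derivative evaluated at each intermediate value. The sigmoid's derivative never exceeds 0.25, so three applications scale the gradient by at most 0.25^3 (about 0.016), whereas the derivative of tanh reaches 1 at the origin. The helper `iterated_gradient` below is a hypothetical name used for this sketch only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum value 0.25, reached at x = 0

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2  # maximum value 1, reached at x = 0

def iterated_gradient(f, df, x, times=3):
    """Apply f `times` times to x; return the output and d(output)/d(x)."""
    grad = 1.0
    for _ in range(times):
        grad *= df(x)   # chain rule: multiply by the local derivative
        x = f(x)
    return x, grad

for x0 in (-2.0, 0.0, 2.0):
    _, grad_sigmoid = iterated_gradient(sigmoid, d_sigmoid, x0)
    _, grad_tanh = iterated_gradient(math.tanh, d_tanh, x0)
    print(f"x={x0:+.1f}  sigmoid^3 gradient={grad_sigmoid:.5f}  tanh^3 gradient={grad_tanh:.5f}")
```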
Fig 4. ROC curves.
MLP, MetaLR, MetaSVM, sdAE and MutationTaster produced the largest areas under the curve.

