Statistical geometry based prediction of nonsynonymous SNP functional effects using random forest and neuro-fuzzy classifiers
- PMID: 18186470
- DOI: 10.1002/prot.21838
Abstract
There is substantial interest in methods designed to predict the effect of nonsynonymous single nucleotide polymorphisms (nsSNPs) on protein function, given their potential relationship to heritable diseases. Current state-of-the-art supervised machine learning algorithms, such as random forest (RF), train models that classify single amino acid mutations in proteins as either neutral or deleterious to function. However, the functional effect of a polymorphism on a protein frequently lies between these two extremes. Classifiers that incorporate fuzzy logic provide a natural extension that accounts for the spectrum of possible functional consequences. We generated a dataset of single amino acid substitutions in human proteins having known three-dimensional structures. Each variant was uniquely represented as a feature vector that included computational geometry and knowledge-based statistical potential predictors obtained through application of Delaunay tessellation of protein structures. Additional attributes consisted of physicochemical properties of the native and replacement amino acids as well as the topological location of the mutated residue position in the solved structure. Classification performance of the RF algorithm was evaluated on a training set consisting of the disease-associated and neutral nsSNPs taken from our dataset, and attributes were ranked according to their relative importance. Similarly, we evaluated the performance of an adaptive neuro-fuzzy inference system (ANFIS). The utility of statistical geometry predictors was compared with that of traditional structural and evolutionary attributes employed by other researchers, revealing an equally effective yet complementary methodology. Among all attributes in our feature set, the statistical geometry predictors were found to be the most highly ranked.
On the basis of the AUC (area under the ROC curve) measure of performance, the ANFIS and RF models were equally effective when only statistical geometry features were utilized. Tenfold cross-validation studies evaluating AUC, balanced error rate (BER), and Matthews correlation coefficient (MCC) showed that our RF model was at least comparable with the well-established methods of SIFT and PolyPhen. The trained RF and ANFIS models were each subsequently used to predict the disease potential of human nsSNPs in our dataset that are currently unclassified (http://rna.gmu.edu/FuzzySnps/).
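The three performance measures named in the abstract (AUC, BER, and MCC) can be illustrated with a minimal Python sketch. This is not the authors' implementation; the function names, the 0.5 decision threshold, and the toy labels and scores below are all assumptions chosen purely for illustration (1 = deleterious, 0 = neutral). The definitions used are the standard ones: BER is the mean of the false negative and false positive rates, MCC is the correlation between true and predicted binary labels, and AUC is computed as the fraction of (positive, negative) pairs ranked correctly.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_error_rate(tp, tn, fp, fn):
    """Mean of the false negative rate and the false positive rate."""
    return 0.5 * (fn / (tp + fn) + fp / (tn + fp))


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not taken from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
threshold = 0.5  # assumed cutoff for turning scores into class labels
y_pred = [1 if s >= threshold else 0 for s in scores]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
ber = balanced_error_rate(tp, tn, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
auc = roc_auc(y_true, scores)
print(f"BER={ber:.3f}  MCC={mcc:.3f}  AUC={auc:.3f}")
```

In a tenfold cross-validation setting such as the one described, these measures would typically be computed on each held-out fold (or on the pooled held-out predictions) rather than on the training data. BER and MCC depend on the chosen threshold, whereas AUC depends only on how the scores rank the variants.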