PredictSNP: robust and accurate consensus classifier for prediction of disease-related mutations

Jaroslav Bendl¹, Jan Stourac², Ondrej Salanda³, Antonin Pavelka⁴, Eric D Wieben⁵, Jaroslav Zendulka³, Jan Brezovsky⁴, Jiri Damborsky²

Affiliations

¹ Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment, Faculty of Science, Masaryk University, Brno, Czech Republic ; Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic ; Center of Biomolecular and Cellular Engineering, International Centre for Clinical Research, St. Anne's University Hospital Brno, Brno, Czech Republic.
² Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment, Faculty of Science, Masaryk University, Brno, Czech Republic ; Center of Biomolecular and Cellular Engineering, International Centre for Clinical Research, St. Anne's University Hospital Brno, Brno, Czech Republic.
³ Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic.
⁴ Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment, Faculty of Science, Masaryk University, Brno, Czech Republic.
⁵ Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, Minnesota, United States of America.

PMID: 24453961
PMCID: PMC3894168
DOI: 10.1371/journal.pcbi.1003440

Jaroslav Bendl et al. PLoS Comput Biol. 2014 Jan;10(1):e1003440. doi: 10.1371/journal.pcbi.1003440. Epub 2014 Jan 16.

Abstract

Single nucleotide variants represent a prevalent form of genetic variation. Mutations in coding regions are frequently associated with the development of various genetic diseases. Computational tools for predicting the effects of mutations on protein function are very important for the analysis of single nucleotide variants and their prioritization for experimental characterization. Many computational tools are already widely employed for this purpose. Unfortunately, their comparison and further improvement are hindered by large overlaps between the training datasets and benchmark datasets, which lead to biased and overly optimistic reported performances. In this study, we have constructed three independent datasets by removing all duplicates, inconsistencies and mutations previously used in the training of the evaluated tools. The benchmark dataset, containing over 43,000 mutations, was employed for the unbiased evaluation of eight established prediction tools: MAPP, nsSNPAnalyzer, PANTHER, PhD-SNP, PolyPhen-1, PolyPhen-2, SIFT and SNAP. The six best performing tools were combined into a consensus classifier, PredictSNP, which achieved significantly improved prediction performance and at the same time returned results for all mutations, confirming that consensus prediction represents an accurate and robust alternative to the predictions delivered by individual tools. A user-friendly web interface enables easy access to all eight prediction tools, the consensus classifier PredictSNP, and annotations from the Protein Mutant Database and the UniProt database. The web server and the datasets are freely available to the academic community at http://loschmidt.chemi.muni.cz/predictsnp.
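
The abstract does not state how the individual predictions are combined. As a rough illustration of the general idea of a consensus classifier that still returns a result when some tools fail, the following minimal sketch uses a weighted vote. The function name `consensus_vote`, the choice of tools and the weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional


def consensus_vote(
    predictions: Dict[str, Optional[bool]],
    weights: Dict[str, float],
) -> Optional[float]:
    """Combine per-tool calls into a score in [-1, 1].

    True means a tool calls the mutation deleterious, False means neutral,
    and None means the tool returned no prediction. A positive score leans
    deleterious, a negative score leans neutral. Returns None only if no
    tool produced a prediction at all.
    """
    total = 0.0
    weight_sum = 0.0
    for tool, call in predictions.items():
        if call is None:
            # Skip tools that failed, so the remaining tools still decide.
            continue
        w = weights.get(tool, 1.0)  # unknown tools get a neutral weight of 1
        total += w if call else -w
        weight_sum += w
    if weight_sum == 0:
        return None
    return total / weight_sum


# Hypothetical weights, chosen only for illustration.
example_weights = {"SIFT": 0.8, "SNAP": 0.6, "MAPP": 0.7}
example_calls = {"SIFT": True, "MAPP": None, "SNAP": False}
score = consensus_vote(example_calls, example_weights)
print("deleterious" if score is not None and score > 0 else "neutral or undecided")
```

Because missing predictions are skipped rather than treated as votes, the combined score is defined whenever at least one tool responds, which is one simple way a consensus can cover mutations that individual tools cannot evaluate.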

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. Workflow diagram describing construction of independent datasets.**
The various sources of mutation data are shown in yellow, intermediate datasets in white, Protein Mutant Database (PMD) testing dataset and the testing dataset compiled from studies on massively mutated proteins (MMP) in blue, and PredictSNP benchmark dataset in green. The data from the original training datasets of all evaluated tools shown in red were removed from newly constructed datasets.
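
The filtering described for the independent datasets (removing duplicates, inconsistent annotations, and mutations already used to train the evaluated tools) could be sketched as follows. This is an assumption-laden illustration, not the authors' pipeline: the record layout `(protein_id, mutation, label)` and the function name `build_independent_dataset` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Record = Tuple[str, str, str]  # (protein_id, mutation, label), hypothetical layout
Key = Tuple[str, str]          # (protein_id, mutation)


def build_independent_dataset(
    records: Iterable[Record],
    training_mutations: Set[Key],
) -> List[Record]:
    """Return records that are unique, consistently labelled, and unseen in training."""
    labels_seen: Dict[Key, Set[str]] = defaultdict(set)
    for protein_id, mutation, label in records:
        # Grouping by key collapses exact duplicates automatically.
        labels_seen[(protein_id, mutation)].add(label)

    clean: List[Record] = []
    for key, labels in labels_seen.items():
        if len(labels) > 1:
            continue  # inconsistent: same mutation annotated with different labels
        if key in training_mutations:
            continue  # overlaps with a tool's training data
        clean.append((key[0], key[1], next(iter(labels))))
    return sorted(clean)


raw = [
    ("P1", "A10G", "deleterious"),
    ("P1", "A10G", "deleterious"),  # duplicate
    ("P2", "L5P", "neutral"),
    ("P2", "L5P", "deleterious"),   # inconsistent with the previous record
    ("P3", "K7R", "neutral"),       # present in a training set
    ("P4", "D2E", "neutral"),
]
print(build_independent_dataset(raw, {("P3", "K7R")}))
```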

**Figure 2. Distribution of amino acids in PredictSNP benchmark dataset.**
Expected distributions of amino acid residues were extracted from 105,990 sequences in the non-redundant OWL protein database (release 26.0).

**Figure 3. Overall receiver operating characteristic curves for all three independent datasets.**
Comparison of PredictSNP and its constituent tools on the PredictSNP benchmark dataset (A). Comparison of PredictSNP and other consensus classifiers on the MMP dataset (B) and the PMD-UNIPROT dataset (C). The dashed line represents random ranking with an AUC of 0.5.
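
To make the AUC baseline concrete, the sketch below computes the area under the ROC curve as the probability that a randomly chosen deleterious mutation is scored higher than a randomly chosen neutral one, counting ties as one half. The function name `roc_auc` and the toy scores are illustrative assumptions and are unrelated to the reported results.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC via pairwise comparison of positive and negative scores.

    labels use 1 for deleterious (positive) and 0 for neutral (negative).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute AUC")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# A perfect ranking yields 1.0; a constant score is uninformative and yields 0.5.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
print(roc_auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```

The pairwise form is quadratic in the number of mutations, so it suits small illustrations; a rank-based computation would be preferable for a dataset with tens of thousands of entries.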

**Figure 4. Workflow diagram of PredictSNP.**
Upon submission of the input sequence and specification of investigated mutations, integrated predictors of pathogenicity are employed for evaluation of the mutation and the consensus prediction is calculated. In the meantime, UniProt and PMD databases are queried to gather the relevant annotations.

**Figure 5. Graphical user interface of PredictSNP.**
The web server input (left) and output (right) pages.
