Assessment of computational methods for predicting the effects of missense mutations in human cancers

Florian Gnad¹, Albion Baucom, Kiran Mukhyala, Gerard Manning, Zemin Zhang

PMID: 23819521
PMCID: PMC3665581
DOI: 10.1186/1471-2164-14-S3-S7

Comparative Study

Florian Gnad et al. BMC Genomics. 2013.

BMC Genomics. 2013;14 Suppl 3(Suppl 3):S7.

doi: 10.1186/1471-2164-14-S3-S7. Epub 2013 May 28.

Affiliation

¹ Department of Bioinformatics and Computational Biology, Genentech Inc, South San Francisco, CA 94080, USA.

Abstract

Background: Recent advances in sequencing technologies have greatly increased the number of mutations identified in cancer genomes. However, identifying cancer-driving mutations remains a significant challenge, since most observed missense changes are neutral passenger mutations. Various computational methods have been developed to predict the effects of amino acid substitutions on protein function and to classify mutations as deleterious or benign. These include approaches that rely on evolutionary conservation, structural constraints, or physicochemical attributes of amino acid substitutions. Here we review existing methods and further examine eight tools (SIFT, PolyPhen-2, Condel, CHASM, mCluster, LogRE, SNAP, and MutationAssessor) with respect to their coverage, accuracy, availability, and dependence on other tools.

Results: Single nucleotide polymorphisms with high minor allele frequencies were used as a negative (neutral) set for testing, and recurrent mutations from the COSMIC database, together with novel recurrent somatic mutations identified in very recent cancer studies, were used as positive (non-neutral) sets. Conservation-based methods generally had moderately high accuracy in distinguishing neutral from deleterious mutations, whereas the performance of machine-learning-based predictors with comprehensive feature spaces varied between assessments using different positive sets. MutationAssessor consistently provided the highest accuracies. For certain combinations, metapredictors slightly improved on the performance of the individual methods they included, but did not outperform MutationAssessor as a stand-alone tool.

Conclusions: Our independent assessment of existing tools reveals various performance disparities. Cancer-trained methods did not improve upon more general predictors. No method or combination of methods exceeded 81% accuracy, indicating that there is still significant room for improvement in driver mutation prediction and that more sophisticated feature integration may be needed to develop a more robust tool.

Figures

**Figure 1**
**Overview of representative predictors**. Predictors are annotated with the basis of their predictions, their cancer-specificity, and their reliance on each other. The pioneering SIFT method uses conservation information to predict the functional impact of amino acid changes. Several other approaches integrate SIFT results (arrows pointing to SIFT). The power of evolutionary information as an input feature is reflected by the number of classifiers that use conservation for prediction. For example, MAPP, SIFT, Align-GVGD, MutationAssessor and LogRE are predominantly based on conservation. PolyPhen-2 additionally integrates structure to classify mutations as deleterious or benign. Consensus classifiers such as Condel combine multiple predictive tools. The neural network-based SNAP represents one of several recently developed methods that rely on training sets and a large set of discriminatory features. Cancer-specific tools such as mCluster are specifically designed to identify driver mutations and also depend on mutation training sets. The machine-learning-based method CHASM spans an extensive feature space and is trained on canonical cancer driver mutations. In addition to evolutionary information, CanPredict takes into account gene ontology annotation for classifying oncogenes.

**Figure 2**
**Distribution of missense mutations in COSMIC and dbSNP**. (A) Most somatic non-synonymous mutations in COSMIC were identified in only one tumor sample. 7% of missense mutations were identified in two or more cancer samples. (B) In the dbSNP database, global minor allele frequencies are provided for single nucleotide polymorphisms that were identified in the 1000 Genomes Project. 10% of the missense mutations have a minor allele frequency of 0.25 or higher, which increases their likelihood of being neutral.
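
The two distributions above suggest how positive and negative benchmark sets can be assembled. The following is a minimal sketch, not the authors' pipeline: the function name, the mutation identifiers, and the input dictionaries are hypothetical, and the thresholds (two or more samples, minor allele frequency of at least 0.25) are taken from this caption purely for illustration.

```python
def split_benchmark_sets(cosmic_counts, dbsnp_maf, min_samples=2, min_maf=0.25):
    """Build illustrative positive and negative sets.

    cosmic_counts: dict mapping a mutation id to the number of tumor
                   samples in which it was seen (hypothetical input).
    dbsnp_maf:     dict mapping a SNP id to its global minor allele
                   frequency (hypothetical input).
    """
    # Recurrent somatic mutations are treated as likely non-neutral.
    positives = sorted(m for m, n in cosmic_counts.items() if n >= min_samples)
    # Common polymorphisms are treated as likely neutral.
    negatives = sorted(s for s, f in dbsnp_maf.items() if f >= min_maf)
    return positives, negatives


# Toy data with made-up identifiers.
cosmic = {"GENE1:A10V": 1, "GENE2:R20H": 5, "GENE3:D30N": 2}
dbsnp = {"rsX1": 0.02, "rsX2": 0.31, "rsX3": 0.25}
pos, neg = split_benchmark_sets(cosmic, dbsnp)
print(pos, neg)
```

Keeping the thresholds as parameters makes it easy to see how set sizes change when the recurrence or frequency requirement is tightened.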

**Figure 3**
**SIFT predicts high frequency cancer mutations and low frequency SNPs to be more deleterious**. (A, B) The frequency of mutations in COSMIC correlates with the likelihood of being deleterious according to SIFT score (mutations that are predicted to be deleterious have low SIFT scores). (C, D) The minor allele frequency of dbSNP polymorphisms correlates with the likelihood of being benign according to SIFT score.

**Figure 4**
**Coverage of prediction**. MutationAssessor, PolyPhen-2, SIFT, Condel, SNAP, and CHASM scored most missense mutations. LogRE and mCluster predictions are restricted to alterations that occur in domain regions and so scored less than 80% of likely cancer drivers (A). Coverage of likely-neutral mutations (B) was broadly similar, but with even lower coverage for LogRE and mCluster due to the lower prevalence of neutral mutations in domains.

**Figure 5**
**Prediction accuracies compared between methods**. ROC curves for 8 predictors scored on COSMIC mutations and prevalent SNPs.
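
A ROC comparison of this kind can be summarized by the area under the curve (AUC). The sketch below is an illustration rather than the authors' evaluation code: it computes AUC through the standard rank interpretation (the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counted as one half). The function name and toy scores are assumptions, and scores are assumed to be oriented so that higher means more deleterious; a tool such as SIFT, where low scores indicate deleterious changes, would need its scores negated first.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC via pairwise comparison of positive and negative scores."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))


# Toy scores: recurrent mutations (positives) vs. common SNPs (negatives).
positives = [0.9, 0.8, 0.4]
negatives = [0.1, 0.3, 0.5]
print(roc_auc(positives, negatives))
```

The pairwise loop is quadratic in the set sizes, which is fine for a toy example; a sort-based rank computation would be preferable for large benchmark sets.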

**Figure 6**
**Proportion of true positives and true negatives above certain score thresholds and corresponding score distributions**. Cumulative distributions of true positives and true negatives above certain score cutoffs form the basis for the derivation of weights for our metapredictors. In many cases calculated optimal cutoffs (marked in green) were similar to recommendations from the developers of the tools (marked in red). Both the cumulative distributions and the associated score distributions varied widely between the methods. We transformed the raw scores of SNAP and MutationAssessor so that the minimum score is zero.
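
The score shift and cutoff search described in this caption can be sketched as follows. This is a hedged illustration only: the caption does not state how the optimal cutoff was defined, so the sketch assumes it is the threshold that maximizes overall accuracy, and the function names and toy scores are hypothetical.

```python
def shift_to_zero(scores):
    """Shift raw scores so that the minimum becomes zero."""
    lowest = min(scores)
    return [s - lowest for s in scores]


def best_cutoff(pos_scores, neg_scores):
    """Return (cutoff, accuracy) for the accuracy-maximizing threshold.

    A mutation is called deleterious when its score is >= cutoff.
    Assumption: 'optimal' means highest overall accuracy.
    """
    total = len(pos_scores) + len(neg_scores)
    best_c, best_acc = None, -1.0
    for c in sorted(set(pos_scores) | set(neg_scores)):
        true_pos = sum(s >= c for s in pos_scores)
        true_neg = sum(s < c for s in neg_scores)
        acc = (true_pos + true_neg) / total
        if acc > best_acc:       # keep the lowest cutoff on ties
            best_c, best_acc = c, acc
    return best_c, best_acc


print(shift_to_zero([-2.0, 0.5, 3.0]))
print(best_cutoff([0.9, 0.8, 0.4], [0.1, 0.3, 0.5]))
```

Only observed score values are tried as cutoffs, since accuracy can change only at those points.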

**Figure 7**
**Somatic mutation in CDKN2A predicted to be deleterious by MutationAssessor**. MutationAssessor predicted the somatic D125N mutation in the canonical tumor suppressor CDKN2A to be deleterious because the residue is conserved in mammalian orthologs (Subfamily 1). Other tools used a wider array of homologs, including fish orthologs, where the residue is in fact N, and so classified the mutation as benign. The protein structure (PDB id: 1BI7), rendered with PyMOL (Version 1.2r3pre, Schrödinger, LLC), shows CDKN2A (wheat color) in complex with CDK6 (blue). The majority of residues in CDKN2A are known to be implicated in cancer based on UniProt (http://www.uniprot.org) annotation (green). D125 is shown in orange.

**Figure 8**
**Prediction accuracies based on novel recurrent somatic mutations**. (A) ROC curves for recurrent mutations found in the TCGA set. (B) ROC curves for recurrent mutations observed in the COBR set.
