Hum Mutat. 2015 May;36(5):513-23.
doi: 10.1002/humu.22768. Epub 2015 Mar 26.

The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity

Dominik G Grimm et al. Hum Mutat. 2015 May.

Abstract

Prioritizing missense variants for further experimental investigation is a key challenge in current sequencing studies for exploring complex and Mendelian diseases. A large number of in silico tools have been employed for the task of pathogenicity prediction, including PolyPhen-2, SIFT, FatHMM, MutationTaster-2, MutationAssessor, Combined Annotation Dependent Depletion, LRT, phyloP, and GERP++, as well as optimized methods of combining tool scores, such as Condel and Logit. Given the wealth of these methods, an important practical question is which of these tools generalize best, that is, correctly predict the pathogenic character of new variants. Here we demonstrate, in a study of 10 tools on five datasets, that such a comparative evaluation is hindered by two types of circularity. They arise when (1) the same variants or (2) different variants from the same protein occur both in the datasets used for training and in those used for evaluating these tools, which may lead to overly optimistic results. We show that comparative evaluations of predictors that do not address these types of circularity may erroneously conclude that circularity-confounded tools are the most accurate among all tools, and that they may even outperform optimized combinations of tools.
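The two types of circularity described in the abstract can be illustrated with a small sketch. It is not the authors' code: the `Variant` record, its field names, and the `circularity_overlap` function are hypothetical, and a variant is assumed to be identified by a protein identifier plus an amino-acid change label. The sketch splits an evaluation set into variants that also appear in the training set (type 1), variants that are new but come from a protein seen in training (type 2), and variants from proteins absent from training.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    # Hypothetical record layout, chosen only for illustration.
    protein: str  # protein identifier
    change: str   # amino-acid substitution label, e.g. "A10V"


def circularity_overlap(train, evaluation):
    """Partition evaluation variants by their overlap with the training set.

    Returns three lists:
      type1 - variants that also occur in the training set,
      type2 - new variants whose protein occurs in the training set,
      clean - variants from proteins not seen in training.
    """
    train_variants = set(train)
    train_proteins = {v.protein for v in train}

    type1, type2, clean = [], [], []
    for v in evaluation:
        if v in train_variants:
            type1.append(v)
        elif v.protein in train_proteins:
            type2.append(v)
        else:
            clean.append(v)
    return type1, type2, clean


if __name__ == "__main__":
    train = [Variant("P1", "A10V"), Variant("P2", "G5D")]
    evaluation = [
        Variant("P1", "A10V"),  # same variant as in training
        Variant("P1", "L20P"),  # different variant, same protein
        Variant("P3", "K7E"),   # protein not seen in training
    ]
    t1, t2, ok = circularity_overlap(train, evaluation)
    print(len(t1), len(t2), len(ok))
```

Under these assumptions, only the third list is free of both kinds of overlap. Whether restricting an evaluation to it is appropriate depends on the dataset and on how each tool was trained.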

Keywords: exome sequencing; pathogenicity prediction tools.

Figures

Figure 1
Evaluation of the 10 different pathogenicity prediction tools (by AUC) over five datasets. The hatched bars indicate potentially biased results, due to the overlap (or possible overlap) between the evaluation data and the data used (by tool developers) for training the prediction tool. The dotted bars indicate that the tool is biased due to type 2 circularity. The protein MV predictor and the logistic regression (over the features used in the weighting scheme of FatHMM‐W) are discussed in the second part of the Results section.
Figure 2
In the VariBenchSelected dataset, most SNPs are in genes with only neutral or only pathogenic variants. A: Protein perspective: proportion of proteins containing only neutral variants (“neutral‐only”), only pathogenic variants (“pathogenic‐only”), and both types of variants (“mixed”). Only 1.4% of the proteins are mixed. B: Variant perspective: proportions of variants in each of the three categories of proteins. Only 5.2% of variants are in mixed proteins. C: Fractions of variants in the VariBenchSelected dataset that belong to proteins with various ratios of pathogenic‐to‐neutral variants, binned into increasingly narrow bins, approaching balanced proteins. The open interval ]0.0, 1.0[ contains all mixed proteins (as in B). Only 0.7% of all variants belong to almost perfectly balanced proteins (closed interval [0.4, 0.6]).
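The protein categories in this caption can be computed along the following lines. This is a rough sketch, not the authors' procedure: the function name `protein_categories` and the input layout (pairs of protein identifier and a pathogenic flag) are hypothetical. The "ratio" is assumed to be the fraction of a protein's variants that are pathogenic, which is one reading consistent with the interval ]0.0, 1.0[ covering all mixed proteins.

```python
from collections import Counter, defaultdict


def protein_categories(labelled_variants):
    """Classify each protein as neutral-only, pathogenic-only, or mixed.

    labelled_variants: iterable of (protein_id, is_pathogenic) pairs
    (a hypothetical input layout used only for this sketch).
    Returns {protein_id: (category, pathogenic_fraction)}.
    """
    counts = defaultdict(Counter)
    for protein, is_pathogenic in labelled_variants:
        counts[protein][bool(is_pathogenic)] += 1

    result = {}
    for protein, c in counts.items():
        # Assumed definition: pathogenic / (pathogenic + neutral).
        fraction = c[True] / (c[True] + c[False])
        if fraction == 0.0:
            category = "neutral-only"
        elif fraction == 1.0:
            category = "pathogenic-only"
        else:
            category = "mixed"
        result[protein] = (category, fraction)
    return result


if __name__ == "__main__":
    data = [
        ("P1", True), ("P1", True),   # only pathogenic variants
        ("P2", False),                # only neutral variants
        ("P3", True), ("P3", False),  # both classes
    ]
    for protein, (category, fraction) in protein_categories(data).items():
        print(protein, category, fraction)
```

Counting proteins per category would correspond to the protein perspective (panel A), and counting variants per category to the variant perspective (panel B).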
Figure 3
Performance of 10 pathogenicity prediction tools according to protein pathogenic‐to‐neutral variant ratio. Evaluation of tool performance on subsets of VariBenchSelected, predictSNPSelected, and SwissVarSelected, defined according to the relative proportions of pathogenic and neutral variants in the proteins they contain. “Pure” indicates variants belonging to proteins containing only one class of variant. (x and y) indicates variants belonging to mixed proteins, containing a ratio of pathogenic‐to‐neutral variants between x and y. ]0.0, 1.0[ therefore indicates all mixed proteins (the ratios of 0.0 and 1.0 being excluded by the reversed brackets). While FatHMM‐W performs well or excellently on variants belonging to pure proteins (VariBenchSelected and predictSNPSelected), it performs poorly on those belonging to mixed proteins.
Figure 4
Comparison of the performance of two metapredictors (Logit and Condel) and their component tools, across five datasets. Bar heights reflect AUC for each tool and tool combination. Logit and Condel are metapredictors combining MASS, PP2, and SIFT. The “+” versions of Logit and Condel also include FatHMM‐W. While effective in prediction, FatHMM‐W (alone and in the Logit+ and Condel+ metapredictors) is optimistically biased due to type 2 circularity (see Results section). In the “Selected” datasets, Logit provides the best unbiased performance. SIFT has the lowest performance in the HumVar and ExoVar datasets, but it is also the only predictor that is unbiased in these two datasets.
