Genet Epidemiol. 2013 Sep;37(6):622-34.
doi: 10.1002/gepi.21743. Epub 2013 Jul 8.

VAAST 2.0: improved variant classification and disease-gene identification using a conservation-controlled amino acid substitution matrix

Free PMC article

Hao Hu et al. Genet Epidemiol. 2013 Sep.

Abstract

The need for improved algorithmic support for variant prioritization and disease-gene identification in personal genome data is widely acknowledged. We previously presented the Variant Annotation, Analysis, and Search Tool (VAAST), which employs an aggregative variant association test that combines both amino acid substitution (AAS) and allele frequencies. Here we describe and benchmark VAAST 2.0, which uses a novel conservation-controlled AAS matrix (CASM) to incorporate information about phylogenetic conservation. We show that the CASM approach improves VAAST's variant prioritization accuracy compared to its previous implementation, and compared to SIFT, PolyPhen-2, and MutationTaster. We also show that VAAST 2.0 outperforms KBAC, WSS, SKAT, and variable threshold (VT) using published case-control datasets for Crohn disease (NOD2), hypertriglyceridemia (LPL), and breast cancer (CHEK2). VAAST 2.0 also improves search accuracy on simulated datasets across a wide range of allele frequencies, population-attributable disease risks, and allelic heterogeneity, factors that compromise the accuracies of other aggregative variant association tests. We also demonstrate that, although most aggregative variant association tests are designed for common genetic diseases, these tests can be easily adapted as rare Mendelian disease-gene finders with a simple ranking-by-statistical-significance protocol, and that their performance compares very favorably to state-of-the-art filtering approaches. The latter, despite their popularity, perform suboptimally, especially as the case sample size increases.

Keywords: aggregative association test; complex disease; disease-gene finder; rare Mendelian disease; variant classifier.
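
The ranking-by-statistical-significance protocol mentioned in the abstract can be illustrated with a minimal sketch. It assumes that some aggregative association test has already produced one p-value per gene; the gene names, the p-values, and the helper functions `rank_genes_by_significance` and `rank_of` are hypothetical and are not taken from VAAST or from the paper.

```python
from typing import Dict, List, Tuple


def rank_genes_by_significance(p_values: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (gene, p-value) pairs sorted from most to least significant.

    Ties are broken alphabetically by gene name so the order is deterministic.
    """
    return sorted(p_values.items(), key=lambda item: (item[1], item[0]))


def rank_of(gene: str, ranking: List[Tuple[str, float]]) -> int:
    """Return the 1-based genome-wide rank of `gene` in the sorted list."""
    for position, (name, _) in enumerate(ranking, start=1):
        if name == gene:
            return position
    raise KeyError(gene)


# Hypothetical per-gene p-values, made up for illustration only.
example_p_values = {"GENE_A": 0.20, "GENE_B": 1e-6, "GENE_C": 0.03, "GENE_D": 0.03}
ranking = rank_genes_by_significance(example_p_values)
print(ranking[0][0], rank_of("GENE_C", ranking))
```

In this toy setting, a disease-gene search would count as successful when the known causal gene lands near the top of the list, for example among the top 10 candidates genome-wide.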

Figures

Figure 1
Receiver operating characteristic (ROC) curves for the variant prioritization tools. Shown are ROC curves for VAAST 1.0, VAAST 2.0, CASM, SIFT, PolyPhen-2, and MutationTaster, using two benchmark datasets: (A) common and rare variants from HGMD and the 1000 Genomes Project; (B) a BRCA1 and BRCA2 rare-variant set. x-axis: false-positive rate; y-axis: true-positive rate. The dashed line denotes a false-positive rate of 0.05.
Figure 2
Power comparisons over three published common disease datasets. (A) NOD2, (B) LPL, (C) CHEK2. The x-axis shows the number of case genomes and the y-axis shows the statistical power. The power is calculated based on 100 bootstraps.
Figure 3
Impact of PAR. Shown is the power of six association tests under different total population-attributable risk (PAR) levels. The x-axis shows the total PAR from all contributing variants; the y-axis shows the statistical power based on 100 bootstraps. (A) Dominant model, (B) recessive model. The numbers of cases and controls are set at 1,000, with the numbers of disease-causal and noncausal alleles both fixed at 50.
Figure 4
Impact of different proportions of deleterious mutation sites contributing to the disease risk. The x-axis is the proportion of deleterious mutation sites among all simulated sites; the y-axis is the statistical power. (A) Dominant model; (B) recessive model. Total PAR is fixed at 10%; the numbers of cases and controls are set at 500; the number of causal variants is 50, with a varying number of noncausal variants.
Figure 5
Impact of differing numbers of deleterious mutation sites. The x-axis is the number of deleterious mutation sites (ND); the y-axis shows the statistical power based on 100 bootstraps. (A) Dominant model, (B) recessive model. The numbers of cases and controls are set at 500, and the total PAR is set at 10%.
Figure 6
Rankings for 100 different genome-wide searches for known rare disease genes. Panels (A) and (B) show dominant and recessive models, respectively. The different colors denote the proportion of the 100 OMIM "target" genes falling into four bins based upon genome-wide rank (see inset legend), with orange denoting the percentage of cases for which the disease gene was ranked among the top 10 candidates genome-wide. Dominant and recessive disease scenarios are investigated separately. To model the dominant diseases, one causal variant was inserted into the gene of interest; in the recessive cases, two different alleles were inserted (per case genome). For each algorithm, three columns are shown, corresponding to one, two, and three individuals as cases.
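
Several captions above report statistical power "based on 100 bootstraps". The exact resampling procedure is not given on this page, so the following is only a rough sketch of one common way to estimate power by bootstrapping. It assumes that both cases and controls are resampled with replacement and that power is the fraction of replicates with a p-value below a threshold `alpha`. The stand-in test `carrier_z_test` is a simple carrier-frequency comparison chosen for brevity; it is not VAAST, KBAC, WSS, SKAT, or VT. All function names, the 0/1 carrier data, and `alpha=0.05` are hypothetical.

```python
import math
import random
from typing import Callable, Sequence


def carrier_z_test(cases: Sequence[int], controls: Sequence[int]) -> float:
    """One-sided two-proportion z-test p-value for a carrier excess in cases.

    Each element is 1 if the individual carries a rare variant in the gene, else 0.
    This is a stand-in test for illustration, not one of the tests in the paper.
    """
    n1, n2 = len(cases), len(controls)
    p1, p2 = sum(cases) / n1, sum(controls) / n2
    pooled = (sum(cases) + sum(controls)) / (n1 + n2)
    variance = pooled * (1 - pooled) * (1 / n1 + 1 / n2)
    if variance == 0:
        return 1.0  # no variation in the data, so no evidence of association
    z = (p1 - p2) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


def bootstrap_power(
    cases: Sequence[int],
    controls: Sequence[int],
    n_cases: int,
    test: Callable[[Sequence[int], Sequence[int]], float],
    n_boot: int = 100,
    alpha: float = 0.05,  # assumed significance threshold
    seed: int = 0,
) -> float:
    """Estimate power as the fraction of bootstrap replicates with p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_boot):
        sampled_cases = rng.choices(cases, k=n_cases)
        sampled_controls = rng.choices(controls, k=len(controls))
        if test(sampled_cases, sampled_controls) < alpha:
            hits += 1
    return hits / n_boot


# Hypothetical data: 30% carriers among 200 cases, 5% among 200 controls.
toy_cases = [1] * 60 + [0] * 140
toy_controls = [1] * 10 + [0] * 190
strong_power = bootstrap_power(toy_cases, toy_controls, n_cases=100, test=carrier_z_test)
null_power = bootstrap_power(toy_controls, toy_controls, n_cases=100, test=carrier_z_test)
print(strong_power, null_power)
```

Varying `n_cases` and recomputing the estimate would trace out a power curve of the kind plotted against the number of case genomes in Figure 2.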
