Genet Epidemiol. 2013 Sep;37(6):622-34.

doi: 10.1002/gepi.21743. Epub 2013 Jul 8.

VAAST 2.0: improved variant classification and disease-gene identification using a conservation-controlled amino acid substitution matrix

Hao Hu¹, Chad D Huff, Barry Moore, Steven Flygare, Martin G Reese, Mark Yandell

PMID: 23836555
PMCID: PMC3791556
DOI: 10.1002/gepi.21743

Free PMC article

Hao Hu et al. Genet Epidemiol. 2013 Sep.

Affiliation

¹ Department of Epidemiology, The University of Texas M.D. Anderson Cancer Center, Houston, Texas, USA.

Abstract

The need for improved algorithmic support for variant prioritization and disease-gene identification in personal genome data is widely acknowledged. We previously presented the Variant Annotation, Analysis, and Search Tool (VAAST), which employs an aggregative variant association test that combines both amino acid substitution (AAS) and allele frequencies. Here we describe and benchmark VAAST 2.0, which uses a novel conservation-controlled AAS matrix (CASM) to incorporate information about phylogenetic conservation. We show that the CASM approach improves VAAST's variant prioritization accuracy compared to its previous implementation, and compared to SIFT, PolyPhen-2, and MutationTaster. We also show that VAAST 2.0 outperforms KBAC, WSS, SKAT, and variable threshold (VT) using published case-control datasets for Crohn disease (NOD2), hypertriglyceridemia (LPL), and breast cancer (CHEK2). VAAST 2.0 also improves search accuracy on simulated datasets across a wide range of allele frequencies, population-attributable disease risks, and allelic heterogeneity, factors that compromise the accuracies of other aggregative variant association tests. We also demonstrate that, although most aggregative variant association tests are designed for common genetic diseases, these tests can be easily adapted as rare Mendelian disease-gene finders with a simple ranking-by-statistical-significance protocol, and that their performance compares very favorably to state-of-the-art filtering approaches. The latter, despite their popularity, perform suboptimally, especially as the case sample size increases.
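
As a rough illustration of the ranking-by-statistical-significance protocol mentioned in the abstract, the following Python sketch orders genes by ascending p-value and reports where a gene of interest lands. It assumes that per-gene p-values from some aggregative association test are already available. The gene names, p-values, and function names are hypothetical and are not taken from VAAST.

```python
from typing import Dict, List, Tuple


def rank_genes_by_significance(p_values: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Return (rank, gene, p_value) tuples, most significant gene first.

    Ties in p-value are broken alphabetically by gene name so that the
    ordering is deterministic (an arbitrary choice made for this sketch).
    """
    ordered = sorted(p_values.items(), key=lambda item: (item[1], item[0]))
    return [(rank, gene, p) for rank, (gene, p) in enumerate(ordered, start=1)]


def rank_of(target: str, p_values: Dict[str, float]) -> int:
    """Return the 1-based genome-wide rank of `target`."""
    for rank, gene, _ in rank_genes_by_significance(p_values):
        if gene == target:
            return rank
    raise KeyError(f"{target} has no p-value")


# Hypothetical per-gene p-values, for illustration only.
example_p_values = {"GENE_A": 0.20, "GENE_B": 1e-6, "GENE_C": 0.003}

if __name__ == "__main__":
    for rank, gene, p in rank_genes_by_significance(example_p_values):
        print(rank, gene, p)
```

A candidate disease gene would then be judged by its rank, for example whether it falls among the top 10 genes genome-wide.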

Keywords: aggregative association test; complex disease; disease-gene finder; rare Mendelian disease; variant classifier.

Figures

**Figure 1**
Receiver operating characteristic (ROC) curves for the variant prioritization tools. Shown are ROC curves for VAAST 1.0, VAAST 2.0, CASM, SIFT, PolyPhen-2, and MutationTaster, using two benchmark datasets: (A) common and rare variants from HGMD and the 1000 Genomes Project; (B) a *BRCA1* and *BRCA2* rare variant set. x-axis: false-positive rate; y-axis: true-positive rate. The dashed line denotes a false-positive rate of 0.05.

**Figure 2**
Power comparisons over three published common disease datasets. (A) *NOD2*, (B) *LPL*, (C) *CHEK2*. The x-axis shows the number of case genomes and the y-axis shows the statistical power. The power is calculated based on 100 bootstraps.
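
Power estimated from bootstraps can be read as the fraction of bootstrap replicates in which a test reaches significance. The following is a minimal sketch, assuming each replicate yields one p-value and that a fixed significance threshold `alpha` is used. The threshold value and the function name are assumptions and are not stated in the caption.

```python
from typing import Sequence


def estimate_power(replicate_p_values: Sequence[float], alpha: float = 0.05) -> float:
    """Fraction of bootstrap replicates whose p-value falls below `alpha`.

    `alpha` is an assumed significance threshold; the caption does not
    say which threshold was used.
    """
    if len(replicate_p_values) == 0:
        raise ValueError("at least one bootstrap replicate is required")
    significant = sum(1 for p in replicate_p_values if p < alpha)
    return significant / len(replicate_p_values)


if __name__ == "__main__":
    # Hypothetical p-values from four bootstrap replicates.
    print(estimate_power([0.01, 0.2, 0.03, 0.5]))
```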

**Figure 3**
Impact of PAR. Shown is the power of six association tests under different total population-attributable risk (PAR) levels. The x-axis shows the total PAR from all contributing variants; the y-axis shows the statistical power based on 100 bootstraps. (A) Dominant model, (B) recessive model. The numbers of cases and controls are set at 1,000, with the numbers of disease-causal alleles and noncausal alleles both fixed at 50.

**Figure 4**
Impact of different proportions of deleterious mutation sites contributing to the disease risk. The x-axis is the proportion of deleterious mutation sites among all simulated sites; the y-axis is the statistical power. (A) Dominant model; (B) recessive model. Total PAR is fixed at 10%; the numbers of cases and controls are set at 500; the number of causal variants is 50, with a varying number of noncausal variants.

**Figure 5**
Impact of differing numbers of deleterious mutation sites. The x-axis is the number of deleterious mutation sites (ND); the y-axis shows the statistical power based on 100 bootstraps. (A) Dominant model, (B) recessive model. The numbers of cases and controls are set at 500, and the total PAR is set at 10%.

**Figure 6**
Rankings for 100 different genome-wide searches for known rare disease genes. Panels (A) and (B) show dominant and recessive models, respectively. The different colors denote the proportion of the 100 OMIM "target" genes falling into four bins based upon genome-wide rank (see inset legend), with orange denoting the percentage of cases for which the disease gene was ranked among the top 10 candidates genome-wide. Dominant and recessive disease scenarios are investigated separately. To model the dominant diseases, one causal variant was inserted into the gene of interest; in the recessive cases, two different alleles were inserted (per case genome). For each algorithm, three columns are shown, corresponding to one, two, and three individuals as cases.
