BMC Bioinformatics. 2017 Dec 1;18(1):535.
doi: 10.1186/s12859-017-1951-y.

Using variant databases for variant prioritization and to detect erroneous genotype-phenotype associations

Bart J G Broeckx et al. BMC Bioinformatics.

Abstract

Background: In the search for novel causal mutations, public and/or private variant databases are nearly always used because they massively reduce the number of putative variants in a single step. In practice, variant filtering is often done either by using all variants from the variant database (the absence-approach, which assumes that disease-causing variants do not reside in variant databases) or by using the subset of variants with an allelic frequency > 1% (the 1%-approach). We investigate the validity of these two approaches in terms of false negatives (the true disease-causing variant does not pass all filters) and false positives (a harmless mutation passes all filters and is erroneously retained in the list of putative disease-causing variants) and compare them with a novel approach, which we named the quantile-based approach. This approach applies variable instead of static frequency thresholds, and these thresholds are calculated from prior knowledge of disease prevalence, inheritance models, database size and database characteristics.

Results: Based on real-life data, we demonstrate that the quantile-based approach outperforms the absence-approach in terms of false negatives. At the same time, the quantile-based approach deals more appropriately than the 1%-approach with the variable allele frequencies of disease-causing alleles in variant databases, and as such allows better control of the number of false positives. We also introduce an alternative application for variant database usage and the quantile-based approach: variant flagging. When disease-causing variants in variant databases deviated substantially from the theoretical expectations calculated with the quantile-based approach, the association between genotype and phenotype had to be reconsidered in 12 out of 13 cases.

Conclusions: We developed a novel method and demonstrated that this so-called quantile-based approach is a highly suitable method for variant filtering. In addition, the quantile-based approach can also be used for variant flagging. For user friendliness, lookup tables and easy-to-use R calculators are provided.

Keywords: 1000 Genomes project variant database; Allele frequency; HapMap; Variant database; Variant filtering; dbSNP.

Conflict of interest statement

Ethics approval

Not applicable

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Figures

Fig. 1
An overview of variant filtering (method 1) and variant flagging (method 2).
A. Method 1: In a sequencing study, a hypothetical list of 7 variants was discovered, with variant 4 being the causal variant and the other ones harmless co-inherited mutations. Inside the variant database, 5 out of 7 variants discovered during sequencing (including variant 4) are already represented with varying allele frequencies f (allele frequency db-column). Three different approaches for variant filtering can be used. Candidate variants that are filtered out are denoted with an X. Candidate variants that are retained after filtering are denoted with a ✓. By assuming absence of disease-causing variants from variant databases (absence-approach), the disease-causing variant was erroneously filtered out. The same issue was encountered by using a static 1% threshold. The quantile-based approach was used to calculate a suitable allelic frequency threshold Tv. Based on the disease prevalence Pd of 1 in 10,000 individuals and an autosomal recessive mode of inheritance, the population allele frequency q is 0.01. For a variant database of 50 individuals (= 100 chromosomes, situation a), the Tv associated with the 95th quantile equals 0.03 (3/100). While the allele frequency f of the disease-causing variant in the variant database (= 0.02) is slightly higher than the theoretically expected population allele frequency (= 0.01) due to sampling variability, the Tv cut-off (0.03) has made it possible to discover the true disease-causing variant, while this was not the case for the other two approaches.
B. Method 2: This analysis determines how likely it is that a disease-causing variant (variant 4) occurs at least twice in a variant database of 50 individuals (= 100 chromosomes, situation a), given Pd equals 1 in 10,000 and an autosomal mode of inheritance. Based on the binomial distribution, this probability equals 0.26. As such, there is insufficient evidence to conclude that this model is inappropriate.
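The numbers in the Fig. 1 caption can be reproduced with a short calculation. The Python sketch below is not the authors' R calculator. It is a minimal illustration that assumes Hardy-Weinberg proportions with a single causal allele (so that q is the square root of the prevalence for an autosomal recessive disorder) and a binomial model for the allele count in the database. The function names are hypothetical and chosen for this example only.

```python
from math import comb, sqrt


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n binomial trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_quantile(level, n, p):
    """Smallest count k such that P(X <= k) >= level."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += binom_pmf(k, n, p)
        if cumulative >= level:
            return k
    return n


def recessive_allele_frequency(prevalence):
    # Assumption: Hardy-Weinberg proportions and one causal allele,
    # so prevalence = q**2 for an autosomal recessive disorder.
    return sqrt(prevalence)


def quantile_threshold(prevalence, n_chromosomes, level=0.95):
    """Allele frequency threshold Tv for variant filtering (method 1)."""
    q = recessive_allele_frequency(prevalence)
    return binom_quantile(level, n_chromosomes, q) / n_chromosomes


def prob_at_least(k_observed, prevalence, n_chromosomes):
    """Probability of seeing the allele at least k_observed times (method 2)."""
    q = recessive_allele_frequency(prevalence)
    below = sum(binom_pmf(k, n_chromosomes, q) for k in range(k_observed))
    return 1.0 - below


if __name__ == "__main__":
    prevalence = 1 / 10_000   # Pd from the caption
    n_chromosomes = 100       # 50 diploid individuals
    print(quantile_threshold(prevalence, n_chromosomes))           # Tv
    print(round(prob_at_least(2, prevalence, n_chromosomes), 2))   # method 2
```

Under these assumptions the threshold comes out at 3/100 = 0.03 and the probability of observing the allele at least twice rounds to 0.26, matching the caption. A variant with f = 0.02 therefore stays below Tv and is retained as a candidate.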
Fig. 2
Actual allelic frequencies f of the disease-causing mutations for 30 autosomal recessive disorders. For a total of 1169 disease-causing mutations, the allelic frequency f was plotted, relative to the static 1% threshold and the variable quantile-based thresholds. For all variants, it was indicated whether they were correctly classified. Disease prevalence is expressed as 1/n (with n ranging from 0 to 1 000 000)
Fig. 3
Relation between disease prevalence and proportion of the variant database available for filtering. The proposed mode of inheritance is autosomal recessive, the disease prevalence is expressed as 1/n (with n ranging from 1000 to 100,000). Both the variable quantile-based approach and the static 1%-approach are depicted. By definition, for the absence-approach all variants (100%) are available (not shown)
