Am J Hum Genet. 2018 Oct 4;103(4):522-534.
doi: 10.1016/j.ajhg.2018.08.016. Epub 2018 Sep 27.

Burden Testing of Rare Variants Identified through Exome Sequencing via Publicly Available Control Data

Michael H Guo et al. Am J Hum Genet. 2018.

Abstract

The genetic causes of many Mendelian disorders remain undefined. Factors such as the lack of large multiplex families, locus heterogeneity, and incomplete penetrance hamper gene discovery for many disorders. Previous work suggests that gene-based burden testing, in which the aggregate burden of rare, protein-altering variants in each gene is compared between case and control subjects, might overcome some of these limitations. The increasing availability of large-scale public sequencing databases such as the Genome Aggregation Database (gnomAD) can enable burden testing using these databases as controls, obviating the need for additional control sequencing for each study. However, using public databases as controls poses several challenges, including lack of individual-level data, differences in ancestry, and differences in sequencing platforms and data processing. To illustrate the approach, we analyzed whole-exome sequencing data from 393 individuals with idiopathic hypogonadotropic hypogonadism (IHH), a rare disorder with significant locus heterogeneity and incomplete penetrance, against control subjects from gnomAD (n = 123,136). We leveraged presumably benign synonymous variants to calibrate our approach. Through iterative analyses, we systematically addressed and overcame various sources of artifact that can arise when using public control data. In particular, we introduce an approach for highly adaptable variant quality filtering that leads to well-calibrated results. Our approach "re-discovered" genes previously implicated in IHH (FGFR1, TACR3, GNRHR). Furthermore, we identified a significant burden in TYRO3, a gene implicated in hypogonadotropic hypogonadism in mice. Finally, we developed a user-friendly software package, TRAPD (Test Rare vAriants with Public Data), for performing gene-based burden testing against public databases.
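
The per-gene comparison described in the abstract can be illustrated with a small sketch. The abstract does not name the statistical test, so the one-sided Fisher's exact test below is an assumption, and the function name `burden_p_value` and the carrier counts in the usage lines are hypothetical; only the cohort sizes (393 and 123,136) come from the text.

```python
from math import comb


def burden_p_value(case_carriers, n_cases, control_carriers, n_controls):
    """One-sided Fisher's exact p value for an excess of carriers among cases.

    Assumption: the burden test is a one-sided Fisher's exact test on a
    2x2 table of carriers vs. non-carriers in cases vs. controls.
    """
    carriers = case_carriers + control_carriers
    total = n_cases + n_controls
    # Hypergeometric tail: probability of seeing at least `case_carriers`
    # carriers among the cases if carriers were distributed at random.
    upper = min(carriers, n_cases)
    numerator = sum(
        comb(carriers, k) * comb(total - carriers, n_cases - k)
        for k in range(case_carriers, upper + 1)
    )
    return numerator / comb(total, n_cases)


# Hypothetical carrier counts for a single gene (not taken from the study).
p = burden_p_value(case_carriers=5, n_cases=393,
                   control_carriers=100, n_controls=123136)
print(f"one-sided p value: {p:.3g}")
```

Exact integer arithmetic keeps the tail probability accurate even for the very small p values expected when a few hundred cases are compared against more than 100,000 controls.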

Keywords: TRAPD; gene-based burden analysis; hypogonadotropic hypogonadism.

Figures

Figure 1
Burden Testing Scheme. Case cohort sequencing (IHH) and control database sequencing (gnomAD) data are processed separately, and burden testing is performed in the final step. For each set of data, sequencing quality filters, predicted variant pathogenicity filters, and sample filters (e.g., ancestry) can be applied. Then, counts of qualifying variant carriers for each gene in the case and control subjects are generated. Finally, burden testing is performed.
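
The carrier-counting step in this scheme might look like the sketch below. The input layouts and both function names are hypothetical. Because public databases lack individual-level data, the control count here is approximated by summing allele counts per gene, which assumes each control carries at most one qualifying allele; that approximation is an assumption of this sketch, not a method stated in the caption.

```python
from collections import defaultdict


def case_carrier_counts(qualifying_calls):
    """Count distinct carriers per gene from individual-level case data.

    qualifying_calls: iterable of (gene, sample_id) pairs, one per
    qualifying variant call (hypothetical input layout).
    """
    carriers = defaultdict(set)
    for gene, sample_id in qualifying_calls:
        carriers[gene].add(sample_id)  # a sample counts once per gene
    return {gene: len(samples) for gene, samples in carriers.items()}


def control_carrier_counts(qualifying_sites, n_controls):
    """Approximate carriers per gene from summary-level control data.

    qualifying_sites: iterable of (gene, allele_count) pairs.
    Assumption: each carrier has one qualifying allele, so allele counts
    are summed and capped at the number of control subjects.
    """
    totals = defaultdict(int)
    for gene, allele_count in qualifying_sites:
        totals[gene] += allele_count
    return {gene: min(count, n_controls) for gene, count in totals.items()}


# Hypothetical toy input: sample s1 carries two qualifying FGFR1 variants.
cases = case_carrier_counts(
    [("FGFR1", "s1"), ("FGFR1", "s1"), ("FGFR1", "s2"), ("TACR3", "s3")]
)
controls = control_carrier_counts([("FGFR1", 3), ("FGFR1", 2)], n_controls=4)
```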
Figure 2
Effect of Coverage on Distribution of Synonymous Variants. (A) Quantile-quantile (QQ) plot of initial burden testing results using synonymous SNVs. Synonymous variants were used as they are likely mostly benign and can be used to test the null distribution. The x axis represents the expected –log10(p value) under the uniform distribution of p values. The y axis shows the observed –log10(p value) from the burden testing data. Each point is a single gene. Red dots represent the 35 genes previously implicated in IHH, while black dots represent the remaining genes in the genome. The black solid line shows the relationship between expected and observed p values under the uniform p value distribution. The dotted blue line shows the observed fit line between the 50th and 95th percentile of genes; the slope of this line is λΔ95. (B) Coverage at HRNR in case sequencing data and gnomAD control database. Exons are shown in yellow boxes below the plot, with wider boxes representing coding regions and narrower boxes representing UTRs. Introns (not drawn to scale) are shown as connecting lines between exons. Red dots represent coverage (as proportion of individuals with read depth >10×) in case cohort sequencing, while blue dots represent coverage in gnomAD control database. Each dot represents a single base. The dashed line represents the threshold for 90% of samples having sequencing read depth >10×. (C) Repeat QQ plot from (A), except considering only bases for which more than 90% of samples had sequencing read depth >10× in both gnomAD and case sequencing data.
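
As a rough illustration of the λΔ95 statistic defined in panel (A), the sketch below fits a least-squares line to the QQ points that fall between the 50th and 95th percentile of genes and returns its slope. The caption does not spell out the expected-quantile formula or the fitting method, so the i/(n+1) plotting positions, the ordinary least-squares fit with intercept, and the function name `lambda_delta_95` are all assumptions.

```python
from math import log10


def lambda_delta_95(p_values, lower=0.50, upper=0.95):
    """Slope of the QQ fit line between the `lower` and `upper` percentile of genes.

    Assumptions: genes are ranked from least to most significant, expected
    p values are rank / (n + 1), and the line is fit by ordinary least
    squares. All p values must be greater than zero.
    """
    n = len(p_values)
    ordered = sorted(p_values, reverse=True)  # least significant gene first
    xs, ys = [], []
    for i, p in enumerate(ordered):
        percentile = (i + 1) / n
        if lower < percentile <= upper:
            xs.append(-log10((n - i) / (n + 1)))  # expected under uniform
            ys.append(-log10(p))                  # observed
    if len(xs) < 2:
        raise ValueError("need at least two genes in the percentile window")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Perfectly uniform p values give a slope of 1; squaring them doubles every
# observed -log10(p) and therefore doubles the slope.
uniform = [i / 101 for i in range(1, 101)]
calibrated = lambda_delta_95(uniform)
inflated = lambda_delta_95([p ** 2 for p in uniform])
```

A slope near 1 indicates a well-calibrated test on synonymous variants, while larger values indicate inflation; excluding the top 5% of genes keeps true signals from dominating the estimate.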
Figure 3
Effect of Variant Quality Filters on Distribution of Synonymous Variants. (A) Effect of adding pass/fail filters for variant quality. QQ plot of burden testing results following filtering for sites that passed GATK quality filters in the case and control sequencing data. (B) Burden testing using QD scores to filter for sites. Only the top 95% of sites in gnomAD based on QD scores and the top 85% of sites in the case cohort sequencing based on QD scores are used. Only sites where more than 90% of samples had sequencing read depth >10× in both gnomAD and the case cohort sequencing were considered (same as Figure 2B). QQ plots show burden testing results for synonymous variants.
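
The percentile-based QD filter in panel (B) can be sketched as follows. This is a minimal illustration under assumptions: the `(site_id, qd)` input layout and the function name are hypothetical, the toy QD values are invented, and ties at the cutoff are broken arbitrarily by sort order.

```python
def keep_top_fraction_by_qd(sites, fraction):
    """Return the IDs of the top `fraction` of sites ranked by QD score.

    sites: list of (site_id, qd) pairs (hypothetical input layout).
    fraction: e.g. 0.95 for the control database or 0.85 for the case
    cohort, the values quoted in the caption.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(sites, key=lambda site: site[1], reverse=True)
    n_keep = int(len(ranked) * fraction)  # rounds down
    return {site_id for site_id, _ in ranked[:n_keep]}


# Hypothetical toy data: ten sites with QD scores 1.0 through 10.0.
toy_sites = [(f"site{i}", float(i)) for i in range(1, 11)]
kept = keep_top_fraction_by_qd(toy_sites, 0.5)
```

Ranking sites within each dataset, rather than applying one absolute QD cutoff to both, lets the stringency be tuned separately for the case cohort and the public control database.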
Figure 4
Selection of Damaging Variants to Improve the Power of Rare Variant Burden Testing. (A) Burden testing using all protein-altering variants. (B) Distribution of PolyPhen2 (PP2), SIFT, and CADD scores among missense variants observed in IHH-affected case subjects as compared to gnomAD. (C) Burden testing using only PTVs (essential splice site, frameshift, and nonsense) and missense variants computationally predicted to be damaging. (D) Burden testing using only PTVs. For (A), (C), and (D), the same filters for coverage as in Figure 2B and variant quality as in Figure 3B were applied.
Figure 5
Addition of Indels to Rare Variant Burden Testing. For case cohort sequencing, SNVs in the top 85% of QD scores and indels in the top 75% were considered. For gnomAD, SNVs in the top 95% of QD scores and indels in the top 85% were considered. QQ plots show burden testing using all nonsynonymous variants (A), PTVs (splice site, frameshift, and nonsense) plus missense variants computationally predicted to be damaging (B), or PTVs only (C).

