Nat Genet. 2012 May 20;44(6):631-5.
doi: 10.1038/ng.2283.

Extremely low-coverage sequencing and imputation increases power for genome-wide association studies

Bogdan Pasaniuc et al. Nat Genet. 2012.

Abstract

Genome-wide association studies (GWAS) have proven to be a powerful method to identify common genetic variants contributing to susceptibility to common diseases. Here, we show that extremely low-coverage sequencing (0.1–0.5×) captures almost as much of the common (>5%) and low-frequency (1–5%) variation across the genome as SNP arrays. As an empirical demonstration, we show that genome-wide SNP genotypes can be inferred at a mean r² of 0.71 using off-target data (0.24× average coverage) in a whole-exome study of 909 samples. Using both simulated and real exome-sequencing data sets, we show that association statistics obtained using extremely low-coverage sequencing data attain similar P values at known associated variants as data from genotyping arrays, without an excess of false positives. Within the context of reductions in sample preparation and sequencing costs, funds invested in extremely low-coverage sequencing can yield several times the effective sample size of GWAS based on SNP array data and a commensurate increase in statistical power.

Figures

Figure 1
Genotype imputation accuracy as a function of coverage in 1000 Genomes Project simulations. Accuracy as a function of coverage is displayed using solid lines for common SNPs (MAF >5%) and dashed lines for low-frequency SNPs (MAF <5%).
Figure 2
Observed versus expected association −log10 P values at 103,977 SNPs across the genome in simulated null data sets over the 909 samples of the combined data set. We observe an r² of 0.64 between P values computed in typed versus imputed data, similar to simulations of association statistics at imputed versus genotyping calls (Supplementary Note). Results for the alternative hypothesis of association can be found in the Supplementary Note.
Figure 3
Genotype imputation accuracy in IHCS whole-exome data as a function of coverage. Illumina 1M genotype calls were used as a gold standard, restricting to 6,070 SNPs in 10 distinct 5-Mb regions (50 Mb in total) of the genome (see main text). Dotted lines denote results attained in 1000 Genomes simulations on the same SNP set.
Figure 4
Coverage (and corresponding number of samples) for a fixed budget of $300,000. (a) Effective sample size in sequencing-based GWAS as a function of the number of samples and the resulting coverage. Cost assumptions: $30 per sample for sample preparation and $133 per 1× of sequencing (see main text). (b) Ratio of expected association statistic (effective sample size) in sequencing-based GWAS versus array-based GWAS at $400 per sample, as a function of sample preparation and sequencing costs. Expected association statistics for sequencing-based GWAS are based on the optimum coverage and number of samples (assuming an arbitrarily large number of samples is available) subject to the budget constraint. The optimum coverage and number of samples vary at different points on the graph (not shown). The black dot denotes a $30 sample preparation cost and $133 per 1×.
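
The budget trade-off described in the Figure 4 caption can be illustrated with a little arithmetic. The sketch below is not the authors' model: it assumes that effective sample size is approximately the number of samples times the imputation r², that array genotypes are treated as perfect (r² = 1), and that the r² of 0.71 reported in the abstract for 0.24× off-target exome data can stand in for a low-coverage whole-genome design. The function names are hypothetical; only the dollar figures and the r² value come from the text above.

```python
from math import floor


def samples_for_budget(budget, prep_cost, cost_per_1x, coverage):
    """Number of samples affordable at a given per-sample coverage.

    Per-sample cost = sample preparation + sequencing cost scaled by coverage.
    """
    per_sample_cost = prep_cost + cost_per_1x * coverage
    return floor(budget / per_sample_cost)


def effective_sample_size(n_samples, r2):
    """Approximate effective sample size as n * r2 (an assumption of this sketch)."""
    return n_samples * r2


BUDGET = 300_000        # fixed budget from the Figure 4 caption
PREP_COST = 30          # sample preparation cost per sample
COST_PER_1X = 133       # sequencing cost per 1x coverage
ARRAY_COST = 400        # array genotyping cost per sample

# Assumption: reuse the abstract's empirical pair (0.24x coverage, r2 = 0.71).
COVERAGE = 0.24
R2_AT_COVERAGE = 0.71

n_seq = samples_for_budget(BUDGET, PREP_COST, COST_PER_1X, COVERAGE)
n_eff_seq = effective_sample_size(n_seq, R2_AT_COVERAGE)

# Assumption: array genotypes are treated as error-free, so n_eff equals n.
n_array = floor(BUDGET / ARRAY_COST)
ratio = n_eff_seq / n_array

print(f"sequenced samples: {n_seq}")
print(f"effective sample size (sequencing): {n_eff_seq:.0f}")
print(f"array samples: {n_array}")
print(f"ratio of effective sample sizes: {ratio:.2f}")
```

Under these assumptions the sequencing design affords 4,844 samples against 750 array samples, for a ratio of effective sample sizes of roughly 4.6, which is in line with the abstract's "several times". The paper's own figures are based on the optimum coverage for each cost setting, so this single point should be read only as an order-of-magnitude check.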
