Genet Epidemiol. 2011 May;35(4):269-77.
doi: 10.1002/gepi.20575.

Efficient study design for next generation sequencing


Joshua Sampson et al. Genet Epidemiol. 2011 May.

Abstract

Next Generation Sequencing represents a powerful tool for detecting genetic variation associated with human disease. Because of the high cost of this technology, it is critical that we develop efficient study designs that consider the trade-off between the number of subjects (n) and the coverage depth (µ). How we divide our resources between the two can greatly impact study success, particularly in pilot studies. We propose a strategy for selecting the optimal combination of n and µ for studies aimed at detecting rare variants and for studies aimed at detecting associations between rare or uncommon variants and disease. For detecting rare variants, we find the optimal coverage depth to be between 2 and 8 reads when using the likelihood ratio test. For association studies, we find the strategy of sequencing all available subjects to be preferable. In deriving these combinations, we provide a detailed analysis describing the distribution of depth across a genome and the depth needed to identify a minor allele in an individual. The optimal coverage depth depends on the aims of the study, and the chosen depth can have a large impact on study success.


Figures

Fig. 1
Comparing the observed distribution of coverage depth to the distribution estimated by the model. Graphs are for the three subjects with the lowest, median, and highest coverage depth among the 116 individuals genotyped at the Sanger Center. The black dots show the true proportion of SNPs with the specified read depth (x-axis), the red/unbroken line shows the distribution of a negative binomial with that subject’s mean depth and ζ = 4, and the blue/dashed line shows the Poisson distribution.
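The contrast this caption draws between the two depth models can be tabulated with a short sketch. The caption does not define ζ, so the code below assumes, purely for illustration, that ζ is the variance-to-mean ratio of the negative binomial (variance = ζ × mean). The mean depth of 4 and the function names are hypothetical and not taken from the paper.

```python
import math


def poisson_pmf(k, mean):
    """P(depth = k) under a Poisson model with the given mean depth."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def negbin_pmf(k, mean, zeta):
    """P(depth = k) under a negative binomial with the given mean.

    Assumption (not stated in the caption): zeta is the variance-to-mean
    ratio, so zeta must exceed 1 and variance = zeta * mean.
    """
    size = mean / (zeta - 1.0)   # number-of-successes parameter
    p = 1.0 / zeta               # success probability
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def depth_table(mean, zeta, max_depth):
    """Rows of (depth, Poisson probability, negative binomial probability)."""
    return [(k, poisson_pmf(k, mean), negbin_pmf(k, mean, zeta))
            for k in range(max_depth + 1)]


if __name__ == "__main__":
    # Hypothetical mean depth of 4 with zeta = 4, as in the caption.
    for k, pois, nb in depth_table(mean=4.0, zeta=4.0, max_depth=12):
        print(f"{k:>3d}  {pois:.4f}  {nb:.4f}")
```

Under this assumed parameterization the negative binomial places more probability on very low and very high depths than a Poisson with the same mean, which is the kind of overdispersion the figure compares against the observed data.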
Fig. 2
(A) The power to detect a heterozygote individual as a function of average depth and α-level when r = 0.01, p_MA = 0.5, and assuming K_ij follows a negative binomial distribution. (B) The power to detect a heterozygote individual for different values of ζ (α = 10⁻⁵). Note the change in x-axis. (C) The power to detect a heterozygote individual for different values of p_MA, or different read biases (α = 10⁻⁵, ζ = 4).
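A rough feel for how per-individual power grows with depth can be had from a simplified threshold test. This is a sketch under stated assumptions, not the procedure used in the paper: an individual is called heterozygous when the number of minor-allele reads reaches the smallest count whose probability under pure read error (rate r) is at most α; a read error is assumed to flip one allele into the other; and depth is taken as Poisson rather than negative binomial to keep the code short. All function names are hypothetical.

```python
import math


def binom_sf(m, k, q):
    """P(X >= m) for X ~ Binomial(k, q)."""
    return sum(math.comb(k, x) * q**x * (1 - q)**(k - x)
               for x in range(m, k + 1))


def critical_count(k, r, alpha):
    """Smallest m with P(Binomial(k, r) >= m) <= alpha, or None if no such m."""
    for m in range(0, k + 1):
        if binom_sf(m, k, r) <= alpha:
            return m
    return None


def het_power_at_depth(k, r, p_ma, alpha):
    """Power to call a heterozygote covered by exactly k reads."""
    m = critical_count(k, r, alpha)
    if m is None:
        return 0.0  # too few reads to ever reach significance
    # Assumption: an error turns a major-allele read into a minor one and back.
    q = p_ma * (1 - r) + (1 - p_ma) * r
    return binom_sf(m, k, q)


def het_power(mean_depth, r=0.01, p_ma=0.5, alpha=1e-5, max_depth=200):
    """Power averaged over Poisson-distributed depth (a simplifying assumption)."""
    total = 0.0
    for k in range(0, max_depth + 1):
        weight = math.exp(k * math.log(mean_depth) - mean_depth
                          - math.lgamma(k + 1))
        total += weight * het_power_at_depth(k, r, p_ma, alpha)
    return total


if __name__ == "__main__":
    for mu in (2, 5, 10, 20, 30):
        print(mu, round(het_power(mu), 3))
```

Lowering `p_ma` below 0.5 mimics a read bias against the minor allele, and tightening `alpha` raises the required read count. Both reduce power at a given depth in this toy model.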
Fig. 3
Main Figure: (A) The power to detect a rare variant for all possible combinations of n (number of subjects) and μ (depth of sequencing) using a likelihood ratio test, with λ_j ~ Poisson, r = 0.001, a ≡ α = 0.0001, and MAF = 0.005. (B) The black/unbroken line is power as a function of μ when n × μ = 500 with the above parameters. The red/dashed and green/dotted lines show the μ/power relationships when r = 0.01 and a ≡ α = 0.01, respectively. (C) The black/unbroken line shows the optimal λ as a function of read error rate, with α = 0.0001, MAF = 0.005, n × μ = 500, and λ ~ Poisson. The red/dashed line shows the optimal λ when α is raised to 0.01.
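The fixed-budget trade-off in panel (B) can be illustrated with a toy calculation. The sketch below is not the likelihood ratio test named in the caption and should not be read as reproducing Fig. 3. It assumes a hypothetical calling rule (a carrier is detected when at least two minor-allele reads are seen), Poisson depth, unbiased reads, no read errors, and Hardy-Weinberg carrier frequencies. Only the budget n × μ = 500 and MAF = 0.005 come from the caption.

```python
import math


def carrier_call_prob(mu, min_reads=2):
    """P(at least min_reads minor-allele reads) in a heterozygous carrier.

    With Poisson(mu) depth and each read equally likely to show either
    allele, the minor-allele read count is Poisson(mu / 2).
    """
    lam = mu / 2.0
    below = sum(math.exp(-lam) * lam**x / math.factorial(x)
                for x in range(min_reads))
    return 1.0 - below


def variant_detection_power(n, mu, maf, min_reads=2):
    """P(the variant is called in at least one of n sequenced subjects)."""
    carrier_freq = 2.0 * maf * (1.0 - maf)  # heterozygotes only (assumption)
    per_subject = carrier_freq * carrier_call_prob(mu, min_reads)
    return 1.0 - (1.0 - per_subject) ** n


def sweep_budget(budget, maf, depths):
    """Rows of (mu, n, power) with n = budget // mu subjects at each depth."""
    return [(mu, budget // mu,
             variant_detection_power(budget // mu, mu, maf))
            for mu in depths]


if __name__ == "__main__":
    rows = sweep_budget(budget=500, maf=0.005, depths=range(1, 21))
    for mu, n, power in rows:
        print(f"mu={mu:>2d}  n={n:>3d}  power={power:.3f}")
    print("best mu in this toy model:", max(rows, key=lambda row: row[2])[0])
```

Even in this simplified setting the power curve has an interior maximum: very shallow depth rarely yields enough minor-allele reads in a carrier, while very deep coverage leaves too few subjects to include a carrier at all.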
Fig. 4
For fixed cost, the power to detect an association decreases with μ. The sharpness of the decline depends on the relative risk attributable to the SNP. The relationship between power and μ is illustrated for four different values of the relative risk when MAF = 0.05, r = 0.01, α = 0.0001, and n × μ = 50,000.

