Hum Hered. 2012;73(3):139-47.
doi: 10.1159/000337300. Epub 2012 Jun 7.

Two-stage extreme phenotype sequencing design for discovering and testing common and rare genetic variants: efficiency and power

Guolian Kang et al. Hum Hered. 2012.

Abstract

Next-generation sequencing technology provides an unprecedented opportunity to identify rare susceptibility variants. It is not yet financially feasible to perform whole-genome sequencing on a large number of subjects, and a two-stage design has been advocated as a practical option. In stage I, variants are discovered by sequencing the whole genomes of a small number of carefully selected individuals. In stage II, the discovered variants are genotyped in a large number of individuals to assess associations. Individuals with extreme phenotypes are typically selected in stage I. Using simulated data for unrelated individuals, we explore two important aspects of this two-stage design: the efficiency of discovering common and rare single-nucleotide polymorphisms (SNPs) in stage I, and the impact of incomplete SNP discovery in stage I on the power of testing associations in stage II. We applied a sum test and a sum of squared score (SSU) test for gene-based association analyses to evaluate the power of the two-stage design. We obtained the following results from extensive simulation studies and analysis of the GAW17 dataset. When individuals with trait values more extreme than the 99.7-99th quantile were included in stage I, the two-stage design could achieve power equal to or even higher than that of the one-stage design if the rare causal variants had large effect sizes. In such a design, fewer than half of the total SNPs, including more than half of the causal SNPs, were discovered; the discovered SNPs included nearly all SNPs with minor allele frequencies (MAFs) ≥5%, more than half of the SNPs with MAFs between 1% and 5%, and fewer than half of the SNPs with MAFs <1%. Although a one-stage design may be preferable for identifying multiple rare variants with small to moderate effect sizes, our observations support using the two-stage design as a cost-effective option for next-generation sequencing studies.
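The stage I discovery step described above can be illustrated with a toy simulation. The sketch below is not the authors' simulation: the MAF values, effect size, sample sizes, and all function names are hypothetical choices made for illustration. It draws independent SNPs for unrelated individuals, builds a quantitative trait from a few causal SNPs, and compares the fraction of polymorphic SNPs seen in stage I under extreme phenotype sampling with that under random sampling.

```python
import random


def simulate_cohort(n_subjects, mafs, causal, effect, rng):
    """Draw independent SNP genotypes (0/1/2 minor alleles) and a trait.

    The additive trait model and its effect size are illustrative assumptions.
    """
    genotypes, traits = [], []
    for _ in range(n_subjects):
        g = [int(rng.random() < m) + int(rng.random() < m) for m in mafs]
        y = rng.gauss(0.0, 1.0) + sum(effect * g[j] for j in causal)
        genotypes.append(g)
        traits.append(y)
    return genotypes, traits


def select_extreme(traits, k):
    """Return indices of k individuals, split between the two trait tails."""
    order = sorted(range(len(traits)), key=traits.__getitem__)
    lower = k // 2
    upper = k - lower
    return order[:lower] + order[len(order) - upper:]


def select_random(n_subjects, k, rng):
    """Return indices of k individuals chosen uniformly at random."""
    return rng.sample(range(n_subjects), k)


def discovered_fraction(genotypes, selected):
    """Fraction of SNPs polymorphic in the full cohort that are carried by
    at least one selected (stage I) individual."""
    n_snps = len(genotypes[0])
    polymorphic = [j for j in range(n_snps)
                   if any(g[j] > 0 for g in genotypes)]
    if not polymorphic:
        return 0.0
    found = [j for j in polymorphic
             if any(genotypes[i][j] > 0 for i in selected)]
    return len(found) / len(polymorphic)


rng = random.Random(2012)
# Hypothetical MAF spectrum dominated by rare variants.
mafs = [rng.choice([0.002, 0.005, 0.02, 0.1, 0.3]) for _ in range(60)]
causal = range(6)  # the first six SNPs are treated as causal
genotypes, traits = simulate_cohort(1000, mafs, causal, effect=1.0, rng=rng)

k = 20  # stage I sample size, i.e. a sampling proportion of 0.02
extreme = select_extreme(traits, k)
randomly = select_random(len(traits), k, rng)
print("extreme sampling:", discovered_fraction(genotypes, extreme))
print("random sampling: ", discovered_fraction(genotypes, randomly))
```

Restricting the denominator to SNPs that are polymorphic in the full cohort is a design choice of this sketch; a variant that no simulated individual carries cannot be discovered under any sampling scheme.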

Figures

Figure 1. SNP discovery in stage I when all SNPs are in LE and the cost reduction of the two-stage designs based on simulation data
The black x-axis is the proportion of stage I individuals times 100 (l × 100), and the yellow x-axis is the cost function with Ts/Tw = 0.5. The black, red, green, blue, and pink lines indicate the percentages of the total discovered SNPs, CVs, LCVs, RVs, and causal SNPs, respectively. Letters “e” and “r” indicate two-stage designs with extreme phenotype sampling (TS-E) or random sampling (TS-R), respectively.
Figure 2. The power of TS-E under the 5 disease models when all SNPs are in LE based on 1000 sets of simulated data
The black x-axis is the proportion of stage I individuals times 100 (l × 100). Solid black line with letter “o”: one-stage design using the sum test; Solid red line with letter “e”: the two-stage design with extreme phenotype sampling (TS-E) using the sum test; Solid green line with letter “r”: the two-stage design with random sampling (TS-R) using the sum test; Dashed black line with letter “o”: one-stage design using the sum of squares (SSU) test; Dashed red line with letter “e”: the TS-E using the SSU test; Dashed green line with letter “r”: TS-R using the SSU test.
Figure 3. SNP discovery of the two-stage design with extreme phenotype sampling (TS-E) in the GAW17 data
The black x-axis is the proportion of stage I individuals times 100 (l × 100), and the yellow x-axis is the cost function with Ts/Tw = 0.5. The black, red, green, blue, and pink lines indicate the percentages of the total discovered SNPs, CVs, LCVs, RVs, and causal SNPs, respectively.
Figure 4. The −log10 p-values for the 13 causal genes under the two-stage design with extreme phenotype sampling (TS-E) in the GAW17 data (l = 0.0035)
The y-axis is −log10 (p-value). The numbers 1, 2, 3, and 4 respectively correspond to results of TS-E using the sum test, one-stage design using sum test, TS-E using the SSU test, and one-stage design using the SSU test. The green and red lines indicate the cutoffs for the one-stage design and TS-E with l = 0.0035 using Bonferroni correction.
