Two-stage extreme phenotype sequencing design for discovering and testing common and rare genetic variants: efficiency and power

Guolian Kang¹, Dongyu Lin, Hakon Hakonarson, Jinbo Chen

PMID: 22678112
PMCID: PMC3558993
DOI: 10.1159/000337300

Hum Hered. 2012;73(3):139-47. doi: 10.1159/000337300. Epub 2012 Jun 7.

Affiliation

¹ Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, PA 19104, USA.

Abstract

Next-generation sequencing technology provides an unprecedented opportunity to identify rare susceptibility variants. Because it is not yet financially feasible to perform whole-genome sequencing on a large number of subjects, a two-stage design has been advocated as a practical option. In stage I, variants are discovered by sequencing the whole genomes of a small number of carefully selected individuals, typically those with extreme phenotypes. In stage II, the discovered variants are genotyped in a large number of individuals to assess associations. Using simulated data for unrelated individuals, we explore two important aspects of this two-stage design: the efficiency of discovering common and rare single-nucleotide polymorphisms (SNPs) in stage I, and the impact of incomplete SNP discovery in stage I on the power of testing associations in stage II. We applied a sum test and a sum of squared score (SSU) test for gene-based association analyses to evaluate the power of the two-stage design. We obtained the following results from extensive simulation studies and analysis of the GAW17 dataset. When individuals with trait values more extreme than the 99.7-99th quantile were included in stage I, the two-stage design could achieve the same power as, or even higher power than, the one-stage design if the rare causal variants had large effect sizes. In such a design, fewer than half of the total SNPs, but more than half of the causal SNPs, were discovered. The discovered SNPs included nearly all SNPs with minor allele frequencies (MAFs) ≥5%, more than half of the SNPs with MAFs between 1% and 5%, and fewer than half of the SNPs with MAFs <1%. Although a one-stage design may be preferable for identifying multiple rare variants with small to moderate effect sizes, our observations support using the two-stage design as a cost-effective option for next-generation sequencing studies.
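The stage I discovery step described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' simulation code. It assumes independent SNPs under Hardy-Weinberg proportions, an additive quantitative trait with standard normal noise, extreme sampling split equally between both tails, and a SNP counted as "discovered" when at least one stage I individual carries a minor allele. All function names, MAFs, effect sizes, and sample sizes are hypothetical.

```python
import random

def simulate_genotypes(n, mafs, rng):
    # Each genotype counts minor alleles (0, 1, or 2). SNPs are drawn
    # independently (an assumption: linkage equilibrium, Hardy-Weinberg).
    return [[(rng.random() < p) + (rng.random() < p) for p in mafs]
            for _ in range(n)]

def simulate_trait(genotypes, effects, rng):
    # Additive quantitative trait with standard normal noise (assumed model).
    return [sum(b * g for b, g in zip(effects, row)) + rng.gauss(0.0, 1.0)
            for row in genotypes]

def select_stage1(trait, fraction, extreme, rng):
    # Return indices of stage I individuals: either a random subset or,
    # for extreme sampling, the lowest and highest trait values (both
    # tails, split equally; this split is an assumption of the sketch).
    n = len(trait)
    k = max(2, round(n * fraction))
    if not extreme:
        return rng.sample(range(n), k)
    order = sorted(range(n), key=lambda i: trait[i])
    low = k // 2
    return order[:low] + order[n - (k - low):]

def discovery_rate(genotypes, stage1):
    # Fraction of SNPs polymorphic in the full sample that show at least
    # one minor allele among the stage I individuals.
    m = len(genotypes[0])
    polymorphic = [j for j in range(m) if any(row[j] for row in genotypes)]
    found = [j for j in polymorphic if any(genotypes[i][j] for i in stage1)]
    return len(found) / len(polymorphic) if polymorphic else 0.0

if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical mix: 5 common, 10 less common, 35 rare SNPs;
    # the last 5 rare SNPs are causal with a large effect.
    mafs = [0.2] * 5 + [0.03] * 10 + [0.005] * 35
    effects = [0.0] * 45 + [1.0] * 5
    g = simulate_genotypes(2000, mafs, rng)
    y = simulate_trait(g, effects, rng)
    for label, extreme in (("TS-E", True), ("TS-R", False)):
        s1 = select_stage1(y, 0.01, extreme, rng)
        print(label, round(discovery_rate(g, s1), 3))
```

Restricting `discovery_rate` to subsets of SNP indices (for example, only the causal ones) would give per-class discovery percentages of the kind summarized in the abstract.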

Figures

**Figure 1. SNP discovery in stage I when all SNPs are in linkage equilibrium (LE), and the cost reduction of the two-stage designs, based on simulated data**
The black x-axis is the proportion of stage I individuals times 100 (l × 100), and the yellow x-axis is the cost function with *T_s*/*T_w* = 0.5. The black, red, green, blue, and pink lines indicate the percentages of the total discovered SNPs, CVs, LCVs, RVs, and causal SNPs, respectively. Letters “e” and “r” indicate two-stage designs with extreme phenotype sampling (TS-E) or random sampling (TS-R), respectively.

**Figure 2. The power of TS-E under the 5 disease models when all SNPs are in LE, based on 1000 sets of simulated data**
The black x-axis is the proportion of stage I individuals times 100 (l × 100). Solid black line with letter “o”: one-stage design using the sum test; solid red line with letter “e”: the two-stage design with extreme phenotype sampling (TS-E) using the sum test; solid green line with letter “r”: the two-stage design with random sampling (TS-R) using the sum test; dashed black line with letter “o”: one-stage design using the sum of squared score (SSU) test; dashed red line with letter “e”: TS-E using the SSU test; dashed green line with letter “r”: TS-R using the SSU test.
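The two gene-based statistics compared in Figure 2 can be sketched from per-SNP score components. This is a minimal illustration and not the authors' implementation. It assumes a quantitative trait, uses unstandardized score-type forms of the two statistics (the squared sum of scores and the sum of squared scores), and swaps in a permutation p-value in place of the asymptotic null distributions. All function names are hypothetical.

```python
import random

def score_vector(trait, genotypes):
    # Per-SNP score U_j = sum_i (y_i - mean(y)) * x_ij.
    n = len(trait)
    ybar = sum(trait) / n
    m = len(genotypes[0])
    return [sum((trait[i] - ybar) * genotypes[i][j] for i in range(n))
            for j in range(m)]

def sum_stat(u):
    # Sum-type statistic: square of the summed scores (unstandardized).
    return sum(u) ** 2

def ssu_stat(u):
    # SSU-type statistic: sum of squared scores.
    return sum(x * x for x in u)

def permutation_pvalue(trait, genotypes, stat, n_perm=999, seed=0):
    # Permutation p-value: shuffle trait values across individuals and
    # count how often the permuted statistic reaches the observed one.
    rng = random.Random(seed)
    observed = stat(score_vector(trait, genotypes))
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stat(score_vector(shuffled, genotypes)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

One design point the sketch makes visible: scores with opposite signs cancel in `sum_stat` but not in `ssu_stat`. For scores `[2, -2]`, the sum-type statistic is 0 while the SSU-type statistic is 8.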

**Figure 3. SNP discovery of the two-stage design with extreme phenotype sampling (TS-E) in the GAW17 data**
The black x-axis is the proportion of stage I individuals times 100 (l × 100), and the yellow x-axis is the cost function with *T_s*/*T_w* = 0.5. The black, red, green, blue, and pink lines correspond to the percentage of total discovered SNPs, CVs, LCVs, discovered RVs, and causal SNPs, respectively.

**Figure 4. The −log₁₀ p-values for the 13 causal genes under the two-stage design with extreme phenotype sampling (TS-E) in the GAW17 data (l = 0.0035)**
The y-axis is −log₁₀ (p-value). The numbers 1, 2, 3, and 4 respectively correspond to results of TS-E using the sum test, one-stage design using sum test, TS-E using the SSU test, and one-stage design using the SSU test. The green and red lines indicate the cutoffs for the one-stage design and TS-E with l = 0.0035 using Bonferroni correction.
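The Bonferroni cutoffs drawn in Figure 4 sit on the −log₁₀ scale. The sketch below shows that conversion for a generic significance level and number of tests. The test counts in the usage note are hypothetical placeholders, not the numbers used in the paper, and the function names are illustrative only.

```python
import math

def bonferroni_cutoff(alpha, n_tests):
    # Bonferroni-adjusted significance threshold on the -log10 scale.
    return -math.log10(alpha / n_tests)

def significant(p_value, alpha, n_tests):
    # A gene passes when its -log10(p) exceeds the adjusted cutoff.
    return -math.log10(p_value) > bonferroni_cutoff(alpha, n_tests)
```

For example, `bonferroni_cutoff(0.05, 1000)` is about 4.30, and a design that tests fewer genes has a lower cutoff, which is one way two designs can end up with different threshold lines on the same plot.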

Cited by

Cancer pharmacogenomics: strategies and challenges.
Wheeler HE, Maitland ML, Dolan ME, Cox NJ, Ratain MJ. Nat Rev Genet. 2013 Jan;14(1):23-34. doi: 10.1038/nrg3352. Epub 2012 Nov 27. PMID: 23183705. Free PMC article. Review.
Phenotypic extremes in rare variant study designs.
Peloso GM, Rader DJ, Gabriel S, Kathiresan S, Daly MJ, Neale BM. Eur J Hum Genet. 2016 Jun;24(6):924-30. doi: 10.1038/ejhg.2015.197. Epub 2015 Sep 9. PMID: 26350511. Free PMC article.
A Systematic Review of Extreme Phenotype Strategies to Search for Rare Variants in Genetic Studies of Complex Disorders.
Amanat S, Requena T, Lopez-Escamez JA. Genes (Basel). 2020 Aug 25;11(9):987. doi: 10.3390/genes11090987. PMID: 32854191. Free PMC article.
Low-, high-coverage, and two-stage DNA sequencing in the design of the genetic association study.
Xu C, Wu K, Zhang JG, Shen H, Deng HW. Genet Epidemiol. 2017 Apr;41(3):187-197. doi: 10.1002/gepi.22015. Epub 2016 Nov 4. PMID: 27813156. Free PMC article.
Comparison of mixed model based approaches for correcting for population substructure with application to extreme phenotype sampling.
Onifade M, Roy-Gagnon MH, Parent MÉ, Burkett KM. BMC Genomics. 2022 Feb 4;23(1):98. doi: 10.1186/s12864-022-08297-y. PMID: 35120458. Free PMC article.
