Re-ranking sequencing variants in the post-GWAS era for accurate causal variant identification

Laura L Faye¹, Mitchell J Machiela, Peter Kraft, Shelley B Bull, Lei Sun

PMID: 23950724
PMCID: PMC3738448
DOI: 10.1371/journal.pgen.1003609

Laura L Faye et al. PLoS Genet. 2013;9(8):e1003609. doi: 10.1371/journal.pgen.1003609. Epub 2013 Aug 8.

Affiliation

¹ Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada.

Abstract

Next generation sequencing has dramatically increased our ability to localize disease-causing variants by providing base-pair level information at costs increasingly feasible for the large sample sizes required to detect complex-trait associations. Yet, identification of causal variants within an established region of association remains a challenge. Counter-intuitively, certain factors that increase power to detect an associated region can decrease power to localize the causal variant. First, combining GWAS with imputation or low coverage sequencing to achieve the large sample sizes required for high power can have the unintended effect of producing differential genotyping error among SNPs. This tends to bias the relative evidence for association toward better genotyped SNPs. Second, re-use of GWAS data for fine-mapping exploits previous findings to ensure genome-wide significance in GWAS-associated regions. However, using GWAS findings to inform fine-mapping analysis can bias evidence away from the causal SNP toward the tag SNP and SNPs in high LD with the tag. Together these factors can reduce power to localize the causal SNP by more than half. Other strategies commonly employed to increase power to detect association, namely increasing sample size and using higher density genotyping arrays, can, in certain common scenarios, actually exacerbate these effects and further decrease power to localize causal variants. We develop a re-ranking procedure that accounts for these adverse effects and substantially improves the accuracy of causal SNP identification, often doubling the probability that the causal SNP is top-ranked. Application to the NCI BPC3 aggressive prostate cancer GWAS with imputation meta-analysis identified a new top SNP at 2 of 3 associated loci and several additional possible causal SNPs at these loci that may have otherwise been overlooked. This method is simple to implement using R scripts provided on the author's website.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. Tagging effect decreases localization success rates with or without the selection effect.**
The expected values of the association test statistics at a tag SNP (red) and the causal SNP (black), with shading from the 25th to 75th percentiles (**A, C**), and the localization success rates (**B, D**) for association studies (1000 cases and 1000 controls) of one causal SNP (MAF = 0.12; OR = 1.25; perfect genotyping accuracy) and one tag SNP (MAF = 0.12; varying degrees of correlation with the causal SNP, *r* = 0.2 to 1; perfect genotyping accuracy), with no selection for significance at the tag SNP (**A, B**) or selection at the tag SNP requiring the test statistic *T_G* to be significant with p-value < 0.05 (**C, D**).
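The setting in Figure 1 can be explored with a small Monte Carlo sketch. This is not the authors' code (their scripts are in R), and it rests on assumptions that are not stated in the caption: the two test statistics are treated as bivariate normal z-scores with correlation *r*, the tag SNP's expected statistic is taken to be *r* times that of the causal SNP, and the causal SNP's expected z-score comes from a rough allelic-test approximation. The function names `approx_noncentrality` and `localization_success` are hypothetical.

```python
import numpy as np

Z_CRIT_05 = 1.959964  # two-sided critical value for p-value < 0.05


def approx_noncentrality(n_cases, n_controls, maf, odds_ratio):
    """Rough expected z-score at the causal SNP (assumption: allelic test,
    small effect, Var(log OR) ~ 1/(2*n_cases*p*q) + 1/(2*n_controls*p*q))."""
    pq = maf * (1.0 - maf)
    var_log_or = 1.0 / (2 * n_cases * pq) + 1.0 / (2 * n_controls * pq)
    return float(np.log(odds_ratio) / np.sqrt(var_log_or))


def localization_success(mu, r, select_z=None, n_sim=200_000, seed=0):
    """Estimate P(|T_C| > |T_G|): the causal SNP outranks the tag SNP.

    mu       : expected z-score at the causal SNP
    r        : correlation between causal and tag SNP (use r < 1)
    select_z : if given, keep only replicates with |T_G| > select_z,
               mimicking selection for significance at the tag SNP
    """
    rng = np.random.default_rng(seed)
    e1 = rng.standard_normal(n_sim)
    e2 = rng.standard_normal(n_sim)
    t_c = mu + e1                                     # causal SNP statistic
    t_g = r * mu + r * e1 + np.sqrt(1 - r**2) * e2    # tag SNP statistic
    if select_z is not None:
        keep = np.abs(t_g) > select_z
        t_c, t_g = t_c[keep], t_g[keep]
    return float(np.mean(np.abs(t_c) > np.abs(t_g)))


# Illustrative run with the caption's design parameters.
mu = approx_noncentrality(1000, 1000, maf=0.12, odds_ratio=1.25)
for r in (0.5, 0.8, 0.95):
    no_sel = localization_success(mu, r)
    sel = localization_success(mu, r, select_z=Z_CRIT_05)
    print(f"r={r:.2f}  no selection: {no_sel:.3f}  selection at tag: {sel:.3f}")
```

Under these assumptions the success rate falls as *r* increases, and conditioning on a significant tag SNP lowers it further, which is the qualitative pattern the caption describes. The exact values depend on the approximations above and should not be read as reproducing the published curves.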

**Figure 2. Low genotyping accuracy further reduces localization success rates with or without the selection effect.**
Localization success rates for association studies (1000 cases and 1000 controls) of one causal SNP (MAF = 0.12; OR = 1.25; *imperfect genotyping accuracy* due to sequencing or imputation errors, resulting in correlation between the actual and estimated genotypes of *ρ_C* = 0.80 (blue dash-dotted) to 1 (black solid)) and one tag SNP (MAF = 0.12; varying degrees of correlation with the causal SNP, *r_CG* = 0.2 to 1 (X-axis); perfect genotyping accuracy with *ρ_G* = 1), with no selection for significance at the tag SNP (**A**) or selection at the tag SNP requiring the test statistic *T_G* to be significant with p-value < 0.05 (**B**).

**Figure 3. Well-tagged causal SNPs sequenced with low accuracy are unlikely to be correctly identified even as sample size increases.**
Localization success rates for association studies (sample size from 50:50 cases:controls to 5000:5000 cases:controls, X-axis) of one causal SNP (MAF = 0.12; OR = 1.25; *imperfect genotyping accuracy* due to sequencing or imputation errors, resulting in correlation between the actual and estimated genotypes of *ρ_C* = 0.95) and one tag SNP (MAF = 0.12; *in high correlation with the causal SNP*, *r_CG* = 0.8 (purple solid) to 0.98 (red dashed); *100% genotyping accuracy with ρ_G = 1*), with no selection for significance at the tag SNP.
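A back-of-the-envelope calculation suggests why a larger sample need not help in the Figure 3 setting. It is a sketch under assumptions not stated in the caption: genotyping error is nondifferential with respect to case status, the tag SNP is associated with disease only through the causal SNP, and $\mu_n$ (a symbol introduced here) denotes the expected test statistic at a perfectly genotyped causal SNP, which grows roughly in proportion to $\sqrt{n}$.

```latex
\begin{align*}
  \mathbb{E}[T_C] &\approx \rho_C \,\mu_n, \\
  \mathbb{E}[T_G] &\approx \rho_G \, r_{CG} \,\mu_n, \\
  \mathbb{E}[T_C] - \mathbb{E}[T_G] &\approx (\rho_C - \rho_G \, r_{CG})\,\mu_n .
\end{align*}
% With \rho_C = 0.95 and \rho_G = 1:
%   r_{CG} = 0.80:  (0.95 - 0.80)\,\mu_n = +0.15\,\mu_n  (causal SNP favored)
%   r_{CG} = 0.98:  (0.95 - 0.98)\,\mu_n = -0.03\,\mu_n  (tag SNP favored)
```

Under these assumptions the sign of $\rho_C - \rho_G\, r_{CG}$ determines which SNP has the larger expected statistic, and the gap widens as $\mu_n$ grows. When $r_{CG}$ exceeds $\rho_C$, as with $r_{CG} = 0.98$ here, increasing the sample size would be expected to favor the tag SNP more consistently, in line with the figure title.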

**Figure 4. Naïve test statistics and re-ranking statistics for regions surrounding rs78246868 in the 8q24.21 region for association with prostate cancer risk.**
Naïve test statistics (A), and re-ranking statistics adjusting for genotyping accuracy (B) for SNPs in LD (r²>0.2) with rs78246868. Circles highlight SNPs whose rank changed considerably after re-ranking. Color indicates pair-wise correlation with the most significant SNP in the region selected based on the naïve ranking (purple diamond). Other shapes indicate genotyping accuracy over all 7 studies as measured by *ρ_meta*. rs78246868 is no longer the most significant SNP in the region after re-ranking.

**Figure 5. Naïve test statistics and re-ranking statistics for regions surrounding rs8071558 in the 17q24.3 region for association with prostate cancer risk.**
Naïve test statistics (A), and re-ranking statistics adjusting for genotyping accuracy (B) for SNPs in LD (r²>0.2) with rs8071558. Circles highlight SNPs whose rank changed considerably after re-ranking. Color indicates pair-wise correlation with the most significant SNP in the region selected based on the naïve ranking (purple diamond). Other shapes indicate genotyping accuracy over all 7 studies as measured by *ρ_meta*. rs8071558 is no longer the most significant SNP in the region after re-ranking.
