GUESS-ing polygenic associations with multiple phenotypes using a GPU-based evolutionary stochastic search algorithm

PMID: 23950726
PMCID: PMC3738451
DOI: 10.1371/journal.pgen.1003657

Leonardo Bottolo et al. PLoS Genet. 2013.

PLoS Genet. 2013;9(8):e1003657.

doi: 10.1371/journal.pgen.1003657. Epub 2013 Aug 8.

Affiliation

¹ Department of Mathematics, Imperial College London, London, United Kingdom. l.bottolo@imperial.ac.uk

Abstract

Genome-wide association studies (GWAS) yielded significant advances in defining the genetic architecture of complex traits and disease. Still, a major hurdle of GWAS is narrowing down multiple genetic associations to a few causal variants for functional studies. This becomes critical in multi-phenotype GWAS where detection and interpretability of complex SNP(s)-trait(s) associations are complicated by complex Linkage Disequilibrium patterns between SNPs and correlation between traits. Here we propose a computationally efficient algorithm (GUESS) to explore complex genetic-association models and maximize genetic variant detection. We integrated our algorithm with a new Bayesian strategy for multi-phenotype analysis to identify the specific contribution of each SNP to different trait combinations and study genetic regulation of lipid metabolism in the Gutenberg Health Study (GHS). Despite the relatively small size of GHS (n = 3,175), when compared with the largest published meta-GWAS (n > 100,000), GUESS recovered most of the major associations and was better at refining multi-trait associations than alternative methods. Amongst the new findings provided by GUESS, we revealed a strong association of SORT1 with TG-APOB and LIPC with TG-HDL phenotypic groups, which were overlooked in the larger meta-GWAS and not revealed by competing approaches, associations that we replicated in two independent cohorts. Moreover, we demonstrated the increased power of GUESS over alternative multi-phenotype approaches, both Bayesian and non-Bayesian, in a simulation study that mimics real-case scenarios. We showed that our parallel implementation based on Graphics Processing Units outperforms alternative multi-phenotype methods. Beyond multivariate modelling of multi-phenotypes, our Bayesian model employs a flexible hierarchical prior structure for genetic effects that adapts to any correlation structure of the predictors and increases the power to identify associated variants. This provides a powerful tool for the analysis of diverse genomic features, for instance including gene expression and exome sequencing data, where complex dependencies are present in the predictor space.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. Schematic representation of the analysis of single and multiple phenotypes using GUESS.**
(A–B) Given a group of single traits (APOA1, APOB, HDL, LDL and TG), we constructed two top-down trees (green and blue colour coded) made by biologically driven combinations of phenotypes and centred on the pathways of LDL (A) and HDL (B). Each branch of the trees was regressed on the whole set of tagged SNPs (∼273K SNPs) using GUESS and adjusting for sex, age and body mass index. (C) Output from GUESS is used to derive the Best Models Visited (BMV), i.e. the most supported multivariate models, and their Model Posterior Probability (MPP), i.e. the fraction of the model space explained by the BMV (MPP of the top BMV and the cumulative MPP of the top five BMV are indicated in the first two columns, respectively). Based on an empirical FDR procedure, we selected a parsimonious set of significant SNPs (indicated on the top of the table with the associated locus) that explains the variation of each branch of the two trees. Merging this information with the list of SNPs in the top BMV allowed us to highlight a robust subset of significant SNPs that repeatedly contribute to the top supported model (significant SNPs are depicted in black whereas significant SNPs that are also in the top BMV are indicated in red). For each SNP, comparison of the marginal strength of association across different combinations of traits is possible by a new rescaled measure of marginal phenotype-SNP association, Ratio of Bayes Factors (RBF) (phenotype-SNP log₁₀(RBF) is truncated at 20 to increase readability). Based on Ensembl R66 annotation, each locus is classified as: (1) intronic, (2) 3′UTR, (3) downstream, (4) previously associated and (5) a tagSNP of a previously associated SNP. The name of the locus is also reported on the right of each branch of the two trees with the same colour code used in the table: black if the locus is associated with the phenotypes with FDR<5%, red if the locus is also in the top BMV with FDR<5%.
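The quantities named in panel (C) can be illustrated with a small sketch. It assumes, purely for illustration, that each visited model comes with an unnormalised log posterior and that probabilities are renormalised over the set of unique visited models; the caption does not state how GUESS itself computes them. The function name `model_posterior_summaries`, the SNP labels and the numbers are hypothetical.

```python
import math
from collections import defaultdict


def model_posterior_summaries(visited, top_k=5):
    """Summarise a set of visited models.

    visited: dict mapping a frozenset of SNP ids (one model) to its
             unnormalised log posterior (hypothetical input format).
    Assumption: probabilities are renormalised over visited models only.
    """
    # Log-sum-exp keeps the normalisation numerically stable.
    peak = max(visited.values())
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in visited.values()))

    # Model Posterior Probability (MPP) of every visited model.
    mpp = {model: math.exp(v - log_norm) for model, v in visited.items()}

    # Best Models Visited (BMV): models ranked by decreasing MPP.
    ranked = sorted(mpp.items(), key=lambda kv: kv[1], reverse=True)
    top_mpp = ranked[0][1]
    cumulative_mpp = sum(p for _, p in ranked[:top_k])

    # Marginal posterior probability of inclusion (MPPI) of each SNP:
    # the total MPP of the models that contain it.
    mppi = defaultdict(float)
    for model, p in mpp.items():
        for snp in model:
            mppi[snp] += p

    return ranked, top_mpp, cumulative_mpp, dict(mppi)


# Toy example with three visited models (made-up values).
visited = {
    frozenset({"rsA", "rsB"}): -100.0,
    frozenset({"rsA"}): -100.0 + math.log(0.5),
    frozenset({"rsA", "rsC"}): -100.0 + math.log(0.5),
}
ranked, top_mpp, cumulative_mpp, mppi = model_posterior_summaries(visited)
```

In this toy input the top model carries half of the renormalised mass, and `rsA` appears in every visited model, so its marginal inclusion probability is 1 under the stated assumption.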

**Figure 2. Comparison of the marginal phenotype-SNP associations provided by GUESS, SNPTEST and piMASS in the single trait analysis of TG.**
(To increase readability, the log₁₀(BFs) are truncated at 20). (A) Genome-wide log₁₀(BF) obtained from GUESS. Significant SNPs found associated at an FDR of 5% are depicted by black dots (with the SNP's name) whereas significant SNPs that are also in the top Best Model Visited are represented by red dots (also with the SNP's name). (B) Genome-wide log₁₀(BF) obtained from SNPTEST. The horizontal dashed line indicates the level of log₁₀(BF) that provides strong evidence of a phenotype-SNP association with Marginal Posterior Probability of inclusion close to 1. For comparison purposes, SNPs detected by GUESS are highlighted (their name is printed). SNPs found by SNPTEST with log₁₀(BF)>5 are colour coded according to the level of pairwise Pearson correlation with the closest significant GUESS SNP (see colour bar for correlation scale). (C) Genome-wide log₁₀(BF) obtained from piMASS. The horizontal dashed line indicates the level of log₁₀(BF) that provides strong evidence for a phenotype-SNP association. (D) log₁₀(BF) signals obtained from SNPTEST in a region of chromosome 11 spanning nearly 500 Kb (116,519,739–116,845,104 bp). The horizontal dashed line and colour code used to identify relevant SNPs are the same as defined in (B). Top bars indicate the position of genes in the region retrieved from Ensembl R66. (E) Scatterplot of genome-wide log₁₀(BF) of TG obtained from GUESS and SNPTEST. Colour code used to identify relevant SNPs and the horizontal dashed line are as defined in (A) and (B). (F) Scatterplot of genome-wide log₁₀(BF) of TG obtained from GUESS and piMASS. The colour code used to identify relevant SNPs and the horizontal dashed line are as defined in (A) and (B).

**Figure 3. Comparison of the marginal phenotype-SNP associations provided by GUESS and SNPTEST in the multiple traits analysis of TG-LDL-APOB.**
(To increase readability, the log₁₀(BFs) are truncated at 20). (A) Genome-wide log₁₀(BF) obtained from GUESS. Significant SNPs found associated at 5% FDR are depicted by black dots (with the SNP's name) whereas significant SNPs that are also in the top Best Model Visited are represented by red dots (with the SNP's name). (B) Genome-wide log₁₀(BF) obtained from SNPTEST. The horizontal dashed line indicates the level of log₁₀(BF) that provides strong evidence of a phenotype-SNP association with Marginal Posterior Probability of inclusion close to 1. For comparison purposes, SNPs found by GUESS are highlighted (their name is printed). SNPs with log₁₀(BF)>5 are colour coded according to the level of pairwise Pearson correlation with the closest significant GUESS SNP (see colour bar for correlation scale). (C) log₁₀(BF) signal obtained from SNPTEST in a region of chromosome 11 spanning nearly 500 Kb (116,519,739–116,845,104 bp). The horizontal dashed line and colour code used to identify relevant SNPs are as defined in (B). Top bars indicate the position of genes in the region retrieved from Ensembl R66. (D) Scatterplot of genome-wide log₁₀(BF) of TG-LDL-APOB obtained from GUESS and SNPTEST. The colour code used to identify relevant SNPs and the horizontal dashed line are as defined in (A) and (B).

**Figure 4. Receiver Operating Characteristic (ROC) curves of SNPTEST (black), SPLS (blue), MLASSO (dark green), (M)ANOVA (purple), piMASS (green) and GUESS (red) for multiple traits and single trait simulated datasets.**
For GUESS, ROC curves are obtained using the top Best Model Visited (BMV) (red star) and the Marginal Posterior Probability of Inclusion (MPPI) (solid red line). For SNPTEST, the ROC curve is calculated using log₁₀(BF) while for piMASS ROC curves are obtained using MPPI. (Average) number of SNPs retained by SPLS and MLASSO under different levels of penalization are indicated (A–B). For MANOVA Wilks (A–B) and ANOVA Kruskal (C–D), the ROC curve is derived using the SNPs declared significant over a range of FDR levels. Number of false positives (x-axis) is indicated at the top of the figure while proportion of false positives is presented at the bottom. Given the large number of predictors (273,294), false positives are truncated at 10⁻⁴ at which level a large number already occurs (27.5).
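As a rough illustration of how a truncated ROC curve of this kind can be traced from a per-SNP score (for example log₁₀(BF) or MPPI), the sketch below ranks SNPs by decreasing score and records the running count and proportion of false positives against the proportion of true positives. It is a generic construction under simple assumptions (known causal labels, as in a simulation, and no special handling of tied scores), not the procedure used to draw the figure. The function name `roc_points` and the toy data are hypothetical.

```python
def roc_points(scores, is_causal, max_fpr=1e-4):
    """Trace ROC points for a score ranking, truncated at max_fpr.

    scores:    per-SNP association scores (higher = stronger evidence).
    is_causal: per-SNP booleans, True for simulated causal SNPs.
    Returns a list of (false_positives, false_positive_rate, true_positive_rate).
    Assumption: tied scores are processed in arbitrary (sorted) order.
    """
    n_pos = sum(is_causal)
    n_neg = len(is_causal) - n_pos

    # Rank SNPs from strongest to weakest evidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    tp = fp = 0
    points = []
    for i in order:
        if is_causal[i]:
            tp += 1
        else:
            fp += 1
        fpr = fp / n_neg
        if fpr > max_fpr:
            break  # stop once the truncation level is exceeded
        points.append((fp, fpr, tp / n_pos))
    return points


# Toy example: five SNPs, two of them causal (made-up values).
scores = [9.0, 7.5, 6.0, 2.0, 1.0]
is_causal = [True, False, True, False, False]
points = roc_points(scores, is_causal, max_fpr=0.5)
```

Each returned tuple carries both the count and the proportion of false positives, mirroring the two x-axis scales described in the caption.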
