Am J Hum Genet. 2006 Jul;79(1):100-12.
doi: 10.1086/505313. Epub 2006 May 30.

Bayesian graphical models for genomewide association studies

Claudio J Verzilli et al. Am J Hum Genet. 2006 Jul.

Abstract

As the extent of human genetic variation becomes more fully characterized, the research community is faced with the challenging task of using this information to dissect the heritable components of complex traits. Genomewide association studies offer great promise in this respect, but their analysis poses formidable difficulties. In this article, we describe a computationally efficient approach to mining genotype-phenotype associations that scales to the size of the data sets currently being collected in such studies. We use discrete graphical models as a data-mining tool, searching for single- or multilocus patterns of association around a causative site. The approach is fully Bayesian, allowing us to incorporate prior knowledge on the spatial dependencies around each marker due to linkage disequilibrium, which considerably reduces the number of possible graphical structures. A Markov chain Monte Carlo (MCMC) scheme is developed that yields samples from the posterior distribution of graphs conditional on the data, from which probabilistic statements about the strength of any genotype-phenotype association can be made. Using data simulated under scenarios that vary in marker density, genotype relative risk (GRR) of a causative allele, and mode of inheritance, we show that the proposed approach has better localization properties and leads to lower false-positive rates than do single-locus analyses. Finally, we present an application of our method to a quasi-synthetic data set in which data from the CYP2D6 region are embedded within simulated data on 100K single-nucleotide polymorphisms (SNPs). Analysis is quick (<5 min), and we are able to localize the causative site to a very short interval.
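The single-locus analyses used as the comparison baseline can be illustrated with a minimal sketch. The abstract does not specify which form of single-locus test was applied, so the code below assumes a Pearson χ² test on a 2×3 table of genotype counts (cases versus controls) at one marker. The function name `chi2_genotype_test` and the example counts are hypothetical and not taken from the article.

```python
import math


def chi2_genotype_test(cases, controls):
    """Pearson chi-square test on a 2x3 genotype-count table.

    cases, controls: counts for the three genotypes (e.g. AA, Aa, aa).
    Returns (statistic, p_value) with 2 degrees of freedom.
    Assumption: every genotype is observed at least once overall,
    otherwise the 2-df reference distribution does not apply.
    """
    table = [list(cases), list(controls)]
    if len(table[0]) != 3 or len(table[1]) != 3:
        raise ValueError("expected three genotype counts per group")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    if min(col_totals) == 0 or min(row_totals) == 0:
        raise ValueError("empty row or column; 2-df test not applicable")
    n = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected

    # For 2 degrees of freedom the chi-square survival function
    # has the closed form exp(-x / 2), so no external library is needed.
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical genotype counts at a single marker.
    stat, p = chi2_genotype_test(cases=(60, 30, 10), controls=(40, 40, 20))
    print(f"chi2 = {stat:.3f}, p = {p:.4f}")
```

A genomewide scan of this kind repeats the test independently at every marker, which is why it ignores the dependencies between neighbouring markers that the graphical-model approach is designed to exploit.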

Figures

Figure 1.
A decomposable graph and the junction tree representation of its cliques 𝒞1,…,𝒞5 and separators 𝒮1,…,𝒮5, with 𝒮1≡𝒮5≡∅. Nodes correspond to genotype data at nine marker loci and a disease-status indicator.
Figure 2.
Example of a current graph in the MCMC scheme. A region of six markers is depicted with two cliques containing noncontiguous markers, 𝒞1=(G1,G3,G4) and 𝒞2=(G2,G5,G6). 𝒞2 has label T=1 because it contains markers currently associated with D, 𝒮1=(G5,G6).
Figure 3.
Single-locus χ² tests (formula image), marginal posterior probability (prob) of association and Bayes factor in favor of association from the graphical modeling approach for a single replicated data set in the simulation study. The location of the disease-susceptibility locus is indicated with an asterisk (*).
Figure 4.
Mean location error (kb) as a function of mean false-positive rates over 100 replicated data sets and a dominant model. The shaded boxes above each panel identify the different scenarios, which vary in GRR at a single causative site (1.5, 2, and 2.5) and SNP marker density (1 every 5 kb, 2.5 kb, and 1.7 kb). The MAF of the high-risk variant is 0.05.
Figure 5.
Mean location error (kb) as a function of false-negative rates over 100 replicated data sets and a dominant model. The shaded boxes above each panel identify the different scenarios, which vary in GRR at a single causative site (1.5, 2, and 2.5) and SNP marker density (1 every 5 kb, 2.5 kb, and 1.7 kb). The MAF of the high-risk variant is 0.05.
Figure 6.
Mean false-positive results as a function of proportion of maximum Bayes factor or minimum P value over 100 replicated data sets and a dominant model. Different curves correspond to different window widths around a single causative site used to define a false-positive result: ±60 kb (straight lines), ±30 kb (dashed lines), and ±20 kb (dotted lines). The shaded boxes above each panel identify the different scenarios, which vary in GRR (1.5, 2, and 2.5) and SNP marker density (1 every 5 kb, 2.5 kb, and 1.7 kb). The MAF of the high-risk variant is 0.05. Triangles represent single-locus χ² analyses; circles represent the Bayesian graphical model.
Figure 7.
Mean location error (kb) as a function of mean false-positive rates across 100 replicated data sets and a dominant model. The shaded boxes above each panel identify the different scenarios, which vary in GRR at a single causative site (1.5, 2, and 2.5) and SNP marker density (1 every 5 kb, 2.5 kb, and 1.7 kb). The MAF of the high-risk variant is 0.10.
Figure 8.
Mean false-positive results as a function of proportion of maximum Bayes factor or minimum P value over 100 replicated data sets and a dominant model. Different curves correspond to different window widths around a single causative site used to define a false-positive result: ±60 kb (straight lines), ±30 kb (dashed lines), and ±20 kb (dotted lines). The shaded boxes above each panel identify the different scenarios, which vary in GRR (1.5, 2, and 2.5) and SNP marker density (1 every 5 kb, 2.5 kb, and 1.7 kb). The MAF of the high-risk variant is 0.10. Triangles indicate single-locus χ² analyses; circles indicate the Bayesian graphical model.
Figure 9.
Trace plots of the number of cliques in a graph corresponding to a single simulation replicate, from two separate MCMC runs. The initial clique size is 1 (solid line) or 8 (dotted line).
Figure 10.
Bayes factors in favor of association and single-locus formula image of association for a synthetic data set composed of 100K SNPs and embedded real data from the CYP2D6 gene region. The location of CYP2D6 is indicated by an asterisk (*) and dashed vertical line on the two X-axes in each panel.
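Several captions above report a Bayes factor in favor of association alongside a marginal posterior probability. The relation between the two can be sketched as follows, using the standard identity that a Bayes factor is the ratio of posterior odds to prior odds. This is a generic illustration and not necessarily how the article computes its Bayes factors; the function name `bayes_factor` and the numeric inputs are hypothetical.

```python
def bayes_factor(posterior_prob, prior_prob):
    """Bayes factor in favor of association at a marker.

    Computed as posterior odds divided by prior odds.
    posterior_prob: marginal posterior probability of association,
        e.g. the fraction of MCMC samples in which the marker is
        linked to disease status (an assumption about how it is estimated).
    prior_prob: prior probability of association for that marker.
    """
    if not 0.0 < prior_prob < 1.0:
        raise ValueError("prior_prob must lie strictly between 0 and 1")
    if not 0.0 <= posterior_prob <= 1.0:
        raise ValueError("posterior_prob must lie between 0 and 1")
    if posterior_prob == 1.0:
        # Every sample supported association; the odds are unbounded.
        return float("inf")
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds


if __name__ == "__main__":
    # Hypothetical values: a marker with a small prior probability of
    # association that is selected in half of the posterior samples.
    print(bayes_factor(posterior_prob=0.5, prior_prob=0.01))
```

Reporting the Bayes factor rather than the raw posterior probability removes the direct dependence on the prior odds, which makes markers easier to compare when the prior probability of association is small.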

References

Web Resource

    1. C.J.V.’s Web site, http://homepages.lshtm.ac.uk/~encdcver/ (for R package Graphminer)
