Am J Hum Genet. 2007 Nov;81(5):995-1005.
doi: 10.1086/521952. Epub 2007 Sep 19.

Fine mapping versus replication in whole-genome association studies

Geraldine M Clarke et al. Am J Hum Genet. 2007 Nov.

Abstract

Association replication studies have a poor track record and, even when successful, often claim association with different markers, alleles, and phenotypes than those reported in the primary study. It is unknown whether these outcomes reflect genuine associations or false-positive results. A greater understanding of these observations is essential for genomewide association (GWA) studies, since they have the potential to identify multiple new associations that will require external validation. Theoretically, a repeat association with precisely the same variant in an independent sample is the gold standard for replication, but testing additional variants is commonplace in replication studies. Finding different associated SNPs within the same gene or region as that originally identified is often reported as confirmatory evidence. Here, we compare the probability of replicating a gene or region under two commonly used marker-selection strategies: an "exact" approach that involves only the originally significant markers and a "local" approach that involves both the originally significant markers and others in the same region. When a region of high intermarker linkage disequilibrium (LD) is tested to replicate an initial finding of only weak association with disease, the local approach is a good strategy. Otherwise, the most powerful and efficient strategy for replication involves testing only the initially identified variants. Association with a marker other than that originally identified can occur frequently in a low-powered replication study, even in the presence of real effects, and instances of such association increase as the number of included variants increases. Our results provide a basis for the design and interpretation of GWA replication studies and point to the importance of a clear distinction between fine mapping and replication after GWA.

Figures

Figure 1.
Exact and local replication strategies. Exact strategies involve testing only those markers that exceed some significance threshold in the primary study, whereas local replication studies involve testing additional markers on the basis of genomic information, such as LD patterns, marker gaps, and gene locations (shown) or other prior hypotheses. In the local strategy, the markers tested may include some loci that were not deemed “significant” in the initial study, as well as new SNPs that were not tested initially at all.
Figure 2.
Theoretical probabilities of observing local replication in a region of 50 markers (M=50) as a function of the common value of LD between the true marker and causal alleles. The number, m, of true markers is shown above the panels. The dotted green line shows the probability of achieving a significant result at the representative marker in the original study only. The red line shows the probability of achieving an exact replication. Since original and replicate study samples are assumed to be independent, the value at the red line is the product of the value at the green line and the probability of achieving a significant result at the representative marker in a replicate study when no additional markers are tested. The gray-shaded region represents the range of replication probabilities at all possible levels of LD between marker and disease loci. The upper bound of the gray-shaded region represents the maximum probability of replication for any given level of LD between each marker and a causal locus (X-axis), which occurs when all markers in the region are independent. This upper bound is represented with a dashed black line to emphasize the fact that, when multiple markers in a given region are independent, the possible value of LD between each marker and a single causal locus is constrained, and so the upper bound cannot be attained at all levels of LD between the marker and causal loci. The lower bound of the gray-shaded region represents the minimum probability of replication for any given level of LD between each marker and a causal locus, which occurs when all markers in the region are in perfect LD with one another. This lower bound can be attained for all levels of LD between the marker and causal loci. The disease prevalence is 0.05, the GRR is 1.3, and the frequency of the high-risk allele is 0.25. In the first stage, 500,000 markers are genotyped for 3,000 cases and 3,000 controls. In the replication study, K=10 regions are identified, and markers are genotyped for 1,500 cases and 1,500 controls. To ensure an overall type I error rate of 0.05 in the replicate study, the Bonferroni-corrected rate is α=0.05/[10×(50-m+1)] when the true markers are dependent and α=0.05/(10×50) when the true markers are independent. Panel D duplicates panel A to highlight the specific region for this example in which the maximum probability of replication cannot be attained, emphasizing the implicit requirement of allelic heterogeneity (multiple disease alleles) when the r² between marker and causal alleles is sufficiently large. For these sample parameter values, this occurs for r²>0.42. (See appendix A for details on calculation of this cutoff value.)
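The two Bonferroni thresholds and the product rule for exact replication described in the figure 2 legend can be illustrated with a short sketch. The function and parameter names below are hypothetical and chosen only for illustration. The power values passed to the second function are placeholders, since the legend does not report them, and this is not the authors' own code.

```python
def bonferroni_alpha(overall_alpha, n_regions, n_markers, n_true, true_markers_dependent):
    """Per-test significance level for the replicate study.

    Follows the figure 2 legend: when the m true markers are dependent they
    count as a single effective test, giving M - m + 1 tests per region;
    when they are independent, all M markers count.
    """
    if not 1 <= n_true <= n_markers:
        raise ValueError("n_true must be between 1 and n_markers")
    if true_markers_dependent:
        tests_per_region = n_markers - n_true + 1
    else:
        tests_per_region = n_markers
    return overall_alpha / (n_regions * tests_per_region)


def exact_replication_probability(power_original, power_replicate):
    """Probability of exact replication for independent samples.

    This is the product of the probability of a significant result in the
    original study and in the replicate study at the same marker.
    """
    return power_original * power_replicate


if __name__ == "__main__":
    # Legend values: overall alpha 0.05, K = 10 regions, M = 50 markers.
    # m = 5 is an arbitrary illustrative choice.
    print(bonferroni_alpha(0.05, 10, 50, 5, true_markers_dependent=True))
    print(bonferroni_alpha(0.05, 10, 50, 5, true_markers_dependent=False))
    # Placeholder power values, not taken from the paper.
    print(exact_replication_probability(0.6, 0.8))
```

With dependent true markers the denominator shrinks as m grows, so the per-test threshold becomes less stringent than in the independent case.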
Figure 3.
Theoretical probabilities of observing local replication in a region of M=50 markers as a function of the common value of LD between m-1 true markers and causal alleles. This graph is identical to figure 2 except that, instead of all m true markers having a common value of LD between marker and causal alleles, one of the additional markers selected is independent of all other markers and is in perfect LD with the causal variant. See the figure 2 legend for more details on the graphs. Values for m are shown above the panels.
Figure 4.
Pairwise r² plot for the HapMap CEU data from release 22, April 2007. The intensity of the shading is proportional to the value of r². A, rs2241880, at 233,848,107 bp on chromosome 2, contained within a single block of LD. B, rs7517847, at 67,454,257 bp on chromosome 1, between two blocks of LD.
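For readers unfamiliar with the r² measure plotted in figure 4, the sketch below computes it from haplotype and allele frequencies using the standard definition, r² = D²/[pA(1-pA)pB(1-pB)] with D = pAB - pA·pB. The function name and the example frequencies are hypothetical. How the HapMap plot itself was generated is not stated in the legend, so this is only an illustration of the quantity, not of the authors' pipeline.

```python
def pairwise_r2(p_ab, p_a, p_b):
    """Squared correlation (r^2) between alleles at two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    """
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("r^2 is undefined when either locus is monomorphic")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denominator


if __name__ == "__main__":
    # Illustrative frequencies only, not HapMap values.
    print(pairwise_r2(0.25, 0.25, 0.25))    # alleles always co-occur
    print(pairwise_r2(0.0625, 0.25, 0.25))  # alleles are independent
```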
Figure 5.
Maximum probability of local replication as a function of the probability of exact replication in simulations designed to mimic the outcomes of replication attempts to find an association, under the assumption that rs2241880 (A) and rs7517847 (B) are causal SNPs. Each point corresponds to a single simulation. The solid black line is for reference only and indicates where the exact and maximum local replication probabilities are equal. See the “Methods” section for full details on simulations.
Figure 6.
Maximum probability that a locus exceeds the significance threshold in a replication study but is different from the locus initially identified. Each line shows results for a different number of true markers (m), as indicated, tested in a region of M=50 markers in the replicate study and corresponds to the probability that a single true marker is identified in the first study but not in the second study, even though the second study has sufficient power to detect similar nearby markers. The disease prevalence, risk, allele frequencies, sample sizes, marker designs, and type I error are as described in the figure 2 legend.
