Efficient computation of significance levels for multiple associations in large studies of correlated data, including genomewide association studies

Frank Dudbridge¹, Bobby P C Koeleman

PMID: 15266393
PMCID: PMC1182021
DOI: 10.1086/423738

Am J Hum Genet. 2004 Sep;75(3):424-35. doi: 10.1086/423738. Epub 2004 Jul 19.

Affiliation

¹ MRC Rosalind Franklin Centre for Genomics Research, and MRC Biostatistics Unit, Cambridge, United Kingdom. frank.dudbridge@mrc-bsu.cam.ac.uk

Abstract

Large exploratory studies, including candidate-gene-association testing, genomewide linkage-disequilibrium scans, and array-expression experiments, are becoming increasingly common. A serious problem for such studies is that statistical power is compromised by the need to control the false-positive rate for a large family of tests. Because multiple true associations are anticipated, methods have been proposed that combine evidence from the most significant tests, as a more powerful alternative to individually adjusted tests. The practical application of these methods is currently limited by a reliance on permutation testing to account for the correlated nature of single-nucleotide polymorphism (SNP)-association data. On a genomewide scale, this is both very time-consuming and impractical for repeated explorations with standard marker panels. Here, we alleviate these problems by fitting analytic distributions to the empirical distribution of combined evidence. We fit extreme-value distributions for fixed lengths of combined evidence and a beta distribution for the most significant length. An initial phase of permutation sampling is required to fit these distributions, but it can be completed more quickly than a simple permutation test and need be done only once for each panel of tests, after which the fitted parameters give a reusable calibration of the panel. Our approach is also a more efficient alternative to a standard permutation test. We demonstrate the accuracy of our approach and compare its efficiency with that of permutation tests on genomewide SNP data released by the International HapMap Consortium. The estimation of analytic distributions for combined evidence will allow these powerful methods to be applied more widely in large exploratory studies.
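
The fitting step described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: S_k is taken to be the sum of the k largest test statistics, the null replicates are independent 1-df chi-square draws standing in for real permutation replicates of correlated SNP data, and a two-parameter Gumbel distribution fitted by the method of moments replaces the extreme-value fit used by the authors. All function names are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def sum_of_k_largest(stats, k):
    """S_k under the assumed definition: sum of the k largest statistics."""
    return sum(sorted(stats, reverse=True)[:k])


def null_replicates(n_tests, k, n_reps, rng):
    """Stand-in for permutation sampling (assumption: independent tests).

    Each replicate draws n_tests squared standard normals (1-df chi-square)
    and records S_k. A real study would instead permute phenotype labels so
    that the correlation between markers is preserved.
    """
    reps = []
    for _ in range(n_reps):
        stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(n_tests)]
        reps.append(sum_of_k_largest(stats, k))
    return reps


def fit_gumbel_moments(samples):
    """Method-of-moments Gumbel fit: returns (location, scale)."""
    scale = statistics.stdev(samples) * math.sqrt(6.0) / math.pi
    loc = statistics.mean(samples) - EULER_GAMMA * scale
    return loc, scale


def gumbel_p_value(x, loc, scale):
    """Upper-tail probability P(S_k >= x) under the fitted Gumbel."""
    z = (x - loc) / scale
    if z < -700.0:  # exp(-z) would overflow; the tail probability is 1
        return 1.0
    # -expm1(-t) equals 1 - exp(-t) but stays accurate for very small t
    return -math.expm1(-math.exp(-z))


rng = random.Random(0)
reps = null_replicates(n_tests=200, k=5, n_reps=500, rng=rng)
loc, scale = fit_gumbel_moments(reps)
p = gumbel_p_value(60.0, loc, scale)  # 60.0 is an arbitrary observed S_5
print(f"location={loc:.3f} scale={scale:.3f} p={p:.3g}")
```

Once the location and scale have been estimated, a new observed S_k on the same panel can be converted to a P value with `gumbel_p_value` alone, without drawing further replicates. That is the sense in which the fitted parameters act as a reusable calibration.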

Figures

**Figure 1**
Parameters of the extreme-value distribution for S_k as function of length *k*. A, Location, chromosome 18. B, Location, chromosome 21. C, Scale, chromosome 18. D, Scale, chromosome 21. E, Shape, chromosome 18. F, Shape, chromosome 21.

**Figure 2**
Quantile-quantile plot of extreme-value distribution for S_k. A, Length 10, chromosome 18. B, Length 10, chromosome 21. C, Length 100, chromosome 18. D, Length 100, chromosome 21.

**Figure 3**
Quantile-quantile plot of beta distribution for the minimum P value for S_k. A, Chromosome 18, parameters (.8032,1.3766). B, Chromosome 21, parameters (.7932,1.3423).

**Figure 4**
Effect of assumption of independence in correlated tests. Solid line shows the P value of the 95th percentile of the empirical distribution, under assumption of independent tests. Dotted line shows the P value according to the fitted extreme-value distribution.

**Figure 5**
Power of fixed-length sum compared with variable-length sum, for 10,000 tests. A, Five true associations with χ² NCP 15. B, Ten associations with NCP 11. C, Fifty associations with NCP 5. D, One hundred associations with NCP 3.
