BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S26.

doi: 10.1186/1471-2105-12-S1-S26.

The choice of null distributions for detecting gene-gene interactions in genome-wide association studies

Can Yang¹, Xiang Wan, Zengyou He, Qiang Yang, Hong Xue, Weichuan Yu

PMID: 21342556
PMCID: PMC3044281
DOI: 10.1186/1471-2105-12-S1-S26

Can Yang et al. BMC Bioinformatics. 2011.

Affiliation

¹ Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong. eeyang@ust.hk

Abstract

Background: In genome-wide association studies (GWAS), the number of single-nucleotide polymorphisms (SNPs) typically ranges between 500,000 and 1,000,000. Accordingly, detecting gene-gene interactions in GWAS is computationally challenging because it involves hundreds of billions of SNP pairs. Stage-wise strategies are often used to overcome the computational difficulty. In the first stage, fast screening methods (e.g., Tuning ReliefF) are applied to reduce the whole SNP set to a small subset. In the second stage, sophisticated modeling methods (e.g., multifactor-dimensionality reduction (MDR)) are applied to the subset of SNPs to identify interesting interaction models and the corresponding interaction patterns. In the third stage, the significance of the identified interaction patterns is evaluated by hypothesis testing.
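
The "hundreds of billions" figure follows from counting unordered pairs, m(m − 1)/2 for m SNPs. A minimal check of that arithmetic (the function name `snp_pairs` is illustrative, not from the paper):

```python
from math import comb


def snp_pairs(num_snps: int) -> int:
    """Number of unordered SNP pairs among num_snps SNPs."""
    return comb(num_snps, 2)


# The two ends of the SNP range quoted in the Background paragraph.
for m in (500_000, 1_000_000):
    print(f"{m:>9,} SNPs -> {snp_pairs(m):,} pairs")
```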

Results: In this paper, we show that this stage-wise strategy can fail to control the false positive rate if the null distribution is not chosen appropriately. This is because screening and modeling may change the null distribution used in hypothesis testing. In our simulation study, we use some popular screening methods and the popular modeling method MDR as examples to show the effect of an inappropriate choice of null distribution. To choose appropriate null distributions, we suggest using the permutation test or testing on an independent data set. We demonstrate their performance using synthetic data and a real genome-wide data set from an age-related macular degeneration (AMD) study.

Conclusions: The permutation test or testing on an independent data set can help in choosing appropriate null distributions for hypothesis testing, which provides more reliable results in practice.

Figures

**Figure 1**
**A toy example illustrating the effect of the inappropriate choice of null distributions.** Null distribution 1 follows the χ² distribution with df = 4 and null distribution 2 follows the χ² distribution with df = 8. The observed χ² value is 10, and the P-values are 0.0404 and 0.2650 for these two null distributions, respectively. Suppose P = 0.05 is the threshold of hypothesis testing. Then P = 0.0404 indicates a significant result, while P = 0.2650 does not. If the true null distribution is χ² with df = 8, then the use of χ² with df = 4 will give many false positive results.
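
The two P-values quoted in this caption are what χ² distributions with df = 4 and df = 8 give for an observed value of 10. The sketch below reproduces them using the closed-form upper tail available for even df; the function name `chi2_sf_even_df` is an illustrative choice, and only the Python standard library is assumed.

```python
from math import exp


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m the tail has the closed form
    exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    term, total = 1.0, 0.0
    for k in range(df // 2):
        total += term            # add (x/2)^k / k!
        term *= half / (k + 1)   # advance to the next term of the series
    return exp(-half) * total


observed = 10.0
p_df4 = chi2_sf_even_df(observed, 4)
p_df8 = chi2_sf_even_df(observed, 8)
print(f"df = 4: P = {p_df4:.4f}")
print(f"df = 8: P = {p_df8:.4f}")
```

The same observed statistic falls on opposite sides of the 0.05 threshold depending only on which null distribution is assumed.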

**Figure 2**
**Null distributions affected by MDR modeling.** The null distributions are estimated using 500 simulated null data sets. Each null data set contains n = 2000 samples. **Upper panel:** From left to right, each data set has L = 2, L = 3, L = 4 SNPs. MDR can be applied to these data sets without model search to fit the two-factor model (d = 2), the three-factor model (d = 3), and the four-factor model (d = 4). The resulting null distributions follow χ² distributions with df = 4.84, 11.40 and 30.41, respectively. **Lower panel:** Each null data set contains n = 2000 samples and L = 20 SNPs. MDR is directly applied to each data set. MDR searches all possible models and cross-validation is used to assess each model. The best two-factor model (d* = 2), the best three-factor model (d* = 3), and the best four-factor model (d* = 4) are identified. Their distributions, shown from left to right, do not strictly follow χ² distributions.
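
A small simulation can illustrate why searching over models changes the null distribution. The sketch below is a simplified stand-in and not MDR: it uses a Pearson χ² statistic on the 9 × 2 table of joint genotypes versus case/control status, and compares the statistic of one pre-specified SNP pair with the best statistic found by searching all pairs. All function names, the sample sizes, and the minor allele frequency of 0.5 are illustrative assumptions. Because the mean of a χ² distribution equals its df, the sample mean is used here as a rough df estimate.

```python
import random
from itertools import combinations


def pearson_chi2(cells, labels):
    """Pearson chi-square for a (genotype cell) x (control/case) table."""
    counts = {}
    for c, y in zip(cells, labels):
        counts.setdefault(c, [0, 0])[y] += 1
    n, n_case = len(labels), sum(labels)
    col_totals = (n - n_case, n_case)
    stat = 0.0
    for row in counts.values():
        for j in (0, 1):
            expected = sum(row) * col_totals[j] / n
            if expected > 0:
                stat += (row[j] - expected) ** 2 / expected
    return stat


def simulate_null(n, num_snps, rng):
    """Genotypes in Hardy-Weinberg proportions (MAF 0.5), labels independent."""
    genotypes = [[rng.choice((0, 1, 1, 2)) for _ in range(n)]
                 for _ in range(num_snps)]
    labels = [rng.randint(0, 1) for _ in range(n)]
    return genotypes, labels


def pair_stat(genotypes, labels, i, j):
    """Statistic for the joint genotype of SNPs i and j (9 possible cells)."""
    cells = [3 * a + b for a, b in zip(genotypes[i], genotypes[j])]
    return pearson_chi2(cells, labels)


def null_means(reps=100, n=500, num_snps=6, seed=0):
    """Mean null statistic for a fixed pair and for the best pair found by search."""
    rng = random.Random(seed)
    fixed, best = [], []
    for _ in range(reps):
        g, y = simulate_null(n, num_snps, rng)
        stats = {p: pair_stat(g, y, *p)
                 for p in combinations(range(num_snps), 2)}
        fixed.append(stats[(0, 1)])       # no search: pair chosen in advance
        best.append(max(stats.values()))  # search: best of all pairs
    return sum(fixed) / reps, sum(best) / reps


mean_fixed, mean_best = null_means()
print(f"fixed pair: mean statistic = {mean_fixed:.2f} (theory: df = 8)")
print(f"best pair : mean statistic = {mean_best:.2f}")
```

Even though every data set is null, the best-of-search statistic has a larger mean than the pre-specified one, so referring it to the df = 8 distribution would overstate significance.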

**Figure 3**
**Null distributions affected by screening methods.** The null distributions are estimated using 500 simulated null data sets. The null distributions shown in the lower panel of Figure 2 serve as the reference distributions (df = 13.76 for d* = 2, df = 29.49 for d* = 3 and df = 64.64 for d* = 4). The screening methods ReliefF, TURF and SURFSTAR are used to reduce the number of SNPs from L = 2000 to d = 20. For the remaining d = 20 SNPs, MDR is used to identify the best d*-way interactions (d* = 2, 3, 4). The resulting null distributions of these models, shown from left to right, do not strictly follow the χ² distribution. They are shifted to the right compared with the distributions in the lower panel of Figure 2.

**Figure 4**
**Null distributions of testing on the independent data set.** We generate 500 null data sets. Each data set has 2000 samples and 2000 SNPs. We divide each data set into three subsets of nearly equal size. The first is used for screening, the second for modeling and the third for hypothesis testing. **The upper panel:** Logistic regression (LR) is used in modeling. The degrees of freedom of the theoretical null distributions are df = 8, 26 and 80 for the 2-, 3- and 4-way interaction models, respectively. The null distributions of LR match the theoretical null distributions well for the 2- and 3-way interaction models. The resulting null distribution of the 4-way interaction model follows a χ² distribution with df = 73.18. This is smaller than the theoretical value (df = 80) because only about 666 samples are available for hypothesis testing, too few to accurately estimate a null distribution with such a large number of degrees of freedom. **The lower panel:** MDR is used in modeling. The obtained null distributions are roughly the same as those shown in the upper panel of Figure 2.

**Figure 5**
**Null distributions obtained using the permutation test.** We conduct B = 500 permutations for the AMD data set, as described in the Methods section. The P-values obtained by the permutation test for models M₁, M₂, M₃, M₄ are 0.0040, 0.1180, 0.2880 and 0.1480, respectively. Only model M₁ is significant. The claim of the significance of the high-order interaction (rs1535891, rs2828151, rs404569, rs380390) based on Model M₄ in Table 2 is inappropriate.
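
The key point of a permutation test in this setting is that the whole search is repeated on every permuted data set, so the null distribution reflects the selection. The sketch below is a generic illustration under stated assumptions, not the authors' pipeline: the "search" is a best-pair Pearson χ² scan standing in for screening plus modeling, the data are simulated, and the P-value is taken as the fraction of permutations whose statistic reaches the observed one (an assumption that is consistent with the reported values being multiples of 1/500). All function names are hypothetical.

```python
import random
from itertools import combinations


def pearson_chi2(cells, labels):
    """Pearson chi-square for a (genotype cell) x (control/case) table."""
    counts = {}
    for c, y in zip(cells, labels):
        counts.setdefault(c, [0, 0])[y] += 1
    n, n_case = len(labels), sum(labels)
    col_totals = (n - n_case, n_case)
    stat = 0.0
    for row in counts.values():
        for j in (0, 1):
            expected = sum(row) * col_totals[j] / n
            if expected > 0:
                stat += (row[j] - expected) ** 2 / expected
    return stat


def best_pair_stat(genotypes, labels):
    """Search step: largest statistic over all SNP pairs."""
    best = 0.0
    for i, j in combinations(range(len(genotypes)), 2):
        cells = [3 * a + b for a, b in zip(genotypes[i], genotypes[j])]
        best = max(best, pearson_chi2(cells, labels))
    return best


def permutation_p_value(genotypes, labels, num_perm=200, seed=0):
    """Shuffle labels, rerun the full search each time, compare with observed."""
    rng = random.Random(seed)
    observed = best_pair_stat(genotypes, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(num_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype association
        if best_pair_stat(genotypes, shuffled) >= observed:
            exceed += 1
    return exceed / num_perm


rng = random.Random(1)
n, num_snps = 300, 4
genotypes = [[rng.choice((0, 1, 1, 2)) for _ in range(n)]
             for _ in range(num_snps)]
# Null phenotype: independent of all genotypes.
null_labels = [rng.randint(0, 1) for _ in range(n)]
# Artificial interaction: case exactly when SNP 0 and SNP 1 genotypes match.
signal_labels = [1 if a == b else 0
                 for a, b in zip(genotypes[0], genotypes[1])]

p_null = permutation_p_value(genotypes, null_labels)
p_signal = permutation_p_value(genotypes, signal_labels)
print(f"null phenotype:   P = {p_null:.3f}")
print(f"signal phenotype: P = {p_signal:.3f}")
```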

**Figure 6**
**The procedure of using independent data sets in hypothesis testing.** The whole data set D is partitioned into three subsets: D⁽¹⁾, D⁽²⁾ and D⁽³⁾. A screening method is applied to D⁽¹⁾. After screening, only a subset of features survives, denoted as A₁. Then modeling methods are applied to D⁽²⁾, but only involving the features in A₁. This modeling process may further select a subset of features from A₁, denoted as A₂. Thus, A₂ ⊂ A₁. For feature assessment, hypothesis testing is applied to the features in A₂ using the data set D⁽³⁾. The correction factor for multiple testing is calculated based on the size of A₂. After feature assessment, the significant features are collected in A₃ and A₃ ⊂ A₂. They are finally used for genetic mapping.
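
The three-way split described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: screening is a single-SNP Pearson χ² ranking, modeling keeps the best few SNP pairs, the test on D⁽³⁾ uses the 9 × 2 joint-genotype table (df = 8 when all cells are observed), and the Bonferroni factor is taken as the number of candidate pairs carried into D⁽³⁾. The function names and the parameters `keep` and `top_pairs` are hypothetical.

```python
import random
from itertools import combinations
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, df):
    """P(X > x) for a chi-square variable with integer df (1.0 if df <= 0)."""
    if df <= 0:
        return 1.0
    half = x / 2.0
    q, nu = (erfc(sqrt(half)), 1) if df % 2 else (exp(-half), 2)
    while nu < df:  # recurrence from df = nu to df = nu + 2
        q += half ** (nu / 2) * exp(-half) / gamma(nu / 2 + 1)
        nu += 2
    return q


def chi2_test(cells, labels):
    """Pearson chi-square statistic and df for a (cell) x (control/case) table."""
    counts = {}
    for c, y in zip(cells, labels):
        counts.setdefault(c, [0, 0])[y] += 1
    n, n_case = len(labels), sum(labels)
    col_totals = (n - n_case, n_case)
    stat = 0.0
    for row in counts.values():
        for j in (0, 1):
            expected = sum(row) * col_totals[j] / n
            if expected > 0:
                stat += (row[j] - expected) ** 2 / expected
    return stat, len(counts) - 1


def subset(values, idx):
    return [values[i] for i in idx]


def pair_cells(genotypes, i, j, idx):
    return [3 * genotypes[i][s] + genotypes[j][s] for s in idx]


def staged_test(genotypes, labels, keep=5, top_pairs=3, seed=0):
    """Screen on D1, model on D2, test on D3; returns [(pair, corrected P)]."""
    order = list(range(len(labels)))
    random.Random(seed).shuffle(order)
    third = len(order) // 3
    d1, d2, d3 = order[:third], order[third:2 * third], order[2 * third:]

    # Stage 1 (D1): keep SNPs with the largest single-SNP statistics -> A1.
    y1 = subset(labels, d1)
    score = lambda s: chi2_test(subset(genotypes[s], d1), y1)[0]
    a1 = sorted(range(len(genotypes)), key=score, reverse=True)[:keep]

    # Stage 2 (D2): rank pairs within A1 and keep the best few -> A2.
    y2 = subset(labels, d2)
    pair_score = lambda p: chi2_test(pair_cells(genotypes, *p, d2), y2)[0]
    a2 = sorted(combinations(sorted(a1), 2), key=pair_score,
                reverse=True)[:top_pairs]

    # Stage 3 (D3): test on untouched data, Bonferroni-corrected over |A2|.
    y3 = subset(labels, d3)
    results = []
    for pair in a2:
        stat, df = chi2_test(pair_cells(genotypes, *pair, d3), y3)
        results.append((pair, min(1.0, chi2_sf(stat, df) * len(a2))))
    return results


rng = random.Random(2)
n, num_snps = 900, 30
genotypes = [[rng.choice((0, 1, 1, 2)) for _ in range(n)]
             for _ in range(num_snps)]
labels = [rng.randint(0, 1) for _ in range(n)]  # null phenotype
results = staged_test(genotypes, labels)
for pair, p in results:
    print(pair, f"corrected P = {p:.3f}")
```

Because D⁽³⁾ plays no part in choosing the candidates, the standard χ² reference distribution remains valid at the testing stage, at the cost of using only a third of the samples there.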
