BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S26.
doi: 10.1186/1471-2105-12-S1-S26.

The choice of null distributions for detecting gene-gene interactions in genome-wide association studies

Can Yang et al. BMC Bioinformatics. 2011.

Abstract

Background: In genome-wide association studies (GWAS), the number of single-nucleotide polymorphisms (SNPs) typically ranges between 500,000 and 1,000,000. Accordingly, detecting gene-gene interactions in GWAS is computationally challenging because it involves hundreds of billions of SNP pairs. Stage-wise strategies are often used to overcome the computational difficulty. In the first stage, fast screening methods (e.g., Tuning ReliefF) are applied to reduce the whole SNP set to a small subset. In the second stage, sophisticated modeling methods (e.g., multifactor dimensionality reduction (MDR)) are applied to the subset of SNPs to identify interesting interaction models and the corresponding interaction patterns. In the third stage, the significance of the identified interaction patterns is evaluated by hypothesis testing.

Results: In this paper, we show that this stage-wise strategy can fail to control the false positive rate if the null distribution is not appropriately chosen. This is because screening and modeling may change the null distribution used in hypothesis testing. In our simulation study, we use some popular screening methods and the popular modeling method MDR as examples to show the effect of an inappropriate choice of null distribution. To choose appropriate null distributions, we suggest using the permutation test or testing on an independent data set. We demonstrate their performance using synthetic data and a real genome-wide data set from an Age-related Macular Degeneration (AMD) study.

Conclusions: The permutation test or testing on an independent data set can help choose appropriate null distributions in hypothesis testing, which yields more reliable results in practice.

Figures

Figure 1
A toy example illustrating the effect of an inappropriate choice of null distribution. Null distribution 1 follows a χ2 distribution with df = 4 and null distribution 2 follows a χ2 distribution with df = 8. The observed χ2 value is 10, and the P-values are 0.0404 and 0.2650 for these two null distributions, respectively. Suppose P = 0.05 is the significance threshold. Then P = 0.0404 indicates a significant result, while P = 0.2650 does not. If the true null distribution is χ2 with df = 8, then using χ2 with df = 4 will give many false positive results.
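The two P-values quoted in the Figure 1 caption can be checked by hand. The sketch below assumes the two null distributions are χ2 with df = 4 and df = 8; these degrees of freedom are inferred from the reported P-values (they are the even df values that reproduce 0.0404 and 0.2650 at an observed value of 10). The function name chi2_sf_even_df is illustrative only, and the closed form it uses is valid only for even df.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    # For df = 2k: P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    terms = (half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * sum(terms)


observed = 10.0
p_df4 = chi2_sf_even_df(observed, 4)  # assumed null distribution 1
p_df8 = chi2_sf_even_df(observed, 8)  # assumed null distribution 2
print(f"{p_df4:.4f} {p_df8:.4f}")  # prints "0.0404 0.2650"
```

The same observed statistic is significant at the 0.05 level under the narrower null and not significant under the wider one, which is the point of the toy example.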
Figure 2
Null distributions affected by MDR modeling. The null distributions are estimated using 500 simulated null data sets. Each null data set contains n = 2000 samples. Upper panel: From left to right, each data set has L = 2, L = 3, L = 4 SNPs. MDR can be applied to these data sets without model search to fit the two-factor model (d = 2), the three-factor model (d = 3), and the four-factor model (d = 4). The resulting null distributions follow χ2 distributions with df = 4.84, 11.40, and 30.41, respectively. Lower panel: Each null data set contains n = 2000 samples and L = 20 SNPs. MDR is directly applied to each data set. MDR searches all possible models and cross-validation is used to assess each model. The best two-factor model (d* = 2), the best three-factor model (d* = 3), and the best four-factor model (d* = 4) are identified. Their distributions, shown from left to right, do not strictly follow χ2 distributions.
Figure 3
Null distributions affected by screening methods. The null distributions are estimated using 500 simulated null data sets. The null distributions shown in the lower panel of Figure 2 serve as the reference distributions (df = 13.76 for d* = 2, df = 29.49 for d* = 3 and df = 64.64 for d* = 4). The screening methods ReliefF, TURF and SURFSTAR are used to reduce the number of SNPs from L = 2000 to d = 20. For the remaining d = 20 SNPs, MDR is used to identify the best d*-way interactions (d* = 2, 3, 4). The resulting null distributions of these models, shown from left to right, do not strictly follow the χ2 distribution. They shift rightwards compared with the distributions in the lower panel of Figure 2.
Figure 4
Null distributions of testing on the independent data set. We generate 500 null data sets. Each data set has 2000 samples and 2000 SNPs. We divide each data set into three subsets of nearly equal size. The first is used for screening, the second for modeling and the third for hypothesis testing. Upper panel: Logistic regression (LR) is used in modeling. The degrees of freedom of the theoretical null distributions are df = 8, 26, 80 for 2-, 3-, and 4-way interaction models, respectively. The null distributions of LR match the theoretical null distributions well for the 2- and 3-way interaction models. The resulting null distribution of the 4-way interaction model follows a χ2 distribution with df = 73.18. Here df = 73.18 is smaller than the theoretical value (df = 80) because there are only about 666 samples in hypothesis testing. The number of samples is too small to accurately estimate the large degrees of freedom of the theoretical null distribution (df = 80). Lower panel: MDR is used in modeling. The obtained null distributions are roughly the same as those shown in the upper panel of Figure 2.
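The theoretical degrees of freedom quoted in the Figure 4 caption (8, 26, 80) are consistent with a full genotype model in which each SNP has three genotypes, so a d-way model has 3^d genotype combinations and 3^d - 1 degrees of freedom. The sketch below works under that assumption; the function name full_model_df is illustrative and does not come from the paper. It also shows how thin the test subset is for the 4-way model.

```python
def full_model_df(d: int, genotypes_per_snp: int = 3) -> int:
    """Degrees of freedom of a full d-way genotype model (assumed: cells - 1)."""
    return genotypes_per_snp ** d - 1


for d in (2, 3, 4):
    print(d, full_model_df(d))
# prints:
# 2 8
# 3 26
# 4 80

# About one third of the 2000 samples is left for hypothesis testing.
test_samples = 2000 // 3            # 666
cells_4way = 3 ** 4                 # 81 genotype combinations
samples_per_cell = test_samples / cells_4way
print(f"{samples_per_cell:.1f} samples per 4-way genotype cell on average")
```

With only about eight samples per genotype combination on average, some cells are likely to be empty or nearly empty, which fits the caption's explanation of why the estimated df falls below 80.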
Figure 5
Null distributions obtained using the permutation test. We conduct B = 500 permutations for the AMD data set, as described in the Methods section. The P-values obtained by the permutation test for models M1, M2, M3, M4 are 0.0040, 0.1180, 0.2880 and 0.1480, respectively. Only model M1 is significant. The claim of the significance of the high-order interaction (rs1535891, rs2828151, rs404569, rs380390) based on Model M4 in Table 2 is inappropriate.
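A minimal sketch of a permutation test of this kind follows. The key idea is that the whole selection procedure is re-run on every label-permuted data set, so the null distribution reflects the search that produced the statistic. Everything named here is an assumption for illustration: best_snp_chi2 is a hypothetical stand-in for the screening and MDR pipeline (it simply picks the SNP with the largest genotype-by-status χ2 value), and the P-value is taken as the fraction of permuted statistics at least as large as the observed one, which is consistent with P-values that are multiples of 1/500 but is not confirmed by the caption.

```python
import random
from typing import Callable, List, Sequence

Genotypes = Sequence[Sequence[int]]  # one row per sample, entries coded 0/1/2


def best_snp_chi2(genotypes: Genotypes, labels: Sequence[int]) -> float:
    """Hypothetical pipeline: largest single-SNP chi-square over all SNPs."""
    n = len(labels)
    n_case = sum(labels)
    best = 0.0
    for j in range(len(genotypes[0])):
        counts = [[0, 0], [0, 0], [0, 0]]  # genotype x (control, case)
        for row, y in zip(genotypes, labels):
            counts[row[j]][y] += 1
        chi2 = 0.0
        for g in range(3):
            row_total = sum(counts[g])
            for y, col_total in enumerate((n - n_case, n_case)):
                expected = row_total * col_total / n
                if expected > 0:
                    chi2 += (counts[g][y] - expected) ** 2 / expected
        best = max(best, chi2)  # the selection step that inflates the null
    return best


def permutation_p_value(
    statistic: Callable[[Genotypes, Sequence[int]], float],
    genotypes: Genotypes,
    labels: Sequence[int],
    n_permutations: int = 500,
    seed: int = 0,
) -> float:
    """Fraction of label permutations whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = statistic(genotypes, labels)
    shuffled: List[int] = list(labels)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any genotype-phenotype association
        if statistic(genotypes, shuffled) >= observed:
            exceed += 1
    return exceed / n_permutations


# Usage on simulated null data: 100 samples, 5 SNPs, no true association.
rng = random.Random(1)
labels = [i % 2 for i in range(100)]
genotypes = [[rng.choice((0, 1, 2)) for _ in range(5)] for _ in labels]
p = permutation_p_value(best_snp_chi2, genotypes, labels, n_permutations=500)
print(f"permutation P-value on null data: {p:.4f}")
```

Because the maximum over SNPs is recomputed for each permutation, the selected statistic is compared against other selected statistics rather than against a fixed χ2 distribution.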
Figure 6
The procedure of using independent data sets in hypothesis testing. The whole data set D is partitioned into three subsets: D(1), D(2) and D(3). A screening method is applied to D(1). After screening, only a subset of features survives, denoted as A1. Then modeling methods are applied to D(2), but only involving the features in A1. This modeling process may further select a subset of features from A1, denoted as A2. Thus, A2 ⊆ A1. For feature assessment, hypothesis testing is applied to the features in A2 using the data set D(3). The correction factor for multiple testing is calculated based on the size of A2. After feature assessment, the significant features are collected in A3 and A3 ⊆ A2. They are finally used for genetic mapping.
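The procedure in the Figure 6 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the callables screen, model and p_value are hypothetical placeholders for a screening method, a modeling method and a test on D(3), and the multiple-testing correction is assumed to be Bonferroni (threshold alpha divided by the size of A2), since the caption only says the correction depends on the size of A2.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_three_ways(n_samples: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Randomly partition sample indices into D(1), D(2), D(3) of near-equal size."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    third = n_samples // 3
    return idx[:third], idx[third:2 * third], idx[2 * third:]


def assess_on_independent_data(
    screen: Callable[[Sequence[int]], List[str]],
    model: Callable[[Sequence[int], List[str]], List[str]],
    p_value: Callable[[Sequence[int], str], float],
    d1: Sequence[int],
    d2: Sequence[int],
    d3: Sequence[int],
    alpha: float = 0.05,
) -> Tuple[List[str], List[str], List[str]]:
    """Screen on D(1), model on D(2), test on D(3); returns (A1, A2, A3)."""
    a1 = screen(d1)                       # features surviving screening
    a2 = model(d2, a1)                    # features kept by modeling, a subset of A1
    threshold = alpha / max(len(a2), 1)   # assumed Bonferroni correction over |A2|
    a3 = [f for f in a2 if p_value(d3, f) <= threshold]
    return a1, a2, a3


d1, d2, d3 = split_three_ways(2000)
print(len(d1), len(d2), len(d3))  # prints "666 666 668"
```

Because no sample in D(3) took part in screening or modeling, the statistics computed on it are not affected by the earlier selection steps, and the correction only needs to account for the features in A2.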
