Detecting purely epistatic multi-locus interactions by an omnibus permutation test on ensembles of two-locus analyses

BMC Bioinformatics. 2009 Sep 17;10:294.

Waranyu Wongseree¹, Anunchai Assawamakin, Theera Piroonratana, Saravudh Sinsomros, Chanin Limwongse, Nachol Chaiyaratana

Affiliation

¹ Department of Electrical Engineering, Faculty of Engineering, King Mongkut's University of Technology North Bangkok, Bangkok, Thailand. waranyu.wongseree@gmail.com

PMID: 19761607
PMCID: PMC2759961
DOI: 10.1186/1471-2105-10-294

Waranyu Wongseree et al. BMC Bioinformatics. 2009.

Abstract

Background: Purely epistatic multi-locus interactions cannot generally be detected via single-locus analysis in case-control studies of complex diseases. Recently, many two-locus and multi-locus analysis techniques have shown promise for epistasis detection. However, exhaustive multi-locus analysis requires a prohibitively large computational effort when a problem involves large-scale or genome-wide data. Furthermore, there is no explicit proof that a combination of multiple two-locus analyses can lead to the correct identification of multi-locus interactions.

Results: The proposed 2LOmb algorithm performs an omnibus permutation test on ensembles of two-locus analyses. The algorithm consists of four main steps: two-locus analysis, a permutation test, global p-value determination and a progressive search for the best ensemble. 2LOmb is benchmarked against an exhaustive two-locus analysis technique, a set association approach (SAA), a correlation-based feature selection (CFS) technique and a tuned ReliefF (TuRF) technique. The simulation results indicate that 2LOmb produces a low false-positive error. Moreover, in purely epistatic two-, three- and four-locus interaction problems, 2LOmb performs best in terms of its ability to identify all causative single nucleotide polymorphisms (SNPs) while reporting a low number of output SNPs. Interaction models constructed from the 2LOmb outputs via a multifactor dimensionality reduction (MDR) method are also included to confirm the epistasis detection. 2LOmb is subsequently applied to a type 2 diabetes mellitus (T2D) data set, which was obtained as part of the UK genome-wide genetic epidemiology study by the Wellcome Trust Case Control Consortium (WTCCC). After an initial screening for SNPs that are located within or near 372 candidate genes and exhibit no marginal single-locus effects, the T2D data set is reduced to 7,065 SNPs from 370 genes. The 2LOmb search in the reduced T2D data reveals that four intronic SNPs in PGM1 (phosphoglucomutase 1), two intronic SNPs in LMX1A (LIM homeobox transcription factor 1, alpha), two intronic SNPs in PARK2 (Parkinson disease (autosomal recessive, juvenile) 2, parkin) and three intronic SNPs in GYS2 (glycogen synthase 2 (liver)) are associated with the disease. The 2LOmb result suggests that no pair of the identified genes interacts in a way that can be described by purely epistatic two-locus interaction models. Moreover, there are no interactions between these four genes that can be described by purely epistatic multi-locus interaction models with marginal two-locus effects. The findings provide an alternative explanation for the aetiology of T2D in a UK population.

Conclusion: An omnibus permutation test on ensembles of two-locus analyses can detect purely epistatic multi-locus interactions with marginal two-locus effects. The study also reveals that SNPs from large-scale or genome-wide case-control data that are discarded because single-locus analysis detects no association can still be useful for genetic epidemiology studies.

Figures

**Figure 1**
**Outline of 2LOmb**. In this example, the algorithm takes a balanced case-control data set that consists of 400 samples and 1,000 SNPs. Each genotype is represented by an integer: 0 denotes a homozygous wild-type genotype, 1 denotes a heterozygous genotype and 2 denotes a homozygous variant (mutant) genotype. A χ² contingency table is then constructed for each pair of SNPs in the two-locus analysis. This results in a total of $\binom{1000}{2} = 499{,}500$ two-locus analyses. Thus, the Bonferroni-corrected χ² p-value for each two-locus analysis is the smaller of 499,500 × its uncorrected p-value and one. Within an ensemble, the Bonferroni-corrected χ² p-values from multiple two-locus analyses are combined via Fisher's combining function, which yields Fisher's test statistic. The raw p-value for the ensemble is obtained through a permutation test composed of 10,000 randomised permutation replicates. Since multiple ensembles may be tried while identifying the best association explanation, a global p-value is calculated to account for multiple hypothesis testing. The global p-value is estimated through the same permutation test that gives the raw p-value for each ensemble. The progressive search for the best association explanation is carried out by incrementally adding a two-SNP unit to the current best ensemble. The condition for search termination is based on both the raw p-value for the explored ensemble and the global p-value. In this example, the search is terminated after the fourth ensemble is explored because the raw p-value increases. Consequently, the best SNP set for association explanation contains SNP1, SNP2 and SNP3, and the global p-value, which accounts for the testing of four hypotheses, is p < 0.0001.
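
The arithmetic in this caption can be sketched in Python. The block below is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`chi2_sf`, `two_locus_pvalue`, `bonferroni`, `fisher_statistic`, `ensemble_statistic`, `raw_permutation_pvalue`), the 0/1 coding of controls and cases, the closed-form χ² tail probability and the `hits / n_perm` estimator are all choices made for the example. It covers the pair count, the Bonferroni correction, Fisher's combining function and the raw permutation p-value for one ensemble; the global p-value and the progressive search are not reproduced.

```python
import math
import random


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (standard closed forms)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


def two_locus_pvalue(snp_a, snp_b, labels):
    """Chi-square p-value for the (genotype pair) x (control/case) table.

    Assumes labels are 0 (control) or 1 (case) and both classes are present.
    """
    counts = {}
    for a, b, y in zip(snp_a, snp_b, labels):
        counts.setdefault((a, b), [0, 0])[y] += 1
    n = len(labels)
    n_case = sum(labels)
    col_totals = (n - n_case, n_case)
    stat = 0.0
    for cell in counts.values():
        row_total = cell[0] + cell[1]
        for y in (0, 1):
            expected = row_total * col_totals[y] / n
            stat += (cell[y] - expected) ** 2 / expected
    df = len(counts) - 1  # only observed genotype pairs contribute rows
    return chi2_sf(stat, df) if df > 0 else 1.0


def bonferroni(p, n_tests):
    """Smaller of n_tests * p and one."""
    return min(1.0, p * n_tests)


def fisher_statistic(pvalues):
    """Fisher's combining function: -2 * sum(ln p)."""
    # The floor of 1e-300 only guards against log(0) after underflow.
    return -2.0 * sum(math.log(max(p, 1e-300)) for p in pvalues)


def ensemble_statistic(genotypes, labels, pairs, n_tests):
    """Fisher statistic over the corrected p-values of the pairs in an ensemble."""
    return fisher_statistic(
        bonferroni(two_locus_pvalue(genotypes[i], genotypes[j], labels), n_tests)
        for i, j in pairs
    )


def raw_permutation_pvalue(genotypes, labels, pairs, n_perm=1000, seed=0):
    """Fraction of label permutations whose statistic reaches the observed one."""
    rng = random.Random(seed)
    n_tests = math.comb(len(genotypes), 2)  # e.g. comb(1000, 2) = 499500
    observed = ensemble_statistic(genotypes, labels, pairs, n_tests)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if ensemble_statistic(genotypes, shuffled, pairs, n_tests) >= observed:
            hits += 1
    return hits / n_perm
```

Here `genotypes` is assumed to be a list with one genotype list per SNP, and `pairs` a list of SNP index pairs such as `[(0, 1), (0, 2)]`. Because corrected p-values are capped at one, a pair with no signal contributes zero to the Fisher statistic.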

**Figure 2**
**Performance of the exhaustive two-locus analysis, SAA, CFS, TuRF and 2LOmb in the null data problem**. The results are averaged over 25 independent simulations. False detection is declared for the exhaustive two-locus analysis, SAA and 2LOmb if the p-values used as detection indicators in their results are less than 0.05. The results from the exhaustive two-locus analysis (E2LA), SAA, CFS, TuRF and 2LOmb are displayed using magenta, blue, green, red and black markers, respectively. In each chart, the horizontal axis represents the detection algorithm while the vertical axis represents the number of output SNPs reported by the algorithm. The top nine charts are displayed using a finer scale than the bottom nine charts.

**Figure 3**
**Performance of the exhaustive two-locus analysis and 2LOmb in the two-locus interaction problem**. The results are averaged over 25 independent simulations. Detection is declared for the exhaustive two-locus analysis and 2LOmb if the p-values used as detection indicators in their results are less than 0.05. The results from the exhaustive two-locus analysis (E2LA) and 2LOmb are displayed using magenta and black markers, respectively. In each chart, the horizontal axis represents the detection algorithm while the vertical axis represents the number of output SNPs reported by the algorithm. All causative SNPs are present in outputs from both the exhaustive two-locus analysis and 2LOmb in all simulations.

**Figure 4**
**Performance of SAA, CFS, TuRF and 2LOmb in the two-locus interaction problem**. The results are averaged over 25 independent simulations. Detection is declared for SAA and 2LOmb if the p-values used as detection indicators in their results are less than 0.05. The results from SAA, CFS, TuRF and 2LOmb are displayed using blue, green, red and black markers, respectively. In each chart, the horizontal axis represents the number of correctly identified causative SNPs while the vertical axis represents the number of output SNPs reported by the algorithm. Charts in which the red markers are not visible indicate that TuRF and 2LOmb perform similarly. The charts in this figure are displayed using a coarser scale than the charts in Figure 3.

**Figure 5**
**Performance of the exhaustive two-locus analysis and 2LOmb in the three-locus interaction problem**. The explanation for how the results are obtained and displayed is the same as that given in Figure 3.

**Figure 6**
**Performance of SAA, CFS, TuRF and 2LOmb in the three-locus interaction problem**. The explanation for how the results are obtained and displayed is the same as that given in Figure 4.

**Figure 7**
**Performance of the exhaustive two-locus analysis and 2LOmb in the four-locus interaction problem**. The explanation for how the results are obtained and displayed is the same as that given in Figure 3.

**Figure 8**
**Performance of SAA, CFS, TuRF and 2LOmb in the four-locus interaction problem**. The explanation for how the results are obtained and displayed is the same as that given in Figure 4.

**Figure 9**
**Prediction accuracy from the MDR analysis**. A 10-fold cross-validation strategy is applied during the accuracy evaluation. The best MDR model is located by exploring all possible SNP combinations. All erroneous SNPs, which are left over after the screening by 2LOmb, have been successfully identified. All MDR models contain the correct number of causative SNPs. In addition, the MDR cross-validation consistency is 10/10.

**Figure 10**
**Genotype distribution of two causative SNPs in a balanced case-control data set with the sample size of 800**. The left (black) bar in each cell represents the number of case samples while the right (white) bar represents the number of control samples. The cells with genotypes *AABB*, *AABb*, *AaBB*, *Aabb*, *aaBb* and *aabb* are labelled as protective genotypes while the cells with genotypes *AAbb*, *AaBb* and *aaBB* are labelled as disease-predisposing genotypes.
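
A pattern with disease-predisposing cells at *AAbb*, *AaBb* and *aaBB* can arise from a model in which neither SNP shows any single-locus effect. The sketch below illustrates this with hypothetical numbers that are not taken from the paper: a penetrance table placed on those three cells, allele frequencies of 0.5 at both loci, Hardy-Weinberg proportions and independent loci. The genotype coding (0 = *AA* or *BB*, 1 = heterozygous, 2 = *aa* or *bb*) and the function name `marginal_penetrance` are likewise assumptions.

```python
from fractions import Fraction

# Genotype frequencies under Hardy-Weinberg with allele frequency 0.5 (assumed).
GENOTYPE_FREQ = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

# Hypothetical penetrances, keyed by (locus A genotype, locus B genotype).
# Non-zero entries sit on AAbb, AaBb and aaBB only.
PENETRANCE = {
    (0, 0): Fraction(0), (0, 1): Fraction(0),    (0, 2): Fraction(1),
    (1, 0): Fraction(0), (1, 1): Fraction(1, 2), (1, 2): Fraction(0),
    (2, 0): Fraction(1), (2, 1): Fraction(0),    (2, 2): Fraction(0),
}


def marginal_penetrance(locus):
    """Disease probability per genotype at one locus, averaged over the other.

    locus = 0 gives the marginals for SNP A, locus = 1 those for SNP B.
    """
    marginals = {}
    for g in GENOTYPE_FREQ:
        total = Fraction(0)
        for h, freq in GENOTYPE_FREQ.items():
            key = (g, h) if locus == 0 else (h, g)
            total += freq * PENETRANCE[key]
        marginals[g] = total
    return marginals
```

Under these assumptions every single-locus genotype carries the same disease probability of 1/4, so a single-locus test has nothing to detect even though the two-locus table is strongly structured.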

**Figure 11**
**Linkage disequilibrium (LD) patterns of SNPs in *PGM1*, *LMX1A*, *PARK2* and *GYS2***. LD is described by D' displayed in the upper triangle and r² displayed in the lower triangle. Dark colours indicate high values while pale colours indicate low values. Distances between SNPs are given in terms of the number of base pairs. SNP1 = rs2269241, SNP2 = rs2269239, SNP3 = rs3790857, SNP4 = rs2269238, SNP5 = rs2348250, SNP6 = rs6702087, SNP7 = rs1893551, SNP8 = rs6924502, SNP9 = rs6487236, SNP10 = rs1871142 and SNP11 = rs10770836.
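
For readers unfamiliar with the two LD measures, the following sketch computes them from haplotype and allele frequencies using the standard textbook definitions. It is illustrative only: the function name `ld_measures`, its parameter names and the choice to report the absolute value of D' are assumptions, and the example frequencies are made up rather than taken from the T2D data.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (|D'|, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d == 0:
        return 0.0, 0.0
    if d > 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = abs(d) / d_max
    r_squared = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d_prime, r_squared
```

With `ld_measures(0.3, 0.3, 0.5)` the allele A occurs only on B-carrying haplotypes, so D' reaches one while r² stays below one because the allele frequencies differ. This is why the two triangles of such a plot can look quite different.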

**Figure 12**
**Interaction dendrogram produced from 11 SNPs that are chosen by 2LOmb**. The colours in the dendrogram comprise a spectrum of colours representing a transition from synergy to redundancy. Synergy denotes the situation in which the entropy-based interaction between two SNPs provides more information than the entropy-based correlation between the pair. Redundancy refers to the situation in which the entropy-based interaction between two SNPs provides less information than the entropy-based correlation between the pair [7].
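
One common way to formalise synergy and redundancy is the interaction information I(A,B;C) − I(A;C) − I(B;C), where C is the case-control status. The sketch below computes it from empirical entropies. It is an assumption that this matches the measure behind the dendrogram; the function names `entropy`, `mutual_information` and `interaction_gain` are hypothetical, and the toy data are not from the study.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy in bits of a list of hashable values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired observations."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def interaction_gain(snp_a, snp_b, status):
    """Positive values suggest synergy, negative values suggest redundancy."""
    joint = list(zip(snp_a, snp_b))
    return (mutual_information(joint, status)
            - mutual_information(snp_a, status)
            - mutual_information(snp_b, status))
```

For an XOR-like toy example, `interaction_gain([0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0])` is one bit (pure synergy), whereas two identical SNPs that each determine the status give minus one bit (pure redundancy).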

**Figure 13**
**An MDR decision table that is constructed using a balanced case-control data set with the sample size of 800**. The genotype of each sample is determined from two SNPs. The table consists of nine cells where each cell represents a unique genotype. The left (black) bar in each cell represents the number of case samples while the right (white) bar represents the number of control samples. The cells with genotypes *AABB*, *AABb*, *AAbb*, *AaBB* and *aaBB* are labelled as protective genotypes while the cells with genotypes *AaBb*, *Aabb*, *aaBb* and *aabb* are labelled as disease-predisposing genotypes.
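
The cell labelling described in this caption can be sketched as follows. This is a simplified illustration of the MDR idea rather than the MDR software itself: the function names `mdr_decision_table` and `mdr_predict`, the 0/1 coding of controls and cases, the ratio threshold of 1 for balanced data and the treatment of ties and empty cells as protective are all assumptions.

```python
from collections import defaultdict


def mdr_decision_table(snp_a, snp_b, labels, threshold=1.0):
    """Label each observed two-SNP genotype cell by its case:control ratio.

    labels are assumed to be 0 (control) or 1 (case). A cell is called
    disease-predisposing when cases exceed threshold * controls.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for a, b, y in zip(snp_a, snp_b, labels):
        counts[(a, b)][y] += 1
    table = {}
    for cell, (controls, cases) in counts.items():
        high_risk = cases > threshold * controls
        table[cell] = "disease-predisposing" if high_risk else "protective"
    return table


def mdr_predict(table, a, b, default="protective"):
    """Look up the label of a genotype pair; unseen cells get the default."""
    return table.get((a, b), default)
```

With genotypes coded 0, 1 and 2 there are at most nine cells, matching the nine-cell table in the figure. In practice the table would be built on training folds and evaluated on held-out folds, as in the 10-fold cross-validation mentioned for Figure 9.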

Cited by

Lane HY, Tsai GE, Lin E. Assessing gene-gene interactions in pharmacogenomics. Mol Diagn Ther. 2012 Feb 1;16(1):15-27. doi: 10.1007/BF03256426. PMID: 22352452. Review.
Scuderi S, La Cognata V, Drago F, Cavallaro S, D'Agata V. Alternative splicing generates different parkin protein isoforms: evidences in human, rat, and mouse brain. Biomed Res Int. 2014;2014:690796. doi: 10.1155/2014/690796. Epub 2014 Jul 16. PMID: 25136611.
La Cognata V, Iemmolo R, D'Agata V, Scuderi S, Drago F, Zappia M, Cavallaro S. Increasing the Coding Potential of Genomes Through Alternative Splicing: The Case of PARK2 Gene. Curr Genomics. 2014 Jun;15(3):203-16. doi: 10.2174/1389202915666140426003342. PMID: 24955028.
Jiang X, Neapolitan RE. Mining pure, strict epistatic interactions from high-dimensional datasets: ameliorating the curse of dimensionality. PLoS One. 2012;7(10):e46771. doi: 10.1371/journal.pone.0046771. Epub 2012 Oct 12. PMID: 23071633.
de Ridder J, Gerrits A, Bot J, de Haan G, Reinders M, Wessels L. Inferring combinatorial association logic networks in multimodal genome-wide screens. Bioinformatics. 2010 Jun 15;26(12):i149-57. doi: 10.1093/bioinformatics/btq211. PMID: 20529900.

References

1. Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273:1516–1517. doi: 10.1126/science.273.5281.1516.
2. Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari HK, Allison DB. Detection of gene × gene interactions in genome-wide association studies of human population data. Hum Hered. 2007;63:67–84. doi: 10.1159/000099179.
3. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
4. The GAIN Collaborative Research Group. New models of collaboration in genome-wide association studies: the Genetic Association Information Network. Nat Genet. 2007;39:1045–1051. doi: 10.1038/ng2127.
5. Heidema AG, Boer JMA, Nagelkerke N, Mariman ECM, van der A DL, Feskens EJM. The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genet. 2006;7:23. doi: 10.1186/1471-2156-7-23.

Grants and funding

WT_/Wellcome Trust/United Kingdom
