BMC Bioinformatics. 2021 Apr 7;22(1):180.

doi: 10.1186/s12859-021-04107-6.

Optimized permutation testing for information theoretic measures of multi-gene interactions

James M Kunert-Graf¹, Nikita A Sakhanenko², David J Galas²

Affiliations

¹ Pacific Northwest Research Institute, 720 Broadway, Seattle, WA, 98122, USA. jkunert@pnri.org.
² Pacific Northwest Research Institute, 720 Broadway, Seattle, WA, 98122, USA.

PMID: 33827420
PMCID: PMC8028212
DOI: 10.1186/s12859-021-04107-6

James M Kunert-Graf et al. BMC Bioinformatics. 2021.

Abstract

Background: Permutation testing is often considered the "gold standard" for multi-test significance analysis, as it is an exact test requiring few assumptions about the distribution being computed. However, it can be computationally very expensive, particularly in its naive form in which the full analysis pipeline is re-run after permuting the phenotype labels. This can become intractable in multi-locus genome-wide association studies (GWAS), in which the number of potential interactions to be tested is combinatorially large.

Results: In this paper, we develop an approach for permutation testing in multi-locus GWAS, specifically focusing on SNP-SNP-phenotype interactions using multivariable measures that can be computed from frequency count tables, such as those based on information theory. We find that the computational bottleneck in this process is the construction of the count tables themselves, and that this step can be eliminated at each iteration of the permutation testing by transforming the count tables directly. This leads to a speed-up by a factor of over 10³ for a typical permutation test compared to the naive approach. Additionally, the computation time of this approach does not depend on the number of samples, making it suitable for datasets with large numbers of samples.
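The idea of transforming count tables directly can be illustrated with a minimal sketch. It is not taken from the authors' repository: the function names, the toy data, and the choice of multivariate hypergeometric sampling are assumptions made here for illustration, for the case of a binary phenotype and genotypes coded 0/1/2. The reasoning is that shuffling phenotype labels leaves both the genotype-pair totals and the total number of cases unchanged, so the permuted case counts per cell can be drawn from the observed table alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two SNPs coded 0/1/2 and a binary phenotype.
n = 1000
snp_a = rng.integers(0, 3, size=n)
snp_b = rng.integers(0, 3, size=n)
phenotype = rng.integers(0, 2, size=n)


def count_table(a, b, y):
    """Build the 3 x 3 x 2 table of genotype-genotype-phenotype counts."""
    table = np.zeros((3, 3, 2), dtype=np.int64)
    np.add.at(table, (a, b, y), 1)
    return table


def naive_permuted_table(a, b, y, rng):
    """Naive approach: shuffle the phenotype labels, then recount (O(n))."""
    return count_table(a, b, rng.permutation(y))


def direct_permuted_tables(table, n_perm, rng):
    """Sample permuted tables from the observed table alone.

    Assumption of this sketch: with fixed genotype-pair totals and a fixed
    number of cases, the case counts per cell under label shuffling follow
    a multivariate hypergeometric distribution.
    """
    cell_totals = table.sum(axis=2).ravel()   # individuals per genotype pair
    n_cases = int(table[:, :, 1].sum())       # individuals with phenotype 1
    cases = rng.multivariate_hypergeometric(cell_totals, n_cases, size=n_perm)
    controls = cell_totals - cases
    return np.stack([controls, cases], axis=-1).reshape(n_perm, 3, 3, 2)


observed = count_table(snp_a, snp_b, phenotype)
permuted = direct_permuted_tables(observed, n_perm=1000, rng=rng)
```

In this sketch the cost of each sampled table depends on the number of cells (18 here) rather than on the number of individuals, which is the property the abstract describes.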

Conclusions: The proliferation of large-scale datasets with genotype data for hundreds of thousands of individuals enables new and more powerful approaches for the detection of multi-locus genotype-phenotype interactions. Our approach significantly improves the computational tractability of permutation testing for these studies. Moreover, our approach is insensitive to the large number of samples in these modern datasets. The code for performing these computations and replicating the figures in this paper is freely available at https://github.com/kunert/permute-counts.

Keywords: Information theory; Multi-locus GWAS; Multivariable interactions; Permutation testing.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
(a) Computation time as a function of the number of permutations $N_p$, for a synthetic dataset with a fixed number of individuals $n = 10{,}000$ and 100k SNPs. Both direct permutations (in blue) and our method (in orange) are $O(N_p)$ (note that the horizontal axis is logarithmic, and the best fit lines plotted here are indeed linear). Our method is faster by a factor of over $10^{3}$ per permutation. (b) Computation time as a function of the number of individuals $n$, for a synthetic dataset with a fixed number of permutations $N_p = 20$ and 100k SNPs. Direct permutation is $O(n)$ but our approach is $O(1)$ (i.e. computation time does not depend on the number of samples for this approach)

**Fig. 2**
Using the simulated data described in Sect. , we generated 1,000,000 permuted count tables using both the naive method of directly permuting the phenotype labels and using our approach. The distributions of the count table elements $c_{i j 0}^{*}$ are plotted here, with the direct permutation result shown in blue and our method shown in red. The plot consists almost entirely of the purple overlapping region, as there is almost no visible difference between the distributions

**Fig. 3**
The permuted count tables from Sect. can be used to calculate the joint entropies, from which we can calculate any information theoretic measure which is a function of the entropies. Here we calculate the multi-information $\Omega$ using both the count tables generated by direct permutations and by our method, with the resulting distributions being nearly identical
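As a rough illustration of how an entropy-based measure can be read off a count table, the sketch below computes the multi-information under the standard definition $\Omega = H(X) + H(Y) + H(Z) - H(X, Y, Z)$, which is assumed here because this record does not spell the formula out. The helper names `entropy` and `multi_information` are hypothetical, and the table is a plain nested list indexed as `table[i][j][k]`.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of the distribution given by a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def multi_information(table):
    """Omega = H(X) + H(Y) + H(Z) - H(X, Y, Z) for a nested-list count table."""
    nx, ny, nz = len(table), len(table[0]), len(table[0][0])
    # Flatten the table for the joint entropy.
    joint = [table[i][j][k]
             for i in range(nx) for j in range(ny) for k in range(nz)]
    # Marginal counts for each of the three variables.
    x = [sum(table[i][j][k] for j in range(ny) for k in range(nz))
         for i in range(nx)]
    y = [sum(table[i][j][k] for i in range(nx) for k in range(nz))
         for j in range(ny)]
    z = [sum(table[i][j][k] for i in range(nx) for j in range(ny))
         for k in range(nz)]
    return entropy(x) + entropy(y) + entropy(z) - entropy(joint)


# Three independent, uniform binary variables.
independent = [[[10, 10], [10, 10]], [[10, 10], [10, 10]]]
# Three binary variables that always take the same value.
identical = [[[50, 0], [0, 0]], [[0, 0], [0, 50]]]

omega_independent = multi_information(independent)
omega_identical = multi_information(identical)
```

Applying the same function to each permuted table gives a null distribution of $\Omega$ against which the observed value can be compared.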

**Fig. 4**
Quantile–quantile plot of p values from Epps–Singleton tests comparing the two distributions. The null hypothesis is that the two distributions are identical. Under the null hypothesis, the p value is uniformly distributed and we would expect the Q–Q plot to be linear along the diagonal, which is what we observe. The count table distributions generated from each method are indistinguishable via this test

Grants and funding

U01HL126496/HL/NHLBI NIH HHS/United States
