BioData Min. 2016 Apr 6;9:14.
doi: 10.1186/s13040-016-0093-5. eCollection 2016.

Detecting gene-gene interactions using a permutation-based random forest method

Jing Li et al. BioData Min.

Abstract

Background: Identifying gene-gene interactions is essential for understanding disease susceptibility and for detecting the genetic architectures underlying complex diseases. Here, we aimed to develop a permutation-based methodology that relies on a machine learning method, random forest (RF), to detect gene-gene interactions. Our approach, called permuted random forest (pRF), identifies the top interacting single nucleotide polymorphism (SNP) pairs by estimating how much the classification performance of a random forest model changes when pairwise interactions are removed.

Results: We systematically tested our approach in a simulation study using datasets that varied in genetic constraints, including heritability, number of SNPs, and sample size. Our methodology showed high success rates in detecting the interacting SNP pair. We also applied our approach to two bladder cancer datasets, where it produced results consistent with well-studied methodologies such as multifactor dimensionality reduction (MDR) and statistical epistasis network (SEN). Furthermore, we built permuted random forest networks (PRFN), in which nodes represent SNPs and edges indicate interactions.

Conclusions: We successfully developed a scale-invariant methodology to detect pure gene-gene interactions based on permutation strategies and the machine learning method random forest. This methodology shows great potential for detecting gene-gene interactions and studying the underlying genetic architectures in a scale-free way, which could help uncover the mechanisms of complex diseases.

Keywords: GWAS; Machine learning; Random forest; Scale invariant.


Figures

Fig. 1
Overview of the permuted Random Forest (pRF). Panel a shows the original dataset with all the SNP information (0, 1 or 2) and class (case-control status). Each row represents a sample; three different colors in the SNP columns indicate different genotypes, and two colors in the class column indicate case-control status. b shows the first permutation framework, which keeps the SNPs' main effects: cases and controls are separated, and the two selected SNP columns are each shuffled independently within each class. c shows the second permutation framework, which keeps both the SNPs' interaction and their main effects: cases and controls are separated, and the two selected SNPs are shuffled together within each class so that their genotype combinations are preserved. RF is trained on the original dataset and tested on the datasets produced by the two permutation schemes. Error rates are calculated by averaging the classification errors across all samples. The same process is repeated 10 times and the error rates are averaged over the 10 permutation results. The average classification error from the first permutation framework is named E1, and the average classification error from the second permutation framework is named E2. The whole process is repeated for all pairs of SNPs, and the differences in average error rates (ΔE = E1 - E2) are calculated and ranked to identify the top candidates
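The two permutation schemes and the ΔE score described in the Fig. 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of NumPy, and the `predict` callable are all hypothetical. `predict` stands in for any classifier already trained on the original data, for example the `predict` method of a fitted scikit-learn `RandomForestClassifier`.

```python
import numpy as np


def permute_keep_main_effects(X, y, i, j, rng):
    """Scheme b: shuffle SNP columns i and j independently within each class.

    Per-class genotype frequencies of each SNP are preserved (main effects),
    but the pairing between the two SNPs is broken (interaction removed).
    """
    Xp = X.copy()
    for label in np.unique(y):
        rows = np.flatnonzero(y == label)
        Xp[rows, i] = X[rng.permutation(rows), i]
        Xp[rows, j] = X[rng.permutation(rows), j]
    return Xp


def permute_keep_interaction(X, y, i, j, rng):
    """Scheme c: shuffle SNP columns i and j together within each class.

    The same row order is applied to both columns, so genotype combinations
    (interaction plus main effects) are preserved.
    """
    Xp = X.copy()
    for label in np.unique(y):
        rows = np.flatnonzero(y == label)
        src = rng.permutation(rows)
        Xp[rows, i] = X[src, i]
        Xp[rows, j] = X[src, j]
    return Xp


def delta_error(predict, X, y, i, j, n_perm=10, seed=0):
    """Return E1 - E2 for the SNP pair (i, j), averaged over n_perm permutations.

    `predict` is a hypothetical callable mapping a genotype matrix to
    predicted class labels; it is assumed to be trained on the original data.
    """
    rng = np.random.default_rng(seed)
    e1 = np.mean([
        np.mean(predict(permute_keep_main_effects(X, y, i, j, rng)) != y)
        for _ in range(n_perm)
    ])
    e2 = np.mean([
        np.mean(predict(permute_keep_interaction(X, y, i, j, rng)) != y)
        for _ in range(n_perm)
    ])
    return e1 - e2
```

Under these assumptions, calling `delta_error` for every pair of SNP columns and sorting the results in descending order would give a ranking of candidate pairs analogous to the one the caption describes; a larger ΔE means that breaking the pairing between the two SNPs hurts classification more.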
Fig. 2
Statistical epistasis network (SEN) and permuted random forest networks (PRFN). a shows the largest connected component of the statistical epistasis network, which includes 39 SNPs. The largest connected component was divided into three clusters. Permuted Random Forest (pRF) was applied to the SNPs within each of the three clusters separately. b, c and d show the PRFNs built from each cluster. The width of each edge is proportional to the strength of the interaction, measured as the difference in error rates given by our method. The cut-off for the SEN was an entropy value of 0.013. Each PRFN was built using the same number of edges as the corresponding cluster in the SEN
Fig. 3
Characterization of newly identified interacting SNP pairs using GIANT. Network filters were set to a minimum relationship confidence of 0.8 and a maximum of 5 genes. Interactions between the gene pairs CCL5 and PARP4, MBD2 and GSTM3, and BCL6 and XPC were characterized using GIANT: a shows the network of CCL5 and PARP4; b shows MBD2 and GSTM3; c shows BCL6 and XPC
