PLoS One. 2014 Apr 2;9(4):e93379.
doi: 10.1371/journal.pone.0093379. eCollection 2014.

Exploiting SNP correlations within random forest for genome-wide association studies

Vincent Botta et al. PLoS One. 2014.

Abstract

The primary goal of genome-wide association studies (GWAS) is to discover variants that could lead, in isolation or in combination, to a particular trait or disease. Standard approaches to GWAS, however, are usually based on univariate hypothesis tests and therefore can account neither for correlations due to linkage disequilibrium nor for combinations of several markers. To discover and leverage such potential multivariate interactions, we propose in this work an extension of the Random Forest algorithm tailored for structured GWAS data. In terms of risk prediction, we show empirically on several GWAS datasets that the proposed T-Trees method significantly outperforms both the original Random Forest algorithm and standard linear models, thereby suggesting the actual existence of multivariate non-linear effects due to the combinations of several SNPs. We also demonstrate that variable importances as derived from our method can help identify relevant loci. Finally, we highlight the strong impact that quality control procedures may have, both in terms of predictive power and loci identification. Variable importance results and T-Trees source code are all available at www.montefiore.ulg.ac.be/~botta/ttrees/ and github.com/0asa/TTree-source respectively.
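The univariate hypothesis tests that the abstract contrasts with T-Trees can be illustrated with a per-SNP Fisher exact test on a 2x2 table of allele counts (cases vs. controls, minor vs. major allele). The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `fisher_exact_two_sided` and the example counts are hypothetical, and the two-sided p-value uses the common convention of summing all tables no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Hypothetical layout (an assumption, not taken from the paper):
        rows    = cases, controls
        columns = minor-allele count, major-allele count
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)   # smallest feasible top-left count
    hi = min(row1, col1)       # largest feasible top-left count

    # Sum the probabilities of every table at most as likely as the observed
    # one; the small relative tolerance guards against floating-point ties.
    total = sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


if __name__ == "__main__":
    # Toy counts for illustration only; the p-value here is 34/70.
    print(fisher_exact_two_sided(3, 1, 1, 3))
```

Such a test scores one SNP at a time, which is why it cannot by itself capture the combinations of markers that the abstract argues for.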

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. A closer look into a T-Tree test-node.
Group 1 is tested. Of the SNPs in this group, three are used by the weak learner. Red (resp. green) shows the probability of being a case (resp. control) as estimated by the weak learner.
Figure 2. Group and variable importances for the two novel candidate regions for Crohn's disease.
Regions 2p12 (top) and 7q31 (bottom), as found by T-Trees on formula image. First row: SNP and block importances. Second row: univariate (Fisher) p-values and haplotype p-values as derived from the case/control omnibus test with formula image degrees of freedom, where formula image corresponds to the number of common haplotypes (a haplotype is said to be common if its frequency is greater than formula image in the population under study). Third row: number of haplotypes in each block. Bottom plot: LD pattern (formula image) in the regions.
