BMC Genet. 2010 Jun 14:11:49.

doi: 10.1186/1471-2156-11-49.

An application of Random Forests to a genome-wide association dataset: methodological considerations & new findings

Benjamin A Goldstein¹, Alan E Hubbard, Adele Cutler, Lisa F Barcellos

PMID: 20546594
PMCID: PMC2896336
DOI: 10.1186/1471-2156-11-49

Benjamin A Goldstein et al. BMC Genet. 2010.

Affiliation

¹ Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA. bgoldstein@genepi.berkeley.edu

Abstract

Background: As computational power improves, the application of more advanced machine learning techniques to the analysis of large genome-wide association (GWA) datasets becomes possible. While most traditional statistical methods can only elucidate main effects of genetic variants on risk for disease, certain machine learning approaches are particularly suited to discovering higher-order and non-linear effects. One such approach is the Random Forests (RF) algorithm. The use of RF for SNP discovery related to human disease has grown in recent years; however, most work has focused on small datasets or simulation studies, which are limited.

Results: Using a multiple sclerosis (MS) case-control dataset comprising 300 K SNP genotypes across the genome, we outline an approach and some considerations for optimally tuning the RF algorithm based on the empirical dataset. Importantly, results show that typical default parameter values are not appropriate for large GWA datasets. Furthermore, gains can be made by sub-sampling the data, pruning based on linkage disequilibrium (LD), and removing strong effects from RF analyses. The new RF results are compared to findings from the original MS GWA study and demonstrate overlap. In addition, RF analysis identifies four interesting new candidate MS genes (MPHOSPH9, CTNNA3, PHACTR2 and IL7) that warrant further follow-up in independent studies.

Conclusions: This study presents one of the first illustrations of successfully analyzing GWA data with a machine learning algorithm. RF is shown to be computationally feasible for GWA data, and the results obtained make biological sense based on previous studies. More importantly, new genes were identified as potentially being associated with MS, suggesting new avenues of investigation for this complex disease.

Figures

**Figure 1**
**Random Forests Algorithm**. The RF algorithm begins by selecting a bootstrap sample of the data (1). A random subset of the variables is selected (2) and searched over to find the optimal split (3). This is repeated until an unpruned CART tree is formed (4). The data not in the bootstrap sample (the out-of-bag, or OOB, data) are run down the tree to derive the error rate and the variable importance (VI) measures (5). This is repeated until a full forest is grown (6).
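
The six steps in this caption can be sketched in a few dozen lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy genotype matrix `X`, the labels `y`, and all function names are hypothetical, splits use Gini impurity, and the VI measures of step 5 are omitted for brevity.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, rows, features):
    """Step 3: among the candidate features, find the split with the lowest weighted Gini."""
    best = None  # (score, feature, threshold)
    for f in features:
        for t in sorted({X[i][f] for i in rows})[:-1]:  # both sides stay non-empty
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def grow_tree(X, y, rows, mtry, rng):
    """Steps 2-4: grow an unpruned tree, drawing mtry candidate features at every node."""
    labels = [y[i] for i in rows]
    if len(set(labels)) == 1:
        return labels[0]  # pure leaf
    split = best_split(X, y, rows, rng.sample(range(len(X[0])), mtry))
    if split is None:  # simplification: none of the drawn features varies in this node
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    left = [i for i in rows if X[i][f] <= t]
    right = [i for i in rows if X[i][f] > t]
    return (f, t, grow_tree(X, y, left, mtry, rng), grow_tree(X, y, right, mtry, rng))

def predict(tree, x):
    """Run one sample down a tree; internal nodes are tuples, leaves are class labels."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def random_forest(X, y, ntree=50, mtry=3, seed=0):
    """Steps 1-6: return the forest and its out-of-bag (OOB) error rate."""
    rng = random.Random(seed)
    n = len(X)
    forest, oob_votes = [], [Counter() for _ in range(n)]
    for _ in range(ntree):  # step 6: repeat until the forest is grown
        boot = [rng.randrange(n) for _ in range(n)]  # step 1: bootstrap sample
        tree = grow_tree(X, y, boot, mtry, rng)
        forest.append(tree)
        for i in set(range(n)) - set(boot):  # step 5: OOB samples vote
            oob_votes[i][predict(tree, X[i])] += 1
    scored = [i for i in range(n) if oob_votes[i]]
    errors = sum(oob_votes[i].most_common(1)[0][0] != y[i] for i in scored)
    return forest, (errors / len(scored) if scored else float("nan"))

# Toy data: 200 samples, 10 SNPs coded 0/1/2; only SNP 0 determines case status.
data_rng = random.Random(42)
X = [[data_rng.randint(0, 2) for _ in range(10)] for _ in range(200)]
y = [1 if row[0] > 0 else 0 for row in X]

forest, oob_error = random_forest(X, y, ntree=50, mtry=3)
print(f"OOB error with mtry=3: {oob_error:.3f}")
```

Because each tree is evaluated only on samples left out of its own bootstrap sample, the OOB error acts as a built-in estimate of prediction error without a separate test set.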

**Figure 2**
**Analysis Flow**. Flow plan for the RF analysis. The full MS case-control dataset was analyzed, searching for the optimal *mtry* and *ntree*, along with sparsity pruning, as necessary. Two runs were then conducted, one without any 6p genotypes, and one with data for a single 6p SNP. Finally, LD pruning was explored. After the best data configuration was found, RF analysis was re-run to examine stability of results. The final RF results were compared to the original GWA results [19].

**Figure 3**
**Scree Plots for top 100 RF VI measures**. The three plots represent the VI measures for the full dataset with chromosome 6p data removed, the R² = 0.99 run and the R² = 0.90 run. An "elbow" is present in all three plots around 25 markers (designated with the vertical line).

**Figure 4**
**Convergence of Error Rate Across Different mtrys**. An examination of the error rate across different *mtry* values. The larger *mtry* values of 0.1p and above clearly lead to a much lower error rate than the more traditional lower values. An *mtry* of 0.1p seems to minimize the overall OOB error rate, though not by much. Convergence seems to occur around 200-400 trees.
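
To make the scale of these *mtry* values concrete, the sketch below computes a candidate grid for p predictors and picks the value with the lowest OOB error. It assumes that the "traditional lower values" correspond to the common sqrt(p) default for classification, which the caption does not state explicitly; `mtry_candidates`, `pick_mtry`, and the caller-supplied `oob_error` function are hypothetical names, not part of any library.

```python
import math

def mtry_candidates(p):
    """Candidate mtry values for p predictors: sqrt(p) plus the fractions of p in the captions."""
    return {
        "sqrt(p)": max(1, round(math.sqrt(p))),
        "0.1p": max(1, round(0.1 * p)),
        "0.5p": max(1, round(0.5 * p)),
        "p": p,
    }

def pick_mtry(p, oob_error):
    """Return (label, mtry, error) for the candidate with the lowest OOB error.

    `oob_error` is a caller-supplied function mapping mtry to an OOB error rate
    (hypothetical here; in practice it would fit a forest and read off its OOB error).
    """
    candidates = mtry_candidates(p)
    scores = {label: oob_error(m) for label, m in candidates.items()}
    best = min(scores, key=scores.get)
    return best, candidates[best], scores[best]

# Candidate grid for a dataset of 300 K SNPs.
for label, m in mtry_candidates(300_000).items():
    print(f"{label:>8}: {m}")
```

For 300 K SNPs, sqrt(p) is only about 548 variables per split, whereas 0.1p is 30,000, which shows how far apart the default and the better-performing settings are.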

**Figure 5**
**Sparsity of SNPs across mtry**. As expected, sparsity increases as a function of *mtry*. The most dramatic increase occurs when moving from an *mtry* of 0.5p to p.

**Figure 6**
**Error Rate Across LD Prunes**. The red line shows the OOB error rate across the different LD prunes. Little information is lost going from the full data to pruning at 99% and even 90%; beyond that, more information is lost. The blue line shows the number of SNPs included in each RF analysis.
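
The record does not say which LD-pruning procedure was used, so the following is only a greedy sketch of the general idea: a SNP is dropped when its squared correlation (r²) with a recently kept SNP reaches the threshold. The function names, the `window` parameter, and the toy genotype vectors are all assumptions made for illustration.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors (coded 0/1/2)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    return cov * cov / (var_a * var_b)

def ld_prune(snps, threshold, window=50):
    """Greedy pruning: keep a SNP only if r^2 with each of the last `window` kept SNPs is below threshold.

    `snps` is a list of per-SNP genotype vectors, ordered by genomic position.
    Returns the indices of the SNPs that survive pruning.
    """
    kept = []
    for j, genotypes in enumerate(snps):
        if all(r_squared(genotypes, snps[k]) < threshold for k in kept[-window:]):
            kept.append(j)
    return kept

# Toy example: SNP 1 duplicates SNP 0, SNP 2 is uncorrelated, SNP 3 is partly correlated.
snps = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [0, 0, 0, 2, 2, 2],
    [0, 1, 2, 0, 2, 1],
]
print(ld_prune(snps, threshold=0.99))
print(ld_prune(snps, threshold=0.50))
```

A looser threshold such as 0.99 removes only near-duplicate markers, while stricter thresholds remove progressively more SNPs, which is the trade-off the two lines in the figure describe.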

Cited by

Detection of Hereditary 1,25-Hydroxyvitamin D-Resistant Rickets Caused by Uniparental Disomy of Chromosome 12 Using Genome-Wide Single Nucleotide Polymorphism Array.
Tamura M, Isojima T, Kawashima M, Yoshida H, Yamamoto K, Kitaoka T, Namba N, Oka A, Ozono K, Tokunaga K, Kitanaka S. PLoS One. 2015 Jul 8;10(7):e0131157. doi: 10.1371/journal.pone.0131157. eCollection 2015. PMID: 26153892. Free PMC article.
On the overestimation of random forest's out-of-bag error.
Janitza S, Hornung R. PLoS One. 2018 Aug 6;13(8):e0201904. doi: 10.1371/journal.pone.0201904. eCollection 2018. PMID: 30080866. Free PMC article.
Targeted Metabolomics Analysis Suggests That Tacrolimus Alters Protection against Oxidative Stress.
Joncquel M, Labasque J, Demaret J, Bout MA, Hamroun A, Hennart B, Tronchon M, Defevre M, Kim I, Kerckhove A, George L, Gilleron M, Dessein AF, Zerimech F, Grzych G. Antioxidants (Basel). 2023 Jul 12;12(7):1412. doi: 10.3390/antiox12071412. PMID: 37507951. Free PMC article.
KLFDAPC: a supervised machine learning approach for spatial genetic structure analysis.
Qin X, Chiang CWK, Gaggiotti OE. Brief Bioinform. 2022 Jul 18;23(4):bbac202. doi: 10.1093/bib/bbac202. PMID: 35649387. Free PMC article.
Exploiting SNP correlations within random forest for genome-wide association studies.
Botta V, Louppe G, Geurts P, Wehenkel L. PLoS One. 2014 Apr 2;9(4):e93379. doi: 10.1371/journal.pone.0093379. eCollection 2014. PMID: 24695491. Free PMC article.

References

1. WTCCC. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
2. Heidema AG, Boer JM, Nagelkerke N, Mariman EC, van der A DL, Feskens EJ. The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genet. 2006;7:23. doi: 10.1186/1471-2156-7-23.
3. Kooperberg C, Ruczinski I. Identifying interacting SNPs using Monte Carlo logic regression. Genet Epidemiol. 2005;28:157–170. doi: 10.1002/gepi.20042.
4. Motsinger AA, Ritchie MD. Multifactor dimensionality reduction: an analysis strategy for modelling and detecting gene-gene interactions in human genetics and pharmacogenomics studies. Hum Genomics. 2006;2:318–328.
5. Yoon Y, Song J, Hong S, Kim J. Analysis of multiple single nucleotide polymorphisms of candidate genes related to coronary heart disease susceptibility by using support vector machines. Clin Chem Lab Med. 2003;41:529–534. doi: 10.1515/CCLM.2003.080.