Screening large-scale association study data: exploiting interactions using random forests

Kathryn L Lunetta¹, L Brooke Hayward, Jonathan Segal, Paul Van Eerdewegh

PMID: 15588316
PMCID: PMC545646
DOI: 10.1186/1471-2156-5-32

BMC Genet. 2004 Dec 10;5:32.

Affiliation

¹ Oscient Pharmaceuticals, Inc, (formerly Genome Therapeutics Corporation), Waltham, Massachusetts, USA. klunetta@bu.edu

Abstract

Background: Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some criterion for further study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable that takes into account interactions among variables without requiring model specification. Interactions increase the importance for the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs using complex disease models with up to 32 loci, incorporating both genetic heterogeneity and multi-locus interaction.
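The point about interaction effects with small marginal effects can be made concrete with a toy example. The sketch below is not the authors' simulation: it assumes a hypothetical data set of 40 individuals, reduces each SNP to carrier/non-carrier status so that a 2x2 Fisher exact test applies, and makes disease status depend only on the combination of two risk SNPs (here labeled `rSNP1` and `rSNP2`, with one noise SNP `nSNP`). All names and counts are illustrative assumptions. Under this setup the univariate screen gives both risk SNPs a p-value of 1 and ranks the noise SNP first, while the joint effect is highly significant.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are no more probable than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def carrier_table(carrier, is_case):
    """Counts: case carriers, case non-carriers, control carriers, control non-carriers."""
    a = sum(1 for g, y in zip(carrier, is_case) if y and g)
    b = sum(1 for g, y in zip(carrier, is_case) if y and not g)
    c = sum(1 for g, y in zip(carrier, is_case) if not y and g)
    d = sum(1 for g, y in zip(carrier, is_case) if not y and not g)
    return a, b, c, d


# Hypothetical toy data: 10 individuals per (rSNP1, rSNP2) carrier combination.
# The third number is how many of those 10 carry the noise SNP (an assumption
# chosen so the noise SNP shows a modest chance association).
rows = []
for s1, s2, noise_carriers in [(0, 0, 3), (0, 1, 6), (1, 0, 6), (1, 1, 3)]:
    for i in range(10):
        rows.append((s1, s2, 1 if i < noise_carriers else 0))

snps = {
    "rSNP1": [r[0] for r in rows],
    "rSNP2": [r[1] for r in rows],
    "nSNP": [r[2] for r in rows],
}
# Pure interaction: an individual is a case only when carrying exactly one risk SNP.
is_case = [s1 ^ s2 for s1, s2, _ in rows]

# Univariate screen: one Fisher exact test per SNP, ranked by p-value.
pvals = {name: fisher_exact_2x2(*carrier_table(g, is_case)) for name, g in snps.items()}
ranking = sorted(pvals, key=pvals.get)

# Joint test on the two-SNP combination, which a univariate screen never sees.
joint = [a ^ b for a, b in zip(snps["rSNP1"], snps["rSNP2"])]
p_joint = fisher_exact_2x2(*carrier_table(joint, is_case))

print("univariate p-values:", pvals)
print("univariate ranking:", ranking)
print("joint p-value:", p_joint)
```

A pre-specified joint test like the last one is exactly what becomes impractical with thousands of SNPs, since the number of candidate combinations grows combinatorially. The sketch does not implement random forests; it only shows the failure mode of the univariate screen that motivates them.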

Results: Keeping other factors constant, if risk SNPs interact, the random forest importance measure significantly outperforms the Fisher Exact test as a screening tool. As the number of interacting SNPs increases, the improvement in performance of random forest analysis relative to Fisher Exact test for screening also increases. Random forests perform similarly to the univariate Fisher Exact test as a screening tool when SNPs in the analysis do not interact.

Conclusions: In the context of large-scale genetic association studies where unknown interactions exist among true risk-associated SNPs or SNPs and environmental covariates, screening SNPs using random forest analyses can significantly reduce the number of SNPs that need to be retained for further study compared to standard univariate screening methods.
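How many SNPs a screen must retain can be summarized with the two quantities reported in the figures: the proportion of replicates in which the top N SNPs are all risk SNPs, and the proportion in which every risk SNP falls within the top N. The following is a minimal sketch of those two summaries. The function names, SNP labels, and the four replicate rankings are hypothetical and chosen only for illustration. Any ranking can be scored this way, whether it comes from a random forest importance measure or from Fisher exact test p-values.

```python
def top_n_all_risk(ranking, risk_snps, n):
    """True if the n top-ranked SNPs are all risk SNPs."""
    return all(snp in risk_snps for snp in ranking[:n])


def all_risk_in_top_n(ranking, risk_snps, n):
    """True if every risk SNP appears among the n top-ranked SNPs."""
    return set(risk_snps) <= set(ranking[:n])


def proportion(replicate_rankings, risk_snps, n, criterion):
    """Fraction of replicates whose ranking satisfies the given criterion."""
    hits = sum(criterion(r, risk_snps, n) for r in replicate_rankings)
    return hits / len(replicate_rankings)


# Hypothetical rankings from four simulated replicates (best-ranked SNP first).
risk = {"r1", "r2"}
replicates = [
    ["r1", "r2", "n1", "n2"],
    ["r1", "n1", "r2", "n2"],
    ["n1", "r1", "r2", "n2"],
    ["r2", "r1", "n2", "n1"],
]

for n in (1, 2, 3):
    print(
        n,
        proportion(replicates, risk, n, top_n_all_risk),
        proportion(replicates, risk, n, all_risk_in_top_n),
    )
```

The second summary is the one most directly tied to the screening goal: the smallest N at which it approaches 1 indicates how many SNPs would need to be carried forward to avoid losing a true risk SNP.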

Figures

**Figure 1**
Proportion of replicates for which the most significant 1, 2, 3, and 4 SNPs are all rSNPs for K4S2N100 and K4S2N1000 analysis designs. Genetic models are listed on the plots. "RF" and "Fisher" refer to the random forest importance index Z_T and the Fisher Exact test p-value, respectively. See text for notation description.

**Figure 2**
Proportion of replicates for which all rSNPs are among the top-ranking N SNPs for K4S2N100 and K4S2N1000 analysis designs. Other notation as in Figure 1.

**Figure 3**
Proportion of replicates for which the most significant N SNPs are all rSNPs. H8M4 genetic model. Analysis designs include 96 noise SNPs; K and S are listed on the plots. Other notation as in Figure 1.

**Figure 4**
Proportion of replicates for which all rSNPs are among the top-ranking N SNPs for H8M4 genetic model. Analysis designs include 96 noise SNPs; K and S are listed on the plots.

**Figure 5**
Distribution of the difference in importance Z_T between the top-ranked rSNP and the top-ranked nSNP (Dmax(Z_T)), and between the lowest-ranked rSNP and the top-ranked nSNP (Dmin(Z_T)). Dmax(-log p) and Dmin(-log p): the same differences using the -log10 p-value from the Fisher exact test. Beside each boxplot is the p-value for the test of whether the mean difference over the 100 replicates differs significantly from 0. Genetic models are listed in the plot. Analysis design: K4S2, with N100 and N1000 shown on the plot.
