J Comput Biol. 2009 Dec;16(12):1705-18.
doi: 10.1089/cmb.2008.0037.

Use of wrapper algorithms coupled with a random forests classifier for variable selection in large-scale genomic association studies

Andrei S Rodin et al. J Comput Biol. 2009 Dec.

Abstract

Modern large-scale genetic association studies generate increasingly high-dimensional datasets. Therefore, some variable selection procedure should be performed before the application of traditional data analysis methods, for reasons of both computational efficiency and problems related to overfitting. We describe here a "wrapper" strategy (SIZEFIT) for variable selection that uses a Random Forests classifier, coupled with various local search/optimization algorithms. We apply it to a large dataset consisting of 2,425 African-American and non-Hispanic white individuals genotyped for 4,869 single-nucleotide polymorphisms (SNPs) in a coronary heart disease (CHD) case-cohort association study (Atherosclerosis Risk in Communities), using incident CHD and plasma low-density lipoprotein (LDL) cholesterol levels as the dependent variables. We show that most SNPs can be safely removed from the dataset without compromising the predictive (classification) accuracy, with only a small number of SNPs (sometimes less than 100) containing any predictive signal. A statistical (SUMSTAT) approach is also applied to the dataset for comparison purposes. We describe a novel method for refining the subset of signal-containing SNPs (FIXFIT), based on an Extremal Optimization algorithm. Finally, we compare the top SNP rankings obtained by different methods and devise practical guidelines for researchers trying to generate a compact subset of predictive SNPs from genome-wide association datasets. Interestingly, there is a significant amount of overlap between seemingly very heterogeneous rankings. We conclude by constructing compact optimal predictive SNP subsets for CHD (less than 150 SNPs) and LDL (less than 300 SNPs) phenotypes, and by comparing various rankings for two well-known positive control SNPs for LDL in the apolipoprotein E gene.
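
The wrapper idea described in the abstract can be illustrated with a minimal, generic sketch: score the current variable subset with a classifier, drop the least important variables, and repeat, keeping the smallest subset with the lowest estimated error. This is not the authors' SIZEFIT implementation, whose details are not given in the abstract. The function `backward_wrapper`, the `keep_fraction` parameter, and the `toy_score` stand-in (which replaces a real Random Forests fit with a made-up error model) are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer interface: given a variable subset, return
# (estimated classification error, one importance value per variable).
Scorer = Callable[[Sequence[int]], Tuple[float, List[float]]]


def backward_wrapper(variables, score: Scorer, keep_fraction=0.5, min_size=1):
    """Score the current subset, drop the least important variables, repeat."""
    current = list(variables)
    history = []  # (subset, error) for every subset size visited
    while True:
        error, importances = score(current)
        history.append((list(current), error))
        if len(current) <= min_size:
            break
        # Always shrink by at least one variable so the loop terminates.
        n_keep = max(min_size,
                     min(len(current) - 1, int(len(current) * keep_fraction)))
        ranked = sorted(zip(importances, current), reverse=True)
        current = [var for _, var in ranked[:n_keep]]
    # Prefer the lowest error; break ties in favour of the smaller subset.
    best_subset, best_error = min(history, key=lambda h: (h[1], len(h[0])))
    return best_subset, best_error, history


# Toy stand-in for a classifier (an assumption, not real data): variables
# 3 and 7 carry signal, and every other variable adds a little noise.
SIGNAL = {3: 0.20, 7: 0.15}


def toy_score(subset):
    n_noise = sum(1 for var in subset if var not in SIGNAL)
    error = 0.5 - sum(SIGNAL.get(var, 0.0) for var in subset) + 0.002 * n_noise
    importances = [SIGNAL.get(var, 0.0) for var in subset]
    return error, importances


best, err, hist = backward_wrapper(range(20), toy_score)
print("subset sizes visited:", [len(subset) for subset, _ in hist])
print("selected variables:", sorted(best), "estimated error:", round(err, 3))
```

In a real analysis the scorer would fit a classifier on the genotype matrix restricted to the subset. One option, assuming scikit-learn is available, is `RandomForestClassifier(oob_score=True)`, taking the error as `1 - oob_score_` and the importances from `feature_importances_`.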

Figures

FIG. 1. Multistep data analysis strategy.

FIG. 2. (A) Application of SIZEFIT algorithm to 537 African-Americans in ARIC dataset. See text for detail. Note the nonlinear scale (for higher resolution in the more important “under-450-SNPs” area). Up to 4,870 predictive variables (4,869 SNPs and sex) were fed into the RF to predict CHD case (266)/control (271) status. OOB error is marked by circles, control OOB error is marked by triangles. (B) Application of SIZEFIT algorithm to 1,396 non-Hispanic whites in ARIC dataset. See text for detail. Note the nonlinear scale (for higher resolution in the more important under-450-SNPs area). Up to 4,811 predictive variables (4,810 SNPs and sex) were fed into the RF to predict CHD case (897)/control (499) status. OOB error is marked by circles, control OOB error is marked by triangles. ARIC, Atherosclerosis Risk in Communities; SNPs, single-nucleotide polymorphisms; RF, Random Forests; CHD, coronary heart disease; OOB, out-of-bag.

FIG. 3. Application of SIZEFIT algorithm to 537 African-Americans in ARIC dataset using fivefold cross-validation as a generalization error estimate. Up to 4,870 predictive variables (4,869 SNPs and sex) were fed into the RF to predict CHD case (266)/control (271) status. ARIC, Atherosclerosis Risk in Communities; SNPs, single-nucleotide polymorphisms; RF, Random Forests; CHD, coronary heart disease.

FIG. 4. Convergence of FIXFIT algorithm. See text for details. One cycle corresponds to 10 kernel recalculations (seconds to minutes of CPU time).

FIG. 5. Application of SUMSTAT algorithm to 537 African-Americans in ARIC dataset. See text for detail. Note that the curve reaches nadir at 90 SNPs. Up to 4,869 SNPs were used to predict CHD case (266)/control (271) status. ARIC, Atherosclerosis Risk in Communities; SNPs, single nucleotide polymorphisms; CHD, coronary heart disease.
