GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest
- PMID: 17767709
- PMCID: PMC2034606
- DOI: 10.1186/1471-2105-8-328
Abstract
Background: Microarray data are often used for patient classification and gene selection. An appropriate tool for end users and biomedical researchers should combine user friendliness with statistical rigor, including carefully avoiding selection biases and allowing analysis of multiple solutions, together with access to additional functional information of selected genes. Methodologically, such a tool would be of greater use if it incorporates state-of-the-art computational approaches and makes source code available.
Results: We have developed GeneSrF, a web-based tool, and varSelRF, an R package, that implement, in the context of patient classification, a validated method for selecting very small sets of genes while preserving classification accuracy. Computation is parallelized, so the software can take advantage of multicore CPUs and clusters of workstations. Output includes bootstrapped estimates of the prediction error rate and assessments of the stability of the solutions. Clickable tables link to additional information for each gene (GO terms, PubMed citations, KEGG pathways), and output can be sent to PaLS for examination of the PubMed references, GO terms, and KEGG and Reactome pathways characteristic of the sets of genes selected for class prediction. The full source code is available, so the software can be extended. The web-based application is available from http://genesrf2.bioinfo.cnio.es. All source code is available from Bioinformatics.org or The Launchpad. The R package is also available from CRAN.
Conclusion: varSelRF and GeneSrF implement a validated method for gene selection that includes bootstrap estimates of the classification error rate. They are valuable tools for applied biomedical researchers, especially for exploratory work with microarray data. Because of the underlying technology (a combination of parallelization with a web-based application), they are also of methodological interest to bioinformaticians and biostatisticians.
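Illustration: the abstract stresses avoiding selection bias, bootstrapped estimates of prediction error, and assessment of the stability of the selected genes. The following is a minimal sketch of that general idea, not the varSelRF or GeneSrF implementation. It assumes a two-class problem, swaps in a simple nearest-centroid classifier and a mean-gap gene ranking in place of random forest, and uses a plain out-of-bag bootstrap estimate, which may differ from the estimator the tools report. All function names and the toy data are hypothetical. The key point is that gene selection is repeated inside every bootstrap sample, so the held-out samples never influence which genes are chosen.

```python
import random
from collections import Counter
from statistics import mean


def select_genes(X, y, k):
    """Stand-in selector (NOT random forest): keep the k genes with the
    largest gap between the two class means. Assumes exactly two classes."""
    c0, c1 = sorted(set(y))
    a = [x for x, label in zip(X, y) if label == c0]
    b = [x for x, label in zip(X, y) if label == c1]
    gap = [abs(mean([r[g] for r in a]) - mean([r[g] for r in b]))
           for g in range(len(X[0]))]
    return sorted(range(len(gap)), key=lambda g: -gap[g])[:k]


def fit(X, y, k):
    """Select genes on the training data only, then store class centroids."""
    genes = select_genes(X, y, k)
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean([r[g] for r in rows]) for g in genes]
    return genes, centroids


def predict(model, x):
    """Assign x to the class whose centroid is nearest on the selected genes."""
    genes, centroids = model

    def dist(label):
        return sum((x[g] - m) ** 2 for g, m in zip(genes, centroids[label]))

    return min(centroids, key=dist)


def bootstrap_error(X, y, k=2, n_boot=100, seed=0):
    """Out-of-bag bootstrap error plus per-gene selection frequency.

    Selection and fitting are redone in every resample; samples left out of
    the resample serve as the test set for that resample."""
    rng = random.Random(seed)
    n = len(X)
    errors, picked, used = [], Counter(), 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]
        # Skip degenerate resamples: nothing held out, or one class missing.
        if not oob or len({y[i] for i in idx}) < 2:
            continue
        model = fit([X[i] for i in idx], [y[i] for i in idx], k)
        picked.update(model[0])
        used += 1
        errors.append(mean([1.0 if predict(model, X[i]) != y[i] else 0.0
                            for i in oob]))
    if used == 0:
        raise ValueError("no usable bootstrap resample")
    freq = {g: count / used for g, count in picked.items()}
    return mean(errors), freq


def make_toy_data(n_per_class=10, n_genes=10, seed=1):
    """Synthetic data (an assumption for the demo): genes 0 and 1 are shifted
    in class 1, the remaining genes are pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            row[0] += 10.0 * label
            row[1] += 10.0 * label
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    err, freq = bootstrap_error(X, y, k=2, n_boot=50)
    print("out-of-bag error estimate:", err)
    print("selection frequency per gene:", dict(sorted(freq.items())))
```

The selection frequencies give a rough view of solution stability: genes chosen in nearly every resample are robust to perturbations of the data, while genes that appear only occasionally are not.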