Genome Biol. 2007;8(9):R187.
doi: 10.1186/gb-2007-8-9-r187.

The LeFE algorithm: embracing the complexity of gene expression in the interpretation of microarray data

Gabriel S Eichler et al. Genome Biol. 2007.

Abstract

Interpretation of microarray data remains a challenge, and most methods fail to consider the complex, nonlinear regulation of gene expression. To address that limitation, we introduce Learner of Functional Enrichment (LeFE), a statistical/machine learning algorithm based on Random Forest, and demonstrate it on several diverse datasets: smoker/never smoker, breast cancer classification, and cancer drug sensitivity. We also compare it with previously published algorithms, including Gene Set Enrichment Analysis. LeFE regularly identifies statistically significant functional themes consistent with known biology.

Figures

Figure 1
The LeFE algorithm illustrated schematically for a category of two genes. See Materials and methods for further details and Table 4 for a description of the steps (keyed to the circled letters). LeFE, Learner of Functional Enrichment.
Figure 2
Importance plots (probability density distributions) of gene importance scores calculated by LeFE: smoker versus nonsmoker dataset. Shown are representative distributions for three gene categories (red curves) and their corresponding negative control gene sets (black curves). The curves were smoothed according to default settings of the 'density' function in R. The shifted secondary peaks, denoted by red arrows, for aldehyde metabolism and glutathione metabolism reflect genes important to the Random Forest models. The viral life cycle category contains no secondary peaks and therefore does not appear to be associated with smoking. See Results for further details.
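The smoothing mentioned in this caption can be illustrated with a small sketch. It assumes that R's 'density' defaults amount to a Gaussian kernel with the rule-of-thumb bandwidth 0.9 * min(sd, IQR/1.34) * n^(-1/5); it evaluates the estimate directly on a grid rather than reproducing R's binned implementation, and it omits R's fallback for zero spread. The function names and the importance scores are hypothetical and are not taken from the paper.

```python
from math import exp, pi, sqrt
from statistics import quantiles, stdev


def rule_of_thumb_bandwidth(data):
    """Bandwidth 0.9 * min(sd, IQR / 1.34) * n ** -0.2 (assumed R default)."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    spread = min(stdev(data), (q3 - q1) / 1.34)
    return 0.9 * spread * len(data) ** -0.2


def gaussian_kde(data, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    h = bandwidth if bandwidth is not None else rule_of_thumb_bandwidth(data)
    norm = 1.0 / (len(data) * h * sqrt(2 * pi))
    return [
        norm * sum(exp(-0.5 * ((x - d) / h) ** 2) for d in data)
        for x in grid
    ]


# Hypothetical gene importance scores: most near zero, a few shifted upward.
scores = [0.0, 0.1, -0.1, 0.05, -0.05, 0.2, 1.5, 1.6, 1.4]
step = 0.01
grid = [-2.0 + step * i for i in range(601)]  # covers -2.0 to 4.0
density = gaussian_kde(scores, grid)

# A density should integrate to roughly 1 over a sufficiently wide grid.
print(round(sum(density) * step, 3))
```

In a plot of such a curve, a group of genes with clearly higher importance scores would show up as a separate bump to the right of the main peak, which is the kind of secondary peak the caption describes.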
Figure 3
A comparison of LeFE with PathwayRF. Shown is a comparison of Learner of Functional Enrichment (LeFE) and PathwayRF with respect to the size distribution of categories identified as important for breast cancer classification using the Gene Ontology (GO) biological process categories. (a) Scatter plots showing category rank versus category size. Ties in category ranks were resolved through random reordering. Red lines are lowess regressions. (b) Comparison of GO superset and subset ranks. Almost all points for PathwayRF are below the blue x = y line, indicating that supersets rank lower (better) than their corresponding subsets. The panel for LeFE shows no such bias. (c) The GO biological process hierarchy (with the most general categories toward the top). Blue circles denote the top 25 categories ranked by PathwayRF; red circles denote the same for LeFE; and yellow circles denote categories in the top 25 for both algorithms. The mean GO level is 4.92 for PathwayRF and 7.08 for LeFE. There are no cases in which LeFE's top results are the ancestors of top results from PathwayRF. However, the black edges highlight eight cases in which LeFE found categories that are progeny of categories identified by PathwayRF.
Figure 4
Replicate applications of LeFE to the breast cancer classification dataset. Scatter plot comparing the ranks resulting from two applications of Learner of Functional Enrichment (LeFE) to the breast cancer classification dataset, with nr = 75, nc = 6, and nTree = 400. The inset is an enlargement of the top 50 categories. r denotes Pearson's correlation coefficient of the ranks (that is, the Spearman correlation coefficient).
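The equivalence noted in this caption, that Pearson's correlation computed on ranks is the Spearman correlation, can be checked with a minimal sketch. The helper names and the two lists of scores below are hypothetical illustrations, not data or code from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson's r applied to the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical category scores from two replicate runs (illustrative only).
run_a = [0.91, 0.40, 0.75, 0.10, 0.55]
run_b = [0.88, 0.35, 0.60, 0.20, 0.70]
print(round(spearman(run_a, run_b), 3))  # prints 0.9
```

With no ties, the same value follows from the shortcut formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between paired ranks; here the squared differences sum to 2 and n = 5, giving 0.9.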
