Proc Natl Acad Sci U S A. 2006 Oct 3;103(40):14865-70.
doi: 10.1073/pnas.0605152103. Epub 2006 Sep 21.

Predicting interpretability of metabolome models based on behavior, putative identity, and biological relevance of explanatory signals

David P Enot et al. Proc Natl Acad Sci U S A.

Abstract

Powerful algorithms are required to deal with the dimensionality of metabolomics data. Although many achieve high classification accuracy, the models they generate have limited value unless it can be demonstrated that they are reproducible and statistically relevant to the biological problem under investigation. Random forest (RF) generates models, without any requirement for dimensionality reduction or feature selection, in which individual variables are ranked for significance and displayed in an explicit manner. In metabolome fingerprinting by mass spectrometry, each metabolite can be represented by signals at several m/z. Exploiting a prior understanding of expected biochemical differences between sample classes, we aimed to develop meaningful metrics relevant to the significance both of the overall RF model and individual, potentially explanatory, signals. Pair-wise comparison of related plant genotypes with strong phenotypic differences demonstrated that robust models are not only reproducible but also logically structured, highlighting correlated m/z derived from just a small number of explanatory metabolites reflecting the biological differences between sample classes. RF models were also generated by using groupings of samples known to be increasingly phenotypically similar. Although classification accuracy was often reasonable, we demonstrated reproducibly in both Arabidopsis and potato a performance threshold based on margin statistics beyond which such models showed little structure indicative of either generalizability or further biological interpretability. In a multiclass problem using 25 Arabidopsis genotypes, despite the complicating effects of ecotype background and secondary metabolome perturbations common to several mutations, the ranking of metabolome signals by RF provided scope for deeper interpretability.
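
The margin statistic mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual random forest definition of a margin (the fraction of trees voting for a sample's true class minus the largest fraction voting for any other class) and takes the model margin to be the mean over samples. The function names, class labels, and m/z values below are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from collections import Counter
from statistics import mean


def sample_margin(tree_votes, true_label):
    """Margin for one sample: share of trees voting for the true class
    minus the largest share voting for any single other class."""
    counts = Counter(tree_votes)
    correct = counts.get(true_label, 0)
    # Largest vote count among the wrong classes (0 if there are none).
    wrong = max(
        (c for label, c in counts.items() if label != true_label),
        default=0,
    )
    return (correct - wrong) / len(tree_votes)


def model_margin(votes_per_sample, true_labels):
    """Mean margin across samples; values near 0 indicate a weak model."""
    return mean(
        sample_margin(votes, label)
        for votes, label in zip(votes_per_sample, true_labels)
    )


def rank_signals(importance_by_signal):
    """Order signals (e.g. m/z bins) by descending importance score."""
    return sorted(importance_by_signal, key=importance_by_signal.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical per-tree votes for three samples from a 10-tree forest.
    votes = [
        ["mutant"] * 9 + ["progenitor"] * 1,
        ["progenitor"] * 6 + ["mutant"] * 4,
        ["mutant"] * 5 + ["progenitor"] * 5,
    ]
    labels = ["mutant", "progenitor", "mutant"]
    print("model margin:", model_margin(votes, labels))

    # Hypothetical importance scores keyed by m/z signal.
    scores = {"m/z 133": 0.02, "m/z 191": 0.10, "m/z 192": 0.07}
    print("ranked signals:", rank_signals(scores))
```

A margin of 1 means every tree voted for the true class, whereas a negative margin means the sample was misclassified by majority vote. A model can therefore reach reasonable accuracy while its mean margin stays low.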

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Determining the characteristics of robust RF models. (A) Variable importance score versus ranking in weak RF models comparing Arabidopsis ammonia transporter mutant lines and brassinosteroid synthesis antisense lines to progenitor ecotypes. (B) Ordered list of top-ranking signals from data depicted in A; correlated variables (e.g., isotopes) are color-coded and variables shared between models in the same metaclass are boxed. (C and E) Variable importance score versus ranking in stronger RF models comparing potato transgenic lines (C) and Arabidopsis mutants (E) pair-wise with their progenitor genotypes. (D and F) Top-ranking signals (descending order) from a selection of models depicted in A and E; m/z representing correlated variables (e.g., isotopes, salt adducts, and common fragments) are shaded in both lists.
Fig. 2.
Relationship between margin and variable significance in metabolome fingerprint models. Overall model P value (log10) when including increasing numbers of top-ranking variables in RF analysis of Arabidopsis (A) and potato (C) genotypes. Overall model margins when including variables with increasing P value in RF analysis of Arabidopsis (B) and potato (D) genotypes. A suggested significance threshold is indicated.
Fig. 3.
Metabolome modeling with larger multiple-class problems. (A) Two-dimensional mapping of 25 Arabidopsis lines using Sammon nonlinear mapping. Control ecotypes are colored blue, and progenitor ecotypes of mutant lines are presented as squares. The ecotype background of mutant lines is depicted by color: red, LeO; yellow, C24; pink, Ws0; green, Columbia. The lines linking phenotypically related genotypes represent margins in pair-wise comparisons and are color-coded as follows: black solid line, <0.1; yellow dotted line, 0.1–0.2; blue dashed line, 0.2–0.3. Margins >0.3 have been omitted from the representation. (B) Top-ranking signals in common (color-coded) between RF models representing pair-wise comparisons between selected defense-related and UV-sensitive genotypes and their progenitor ecotypes. (C) RF models comparing lesion mimic mutants (ls1 and ls5) with the progenitor genotype (Ws0), indicating the presence of many common signals (color-coded). (D) A correlation analysis of variables contributing significantly (P < 0.005) to models discriminating mutant lines ls1 and ls5 from the progenitor ecotype Ws0.
