Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2010 Jan 8:11:15.
doi: 10.1186/1471-2105-11-15.

Machine learning methods for metabolic pathway prediction

Affiliations

Machine learning methods for metabolic pathway prediction

Joseph M Dale et al. BMC Bioinformatics. .

Abstract

Background: A key challenge in systems biology is the reconstruction of an organism's metabolic network from its genome sequence. One strategy for addressing this problem is to predict which metabolic pathways, from a reference database of known pathways, are present in the organism, based on the annotated genome of the organism.

Results: To quantitatively validate methods for pathway prediction, we developed a large "gold standard" dataset of 5,610 pathway instances known to be present or absent in curated metabolic pathway databases for six organisms. We defined a collection of 123 pathway features, whose information content we evaluated with respect to the gold standard. Feature data were used as input to an extensive collection of machine learning (ML) methods, including naïve Bayes, decision trees, and logistic regression, together with feature selection and ensemble methods. We compared the ML methods to the previous PathoLogic algorithm for pathway prediction using the gold standard dataset. We found that ML-based prediction methods can match the performance of the PathoLogic algorithm. PathoLogic achieved an accuracy of 91% and an F-measure of 0.786. The ML-based prediction methods achieved accuracy as high as 91.2% and F-measure as high as 0.787. The ML-based methods output a probability for each predicted pathway, whereas PathoLogic does not, which provides more information to the user and facilitates filtering of predicted pathways.

Conclusions: ML methods for pathway prediction perform as well as existing methods, and have qualitative advantages in terms of extensibility, tunability, and explainability. More advanced prediction methods and/or more sophisticated input features may improve the performance of ML methods. However, pathway prediction performance appears to be limited largely by the ability to correctly match enzymes to the reactions they catalyze based on genome annotations.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Procedure for applying machine learning methods to metabolic pathway prediction. Data from curated pathway/genome databases (PGDBs) are gathered into a "gold standard" collection. Features are defined using biological knowledge, and their values are computed for all pathways in the gold standard. The resulting dataset is split into training and test sets. Training data are used to perform feature selection and parameter estimation for multiple predictor types. Test data are used to evaluate the predictors. The predictor which performs best on the test set will be applied to data from newly sequenced and annotated genomes to perform metabolic network reconstruction.
Figure 2
Figure 2
Escherichia coli K-12 MG1655 pathway 5-aminoimidazole ribonucleotide biosynthesis II. This pathway is present in E. coli; PathoLogic excludes it while our machine learning methods consistently predict it to be present. See Table 10 for selected feature values for this pathway.

References

    1. Keseler I, Bonavides-Martinez C, Collado-Vides J, Gama-Castro S, Gunsalus R, Johnson DA, Krummenacker M, Nolan L, Paley S, Paulsen I, Peralta-Gil M, Santos-Zavaleta A, Shearer A, Karp P. EcoCyc: A comprehensive view of E. coli biology. Nuc Acids Res. 2009;37:D464–70. doi: 10.1093/nar/gkn751. - DOI - PMC - PubMed
    1. Feist A, Henry C, Reed J, Krummenacker M, Joyce A, Karp P, Broadbelt L, Hatzimanikatis V, Palsson B. A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information. Mol Syst Biol. 2007;3:121–38. doi: 10.1038/msb4100155. - DOI - PMC - PubMed
    1. Paley S, Karp P. Evaluation of computational metabolic-pathway predictions for H. pylori. Bioinformatics. 2002;18(5):715–24. doi: 10.1093/bioinformatics/18.5.715. - DOI - PubMed
    1. Caspi R, Foerster H, Fulcher C, Kaipa P, Krummenacker M, Latendresse M, Paley S, Rhee SY, Shearer A, Tissier C, Walk T, Zhang P, Karp PD. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nuc Acids Res. 2008;36:D623–31. doi: 10.1093/nar/gkm900. - DOI - PMC - PubMed
    1. Green M, Karp P. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics. 2004;5:76. doi: 10.1186/1471-2105-5-76. - DOI - PMC - PubMed

Publication types

LinkOut - more resources