2004 Jun 9:5:76.
doi: 10.1186/1471-2105-5-76.

A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases

Michelle L Green et al. BMC Bioinformatics.

Abstract

Background: The PathoLogic program constructs Pathway/Genome databases by using a genome's annotation to predict the set of metabolic pathways present in an organism. PathoLogic determines the set of reactions composing those pathways from the enzymes annotated in the organism's genome. Most annotation efforts fail to assign function to 40-60% of sequences. In addition, large numbers of sequences may have non-specific annotations (e.g., thiolase family protein). Pathway holes occur when a genome appears to lack the enzymes needed to catalyze reactions in a pathway. If a protein has not been assigned a specific function during the annotation process, any reaction catalyzed by that protein will appear as a missing enzyme or pathway hole in a Pathway/Genome database.

Results: We have developed a method that efficiently combines homology and pathway-based evidence to identify candidates for filling pathway holes in Pathway/Genome databases. Our program not only identifies potential candidate sequences for pathway holes, but also combines data from multiple, heterogeneous sources to assess the likelihood that a candidate has the required function. Our algorithm emulates the manual sequence annotation process, considering not only evidence from homology searches but also evidence from genomic context (i.e., is the gene part of an operon?) and functional context (e.g., are there functionally-related genes nearby in the genome?) to determine the posterior belief that a candidate has the required function. The method can be applied across an entire metabolic pathway network and is generally applicable to any pathway database. The program uses a set of sequences encoding the required activity in other genomes to identify candidate proteins in the genome of interest, and then evaluates each candidate by using a simple Bayes classifier to determine the probability that the candidate has the desired function. We achieved 71% precision at a probability threshold of 0.9 during cross-validation using known reactions in computationally-predicted pathway databases. After applying our method to 513 pathway holes in 333 pathways from three Pathway/Genome databases, we increased the number of complete pathways by 42%. We made putative assignments to 46% of the holes, including annotation of 17 sequences of previously unknown function.
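
To make the classification step concrete, the following is a minimal sketch of how a simple (naive) Bayes classifier turns per-evidence likelihoods into a posterior probability that a candidate has the required function. It is an illustration under stated assumptions, not the authors' implementation: the function name `posterior_has_function`, the prior of 0.5, and the second likelihood pair are hypothetical. The first likelihood pair (0.16, 0.02) is the average-fraction-aligned example quoted in the Figure 5 caption.

```python
from math import prod


def posterior_has_function(prior, likelihoods):
    """Posterior P(has-function | evidence) under a naive Bayes assumption.

    prior       -- P(has-function) before any evidence (hypothetical value)
    likelihoods -- list of (P(e | has-function), P(e | not has-function))
                   pairs, one pair per observed evidence node
    """
    # Evidence nodes are treated as conditionally independent given the class,
    # so their likelihoods multiply.
    p_yes = prior * prod(p for p, _ in likelihoods)
    p_no = (1.0 - prior) * prod(q for _, q in likelihoods)
    total = p_yes + p_no
    if total == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return p_yes / total


# One evidence node: the Figure 5 example values, with an assumed prior of 0.5.
single = posterior_has_function(0.5, [(0.16, 0.02)])

# Adding a second, purely hypothetical evidence node (0.6 vs 0.1).
combined = posterior_has_function(0.5, [(0.16, 0.02), (0.6, 0.1)])

# A candidate would be accepted at the 0.9 probability threshold used in
# the cross-validation reported above.
accept = combined >= 0.9
```

With these inputs, a single piece of evidence leaves the candidate just below the 0.9 threshold, while two independent pieces of supporting evidence push it above. This is the sense in which combining heterogeneous evidence can outperform any single score.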

Conclusions: Our pathway hole filler can be used not only to increase the utility of Pathway/Genome databases to both experimental and computational researchers, but also to improve predictions of protein function.

Figures

Figure 1. Example pathway created by PathoLogic for the Caulobacter crescentus PGDB, CauloCyc. The enzymes for the quinolinate synthetase, nicotinate-nucleotide pyrophosphorylase, and NAD(+) synthetase reactions are known. The enzymes for the 1.4.3.-, nicotinate-nucleotide adenylyltransferase, and NAD(+) synthase (glutamine-hydrolyzing) reactions are missing.
Figure 2. Overall algorithm for filling pathway holes.
Figure 3. A graphical representation of the data consolidation process.
Figure 4. Network N0 includes all single-outline nodes plus both the solid and dashed arcs. The simple Bayes classifier used by our program (N1) includes all nodes but excludes the dashed arcs. We simplified the network, excluding the dashed arcs, to reduce the amount of data required to accurately construct the conditional probability distributions needed for the network.
Figure 5. Example of conditional probability distribution calculated from the candidates identified for the known reactions in CauloCyc. This figure shows the probability distribution for the average-fraction-aligned node. The set of candidates for all known reactions in the PGDB was divided into two subsets: true hits (those candidates that are assigned to a particular reaction in the PGDB) and false hits (those candidates that are not assigned to a particular reaction in the PGDB). We partition the values into nonoverlapping bins and determine the frequency of candidates within each bin for the two sets of hits. These frequencies make up the conditional probability distributions used for our Bayesian network. For example, if the average fraction of query aligned for a candidate is 0.84, then P(average-fraction-aligned = 0.84 | has-function) = 0.16 and P(average-fraction-aligned = 0.84 | ¬has-function) = 0.02.
Figure 6. True positives versus false positives for classification using E-value cutoff alone, and using the Bayes classifier model without E-values. The inset shows the fraction of true positives versus number of false positives as determined by model 1 for all three PGDBs evaluated.
Figure 7. Fraction of pathway holes filled as a function of probability threshold.
Figure 8. Pyridine biosynthesis pathway with putative enzyme assignments identified by our program.
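
The binning procedure described in the Figure 5 caption (partition an evidence value into nonoverlapping bins, then use the per-bin frequency of true hits and false hits as the conditional probabilities) can be sketched as follows. This is a hedged illustration only: the function names, the choice of 10 equal-width bins over [0, 1], the optional pseudocount, and the toy data are all assumptions, not values taken from the paper.

```python
def bin_index(value, n_bins=10):
    """Map a value in [0, 1] to one of n_bins equal-width bins."""
    # Clamp so that value == 1.0 falls in the last bin rather than overflowing.
    return min(int(value * n_bins), n_bins - 1)


def binned_distribution(values, n_bins=10, pseudocount=0.0):
    """Estimate P(bin | class) as the fraction of values landing in each bin.

    pseudocount is an optional smoothing term (an assumption, not from the
    source) that keeps empty bins from producing zero probabilities.
    """
    counts = [0] * n_bins
    for v in values:
        counts[bin_index(v, n_bins)] += 1
    denom = len(values) + pseudocount * n_bins
    return [(c + pseudocount) / denom for c in counts]


# Toy average-fraction-aligned values (hypothetical data).
true_hits = [0.97, 0.91, 0.84, 0.82, 0.63]   # candidates assigned to the reaction
false_hits = [0.12, 0.22, 0.27, 0.34, 0.85]  # candidates not assigned to it

p_given_function = binned_distribution(true_hits)
p_given_not_function = binned_distribution(false_hits)

# Look up the two conditional probabilities for a new candidate at 0.84.
b = bin_index(0.84)
likelihood_pair = (p_given_function[b], p_given_not_function[b])
```

A pair looked up this way plays the same role as the (0.16, 0.02) example in the Figure 5 caption: it is the evidence for one node, to be combined with the other nodes by the classifier.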
