eLife. 2018 Jan 29:7:e31097.

doi: 10.7554/eLife.31097.

Prediction of enzymatic pathways by integrative pathway mapping

Sara Calhoun^#¹, Magdalena Korczynska^#², Daniel J Wichelecki^{3,4,5}, Brian San Francisco³, Suwen Zhao², Dmitry A Rodionov^{6,7}, Matthew W Vetting⁸, Nawar F Al-Obaidi⁸, Henry Lin², Matthew J O'Meara², David A Scott⁶, John H Morris⁹, Daniel Russel¹, Steven C Almo⁸, Andrei L Osterman⁶, John A Gerlt^{3,4,5}, Matthew P Jacobson², Brian K Shoichet², Andrej Sali^{1,2,10}

Affiliations

¹ Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, United States.
² Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, United States.
³ Institute for Genomic Biology, University of Illinois, Urbana, United States.
⁴ Department of Biochemistry, University of Illinois, Urbana, United States.
⁵ Department of Chemistry, University of Illinois, Urbana, United States.
⁶ Sanford Burnham Prebys Medical Discovery Institute, La Jolla, United States.
⁷ A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia.
⁸ Department of Biochemistry, Albert Einstein College of Medicine, New York, United States.
⁹ Resource for Biocomputing, Visualization and Informatics, Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, United States.
¹⁰ California Institute for Quantitative Biosciences, University of California, San Francisco, San Francisco, United States.

^# Contributed equally.

PMID: 29377793
PMCID: PMC5788505
DOI: 10.7554/eLife.31097

Sara Calhoun et al. Elife. 2018.

Abstract

The functions of most proteins are yet to be determined. The function of an enzyme is often defined by its interacting partners, including its substrate and product, and its role in larger metabolic networks. Here, we describe a computational method that predicts the functions of orphan enzymes by organizing them into a linear metabolic pathway. Given candidate enzyme and metabolite pathway members, this aim is achieved by finding those pathways that satisfy structural and network restraints implied by varied input information, including that from virtual screening, chemoinformatics, genomic context analysis, and ligand-binding experiments. We demonstrate this integrative pathway mapping method by predicting the L-gulonate catabolic pathway in Haemophilus influenzae Rd KW20. The prediction was subsequently validated experimentally by enzymology, crystallography, and metabolomics. Integrative pathway mapping by satisfaction of structural and network restraints is extensible to molecular networks in general and thus formally bridges the gap between structural biology and systems biology.

Keywords: biophysics; computational biology; enzyme function annotation; integrative pathway mapping; L-gulonate catabolic pathway; pathway prediction; structural biology; structure based pathway discovery; systems biology.

Conflict of interest statement

SC, MK, DW, BS, SZ, DR, MV, NA, HL, MO, DS, JM, DR, SA, AO, JG, BS, AS: No competing interests declared. MJ: Consultant to and stockholder of Schrodinger LLC, which licenses, develops, and distributes some of the software used in this work.

Figures

**Figure 1. Overview of integrative pathway mapping method.**
The four stages of integrative modeling are: (1) Gathering information, (2) Designing model representation and evaluation, (3) Sampling good models, and (4) Analyzing models and information. (1) Here, the input information is gathered from seven different sources used to determine the candidate proteins, such as co-localization and conservation in the genome neighborhood, and the scoring restraints (docking scores from virtual screening, chemical transformations, ensemble similarity calculations of virtual screening hits from similarity ensemble approach, DSF screening hits, metabolic endpoints, and characterized chemical reactions). (2) A pathway model is represented as a graph composed of protein and ligand nodes. Proteins are depicted as diamonds and ligands are depicted as circles, with lines showing the node patterns evaluated by a given type of information. (3) The combinatorial optimization problem is solved by Monte Carlo simulated annealing sampling, consisting of randomly swapping nodes in and out of the pathway model and rearranging the edges between the nodes. (4) The final analysis stage involves assessing the sampling, precision, and accuracy of the models.
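The sampling stage described in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function `anneal`, the geometric cooling schedule, the move probabilities, and the toy scoring function are all hypothetical, and the sketch assumes that higher scores are better. A model here is a linear pathway of enzymes alternating with ligands; each move either swaps a node in from the unused candidates or rearranges the order of nodes already in the model, and moves are accepted with the standard Metropolis criterion.

```python
import math
import random

def anneal(enzymes, ligands, score, length, steps=5000,
           t_start=1.0, t_end=0.01, seed=0):
    """Toy simulated annealing over linear pathway models.

    A model is (enzyme_order, ligand_order): `length` enzymes and
    `length + 1` ligands. Higher scores are better (an assumption).
    """
    rng = random.Random(seed)
    model = (rng.sample(enzymes, length), rng.sample(ligands, length + 1))
    current = score(model)
    best, best_score = model, current
    for step in range(steps):
        # Geometric cooling schedule (hypothetical choice).
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        enz, lig = list(model[0]), list(model[1])
        nodes, pool = (enz, enzymes) if rng.random() < 0.5 else (lig, ligands)
        i = rng.randrange(len(nodes))
        unused = [n for n in pool if n not in nodes]
        if unused and rng.random() < 0.5:
            nodes[i] = rng.choice(unused)            # swap a node in and out
        else:
            j = rng.randrange(len(nodes))
            nodes[i], nodes[j] = nodes[j], nodes[i]  # rearrange the order
        candidate = (enz, lig)
        new = score(candidate)
        delta = new - current
        # Metropolis criterion: always accept improvements, sometimes
        # accept worse models, less often as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            model, current = candidate, new
            if current > best_score:
                best, best_score = model, current
    return best, best_score

# Toy score (hypothetical): number of positions agreeing with a known pathway.
TARGET = (["E1", "E2", "E3"], ["L1", "L2", "L3", "L4"])

def toy_score(model):
    return sum(a == b for part, ref in zip(model, TARGET)
               for a, b in zip(part, ref))

best, best_score = anneal(["E1", "E2", "E3", "D1"],
                          ["L1", "L2", "L3", "L4", "X1"],
                          toy_score, length=3)
```

In a real application the toy score would be replaced by a sum over restraint terms (docking, SEA, chemical transformation, and so on); how those terms are weighted is not specified in this section.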

**Figure 1—figure supplement 2. Pfam genome neighborhood network (GNN).**
Five enzyme families are extracted from the Pfam GNN; they are identified by cluster 223 in the SSN, indicated by red circles. The Pfam families include: (A) alcohol dehydrogenases, (B) short chain dehydrogenases, (C) UxuA family sugar dehydratases, (D) pfkB family carbohydrate kinases, and (E) aldolases.

**Figure 1—figure supplement 3. NetIMP Cytoscape application for pathway model visualization.**
(A) The Cytoscape app loads good-scoring pathway models and displays them as a network built from the union of edges present in the ensemble of models. The automated yFiles hierarchic layout was applied to the network. The thickness of an edge represents the frequency with which the edge appears in the ensemble. (B) The slider in the Results Panel adjusts the score cutoff for the models included in the network. In this view, the automated yFiles hierarchic layout is reapplied and singleton nodes are hidden for clarity. (C) An individual model is selected in the Results Panel, and the nodes and edges in the individual model are highlighted in the model's unique color (blue, here) on the network. The restraints are represented by the hatched edges connecting nodes corresponding to the restraints. Restraints that are violated in the model are colored red.
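The union network with frequency-weighted edges can be sketched as follows. This is an illustration only, not the NetIMP code: the function name `edge_frequencies`, the `(score, edges)` input layout, and the convention that higher scores are better are assumptions made for the example.

```python
from collections import Counter

def edge_frequencies(models, score_cutoff):
    """Fraction of retained models containing each undirected edge.

    `models` is a list of (score, edges) pairs, where `edges` is an
    iterable of 2-tuples of node names. Models scoring below
    `score_cutoff` are dropped (higher is better, by assumption).
    """
    kept = [edges for score, edges in models if score >= score_cutoff]
    counts = Counter()
    for edges in kept:
        # Sort each pair so (a, b) and (b, a) count as the same edge,
        # and use a set so an edge counts once per model.
        counts.update({tuple(sorted(edge)) for edge in edges})
    return {edge: n / len(kept) for edge, n in counts.items()} if kept else {}

models = [
    (9.0, [("E1", "L1"), ("E1", "L2")]),
    (8.0, [("L1", "E1")]),
    (1.0, [("E2", "L3")]),
]
freqs = edge_frequencies(models, score_cutoff=5.0)
```

Raising `score_cutoff` plays the role of the slider in panel (B): fewer models are retained and the frequencies are recomputed over the smaller ensemble.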

**Figure 2. Representation of alternative models obtained based on consistency with input information provided for the glycolysis benchmark pathway.**
(A) Example of three alternative models evaluated using different types of restraints based on modeling of the glycolysis pathway with a subset of pathways shown. The restraints on node patterns are shown using colored lines (blue – docking restraints, green – SEA restraints, purple – chemical transformation restraints, red – restraints with unfavorable scores). Metabolites are labeled by KEGG ID and enzymes are labeled by step in the glycolysis pathway. On the left, alternate model one is consistent with docking scores, but not with all SEA scores and chemical transformations. In the middle, alternate model two is consistent with the docking scores and SEA scores, but not with chemical transformations. On the right, alternate model three is consistent with docking scores, SEA scores, and chemical transformations, thus increasing the rank of the correct enzyme-substrate pairings. (B) Alternative models shown with chemical structures. (C) Ranks of the correct substrate for the corresponding enzyme at each step in the glycolysis benchmark case. 1 – glucokinase, 2 – phosphoglucose isomerase, 3 – phosphofructokinase, 4 – fructose bisphosphate aldolase, 5 – triosephosphate isomerase, 6 – glyceraldehyde 3-phosphate dehydrogenase, 7 – phosphoglycerate kinase, 8 – phosphoglycerate mutase, 9 – enolase, and 10 – pyruvate kinase.

**Figure 2—figure supplement 1. Benchmark assessment for decoy and dummy enzymes.**
(A) Two decoy enzymes were included with the four enzymes in the CMP KDO-8P biosynthesis pathway, with lengths from three to six enzymes sampled. Pathway models with the best score are shown for different numbers of protein pathway members. For comparison, the correct pathway is outlined in green. (B) Scores of pathway models at different pathway lengths. The top-scoring models consisted of the four known CMP KDO-8P pathway enzymes. The best score at each pathway length is shown as a blue circle, the score of the pathway model that matches the correct pathway is shown as a green triangle, and all other scores as black dots. The cutoff for good-scoring models, which is two standard deviations below the best score, is shown as a red dashed line. (C) Enzymes in the same pathway often cluster together in dendrograms constructed based on the SEA score as a distance metric. Such clustering is illustrated for glycolysis here. (D) Assessments of the three benchmark pathways for which the candidate enzyme set is incomplete. In each pathway, one enzyme is replaced with a 'dummy' enzyme, for which there is a lack of input information. For serine biosynthesis, the correct pathway remained the top-scoring model. For the other cases, the inclusion of the dummy enzyme lowered the correct pathway ranking, which, nevertheless, remained within the top-scoring models.
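The cutoff for good-scoring models in panel (B) can be sketched as a one-line calculation. The caption does not say which set of scores the standard deviation is taken over, or whether it is the population or sample form; the sketch below assumes the population standard deviation over all sampled scores and that higher scores are better. The function name `good_scoring` is hypothetical.

```python
import statistics

def good_scoring(scores, n_sd=2.0):
    """Return (cutoff, retained scores) for a best-minus-n-SD threshold.

    Assumes higher scores are better and uses the population standard
    deviation of all supplied scores (both are assumptions).
    """
    best = max(scores)
    cutoff = best - n_sd * statistics.pstdev(scores)
    return cutoff, [s for s in scores if s >= cutoff]

cutoff, good = good_scoring([10, 9, 8, 2, 1, 0])
```

For this toy list the standard deviation is about 4.08, so the cutoff falls near 1.84 and the four highest scores are retained.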

**Figure 3. 12 representative predictions of the L-gulonate TRAP-SBP catabolic pathway.**
(A) 12 representative pathway models of TRAP SBP pathway predictions ordered by score, starting from the top with the best-scored prediction. The scores of the representative pathways are listed to the right of the corresponding pathway. Pathway enzymes are labeled by numbers as follows: 1 – HiGulD, 2 – HiUxuB, 3 – HiUxuA, 4 – HiKdgK, 5 – HiKdgA. (B) Graphical representation of an ensemble of representative pathway models. The predicted components in the ensemble of pathway models at each position are vertically aligned to the corresponding position in the gray pathway on the top. Ligand components are shown as circle nodes with the color corresponding to the ligand identity. Chemical structures are shown in Figure 3—figure supplement 2. Pathway enzymes are shown as diamond nodes with the same numbering as above. Edges are colored by individual pathway model prediction. The validated prediction is shown by black edges, enzyme nodes are colored black, and substrate/product nodes are outlined in black.

**Figure 3—figure supplement 1. Sampling convergence test.**
Independent Monte Carlo sampling runs were performed, and the number of clusters of similar pathways for each number of runs was computed. (A) Glycolysis pathway, (B) CMP KDO-CMP biosynthesis pathway, (C) Serine synthesis pathway, and (D) L-gulonate catabolism pathway. (E) Probability of acceptance at the MC step in a sampling run, where D is the difference between the scores of the current pathway model and the new pathway model.

**Figure 3—figure supplement 2. Chemical structures for top scoring pathway model predictions.**
The colored nodes correspond to the coloring in Figure 3.

**Figure 4. Catabolic pathway of *H. influenzae* Rd KW20.**
(A) The best-scoring pathway identified using the integrative mapping approach is annotated with experimental evidence: enzyme activity (blue), fitness growth determinants (red), transcript analyses on L-gulonate media (orange), atomic structure (green), and isotopic metabolic labeling (purple). The pathway demonstrates L-gulonate degradation into glyceraldehyde 3-phosphate and pyruvate. Bonds undergoing changes in the subsequent steps are colored in red. (B) Kinetics of pathway enzymes on predicted substrates. (C) Crystal structure of L-gulonate bound to SBP TRAP (PDB ID: 4PBQ). (D) Knockout growth assays of *H. influenzae* strains, ΔGulP (gulonate transporter periplasmic subunit) and ΔGulD (L-gulonate dehydrogenase), when grown on D-glucose vs. L-gulonate as a sole carbon source. (E) Fold change in expression for each gene when grown on the indicated carbon source, relative to growth on glucose. Error bars indicate one standard deviation for three biological replicates.

**Figure 4—figure supplement 1. Isotopic labeling of L-gulonate as sole carbon source.**
(A) Time-dependent labeling of central metabolites utilizing 50% U-¹³C-L-gulonate as sole carbon source. 3PG, 3-phosphoglycerate; PEP, phosphoenolpyruvate. (B) Time-dependent labeling of fructuronate utilizing 50% U-¹³C-L-gulonate as sole carbon source. (C) L-gulonate catabolism (orange) enters central metabolism (blue) of *Haemophilus influenzae* str. Rd KW20 at glyceraldehyde 3-phosphate and pyruvate. This metabolic network also outlines the truncated TCA cycle found in *Haemophilus influenzae* (Othman et al., 2014). Metabolites detectable via GC-MS are marked with an asterisk (*).

**Figure 4—figure supplement 2. Comparative genomic reconstruction of L-gulonate and related uronic acid catabolic pathways and regulons in gammaproteobacteria.**
(A) Genomic context of L-gulonate, D-glucuronate, and L-galactonate utilization genes and regulons in 45 bacteria from the Pasteurellales lineage. The L-gulonate catabolic pathway from *H. influenzae* Rd KW20 was projected onto the pathogenic gammaproteobacteria, using the subsystems approach in the SEED genomic platform (Overbeek et al., 2005). Analysis of the adjacent metabolic pathways for use of hexuronates (D-glucuronate, D-galacturonate) and another 6-carbon aldonic acid (L-galactonate) revealed that the signature enzymes for the L-gulonate and the D-glucuronate (GlcA) pathways are the L-gulonate dehydrogenase, GulD, and the GlcA isomerase, UxaC, respectively. Predicted sugar utilization phenotypes are given in parentheses. Among 21 analyzed strains of *H. influenzae*, 17 strains possess the uronate catabolic genes. Genes with the same functional roles are marked in matching colors. Pairwise similarity between orthologs in different *H. influenzae* strains is highlighted by light green (>95% protein identity), yellow (>50%), and pink (<50%). Tripartite TRAP transporters are shown by magenta arrows. Experimentally defined specificities of TRAP SBPs are indicated for *gulP/uxuP/lgoP* (LGul, L-gulonate; GlcA, D-glucuronate; LGal, L-galactonate); these genes are outlined in red and their locus tags are given below. Transcriptional regulators are shown by black arrows. Predicted DNA-binding sites of GulR and UxuR regulators are shown by black pins, and their common DNA motif is shown as a logo. For reconstruction of UxaR and GulR regulons, we used an established comparative genomics approach, based on identification of candidate regulator-binding sites using the RegPredict tool (Novichkov et al., 2010). LGul pathway gene organization is conserved in 8 out of 21 *H. influenzae* strains into a single genetic locus containing two divergently transcribed operons, *gulD-gulPQM-kdgK-uxuB-kdgA* and *gulR-uxuA*, that are controlled by candidate GulR-binding sites in their common promoter region. Glucuronide hydrolases from different families are shown by white arrows. (B) Genomic context of D-glucuronate and D-galacturonate utilization genes and regulons in *E. coli* K-12. (C) Reconstructed metabolic pathways for utilization of hexuronic and aldonic acids. Solid and broken arrows indicate enzymatic reactions and transport, respectively. The TRAP and MFS transporter families are indicated by red stars and squares, respectively. (D) Genomic context of GulD orthologs in Enterobacteria. The uncharacterized genes *gulT* and *gulR-II* encode a novel MFS-transporter and a GntR-family regulator that are likely specific to L-gulonate. (E) Phylogenetic tree for selected zinc-dependent dehydrogenases from the COG1063 family. The analyzed GulD orthologs from the Pasteurellales and Enterobacteria are 62% identical to each other, while within the Pasteurellales lineage all GulDs are >70% identical. The previously characterized L-gulonate-specific dehydrogenases from *E. coli* K-12 and *Halomonas* spp. (Wichelecki et al., 2014) belong to the RspB branch, which is distinct from the GulD branch established in this work.
