. 2014 Mar;21(3):269-86.

doi: 10.1089/cmb.2013.0099. Epub 2014 Feb 4.

A Bayesian sampler for optimization of protein domain hierarchies

Andrew F Neuwald¹

Affiliations

PMID: 24494927
PMCID: PMC3948484
DOI: 10.1089/cmb.2013.0099

A Bayesian sampler for optimization of protein domain hierarchies

Andrew F Neuwald. J Comput Biol. 2014 Mar.

. 2014 Mar;21(3):269-86.

doi: 10.1089/cmb.2013.0099. Epub 2014 Feb 4.

Author

Andrew F Neuwald¹

Affiliation

¹ Institute for Genome Sciences and Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine , Baltimore, Maryland.

PMID: 24494927
PMCID: PMC3948484
DOI: 10.1089/cmb.2013.0099

Abstract

The process of identifying and modeling functionally divergent subgroups for a specific protein domain class and arranging these subgroups hierarchically has, thus far, largely been done via manual curation. How to accomplish this automatically and optimally is an unsolved statistical and algorithmic problem that is addressed here via Markov chain Monte Carlo sampling. Taking as input a (typically very large) multiple-sequence alignment, the sampler creates and optimizes a hierarchy by adding and deleting leaf nodes, by moving nodes and subtrees up and down the hierarchy, by inserting or deleting internal nodes, and by redefining the sequences and conserved patterns associated with each node. All such operations are based on a probability distribution that models the conserved and divergent patterns defining each subgroup. When we view these patterns as sequence determinants of protein function, each node or subtree in such a hierarchy corresponds to a subgroup of sequences with similar biological properties. The sampler can be applied either de novo or to an existing hierarchy. When applied to 60 protein domains from multiple starting points in this way, it converged on similar solutions with nearly identical log-likelihood ratio scores, suggesting that it typically finds the optimal peak in the posterior probability distribution. Similarities and differences between independently generated, nearly optimal hierarchies for a given domain help distinguish robust from statistically uncertain features. Thus, a future application of the sampler is to provide confidence measures for various features of a domain hierarchy.

PubMed Disclaimer

Figures

<b>FIG. 1.</b> — **FIG. 1.**
**(A)** Idealized hierarchical model of protein sequence divergence. In this example, potentially thousands of sequences within a large protein class are modeled as 12 subgroups. Intermediate and leaf nodes are shown in brown and green, respectively. **(B)** Schematic drawing of a partitioned alignment (termed a *contrast alignment*) corresponding to the BPPS statistical model. Aligned sequences are assigned to either a “foreground” or a “background” partition (blue and maroon horizontal bars, respectively). Partitioning is based on the conservation of foreground residues (blue vertical bars) that diverge from (or contrast with) the background residues at those positions (white vertical bars). Red vertical bar heights quantify the selective pressure imposed on divergent residue positions. Note that the probability distribution in Equation 1 sums over k aligned columns and over n foreground and background sequences, as indicated. **(C)** Foreground subtree (blue nodes) corresponding to the foreground of the contrast alignment in (B). The rest of the subtree rooted at the parent of the foreground subtree corresponds to the background (maroon nodes). Thus, the collection of such hierarchically arranged contrast alignments (one for each subtree with the tree) identifies the distinguishing patterns associated with functional divergence of the protein class. The sequences associated with node 5, for example, conserve distinguishing patterns corresponding to their membership in the subfamily, family, superfamily, and class represented by the subtrees rooted at nodes 5, 4, 2, and 1, respectively. For the contrast alignment corresponding to the main root (node 1), the background partition is defined by random sequences (not shown). BPPS, Bayesian Partitioning with Pattern Selection.

<b>FIG. 2.</b> — **FIG. 2.**
Operations for adding, moving, and deleting nodes. **(A)** The AddLeaf() operation. This operation is applied to both internal nodes (brown) and existing leaf nodes (green) and can construct a complex hierarchy by repeatedly sampling new leaf nodes. In each step, the resultant hierarchy is subsequently optimized by omcBPPS sampling over sequence assignments and patterns. Nodes selected for spawning off a new leaf node (in the first three steps) are haloed in red. **(B)** The MoveUp() and MoveDown() operations. The nodes in the subtree being moved are shown in blue. **(C)** The Insert() and Delete() operations. Insert() adds a new internal node (shown in blue) into an existing hierarchy. Delete() can remove either an internal node (as shown) or a leaf node.

<b>FIG. 3.</b> — **FIG. 3.**
Correspondence between a tree and an H-table. **(A)** The tree from Figure 1C representing the hierarchical relationships between protein subgroups highlighting the foreground (blue nodes) and background (maroon nodes) nodes for the subtree rooted at node 4. Nonparticipating nodes are shown in gray. **(B)** The corresponding H-table in standard tree format. Names of leaf and internal nodes are terminated by exclamation and question marks respectively. The node 4 subtree is highlighted in the table to correspond to the tree in (A). **(C)** A scrambled version of the table shown in (B), which the Sort() operation unscrambles.

<b>FIG. 4.</b> — **FIG. 4.**
**(A)** Plot of hierarchy LLRs for the 60 domains listed in Table 1. Hierarchies were created by the omcBPPS sampler initialized from three different starting points: *de novo*, □; from a hierarchy created heuristically (Neuwald et al., 2012), ◯; and from a manually curated CDD hierarchy, △. The line corresponds to the average for the three methods. **(B)** Plot of the numbers of nodes obtained for each hierarchy. A scatter plot (inset) of the RSD for the LLR versus the RSD for the numbers of nodes reveals an outlier (top right corner) corresponding to cd08304. CDD, Conserved Domain Database; LLRs, log-likelihood ratios; omcBPPS, optimal multiple-category BPPS; RSD, relative standard deviation.

<b>FIG. 5.</b> — **FIG. 5.**
Contributions of individual nodes to the second level of the curated versus optimal hierarchies for globin-like domains (cd01067). Nodes shown in green and red contribute (positively or negatively, respectively) to the LLR at the second or lower levels of the hierarchy, as plotted below each hierarchy. Other nodes are shown in blue. **(A)** Manually curated hierarchy (LLR score: 126,945 nats; 43 nodes). Numbers are labeled with their CD identifiers. **(B)** Optimized hierarchy obtained using the curated CDD hierarchy as a starting point (223,561 nats; 55 nodes).

<b>FIG. 6.</b> — **FIG. 6.**
Example of a robust hierarchy; the oxidoreductase/nitrogenase domain (cd00316). **(A)** Hierarchy generated *de novo* (96,708 nats; 43 nodes). **(B)** Hierarchy obtained using the curated CDD hierarchy as a starting point (99,377 nats; 45 nodes). The correspondence between nodes of the hierarchies in (A) and in (C) to nodes in the hierarchy in (B) is indicated using the color code defined in the box. **(C)** Hierarchy obtained using a heuristic hierarchy as the starting point (98,708 nats; 46 nodes). **(D)** The CDD manually curated hierarchy (71482 nats; 16 nodes). **(E)** The hierarchy shown in (B) but re-colored for comparison with the NCBI manually curated hierarchy in (D). The light-red nodes are nodes that fail to correspond to a node in (D) and that are less diverse than the least diverse node in (D). Numbers in parentheses correspond to the number of phyla represented by the sequences assigned to those nodes. For comparison, the nodes in the manually curated hierarchy all have sequences from at least four phyla. NCBI, National Center for Biotechnology Information.

<b>FIG. 7.</b> — **FIG. 7.**
Comparison of the hierarchies obtained for the catalytic domain of phosphoinositide 3-kinase (PI3Kc)-like proteins. **(A)** Hierarchy obtained using the curated CDD hierarchy (cd00142) as a starting point (51,784 nats; 33 nodes). Node numbering likewise corresponds to the node numbering in (B). **(B)** A hierarchy generated *de novo* (52,990 nats; 36 nodes). **(C)** Optimal hierarchy obtained using a heuristic hierarchy as the starting point (52,650 nats; 36 nodes).

<b>FIG. 8.</b> — **FIG. 8.**
Hypothetical illustration of patterns inconsistent with a single tree. The same four leaf nodes, each of which harbor two out of four properties (green, yellow, blue, and red), may be arranged in two different ways, where, for each tree, intermediate nodes model only one of the conserved properties.

See this image and copyright information in PMC

Cited by

Tracing the origin and evolution of pseudokinases across the tree of life.
Kwon A, Scott S, Taujale R, Yeung W, Kochut KJ, Eyers PA, Kannan N. Kwon A, et al. Sci Signal. 2019 Apr 23;12(578):eaav3810. doi: 10.1126/scisignal.aav3810. Sci Signal. 2019. PMID: 31015289 Free PMC article.
Deep Analysis of Residue Constraints (DARC): identifying determinants of protein functional specificity.
Tondnevis F, Dudenhausen EE, Miller AM, McKenna R, Altschul SF, Bloom LB, Neuwald AF. Tondnevis F, et al. Sci Rep. 2020 Feb 3;10(1):1691. doi: 10.1038/s41598-019-55118-6. Sci Rep. 2020. PMID: 32015389 Free PMC article.
Inferring joint sequence-structural determinants of protein functional specificity.
Neuwald AF, Aravind L, Altschul SF. Neuwald AF, et al. Elife. 2018 Jan 16;7:e29880. doi: 10.7554/eLife.29880. Elife. 2018. PMID: 29336305 Free PMC article.
Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties.
Neuwald AF, Altschul SF. Neuwald AF, et al. PLoS Comput Biol. 2016 May 18;12(5):e1004936. doi: 10.1371/journal.pcbi.1004936. eCollection 2016 May. PLoS Comput Biol. 2016. PMID: 27192614 Free PMC article.
Evolution of Functional Diversity in the Holozoan Tyrosine Kinome.
Yeung W, Kwon A, Taujale R, Bunn C, Venkat A, Kannan N. Yeung W, et al. Mol Biol Evol. 2021 Dec 9;38(12):5625-5639. doi: 10.1093/molbev/msab272. Mol Biol Evol. 2021. PMID: 34515793 Free PMC article.

See all "Cited by" articles

References

1. Abascal F., and Valencia A.2002. Clustering of proximal sequence space for the identification of protein families. Bioinformatics 18, 908–921 - PubMed
1. Engelhardt B.E., Jordan M.I., Srouji J.R., et al. . 2011. Genome-scale phylogenetic function annotation of large and diverse protein families. Genome Res. 21, 1969–1980 - PMC - PubMed
1. Finn R.D., Tate J., Mistry J., et al. . 2008. The Pfam protein families database. Nucleic Acids Res. 36, D281–D288 - PMC - PubMed
1. Hastings W.K.1970. Monte Carlo sampling methods using Markov Chains and their applications. Biometrika 57, 97–109
1. Kirkpatrick S., Gelatt C.D., and Vecchi M.P.1983. Optimization by simulated annealing. Science 220, 671–680 - PubMed

Publication types

Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

HHSN2630000999571/PHS HHS/United States

LinkOut - more resources

Full Text Sources
Other Literature Sources
- scite Smart Citations

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

A Bayesian sampler for optimization of protein domain hierarchies

Affiliation

A Bayesian sampler for optimization of protein domain hierarchies

Author

Affiliation

Abstract

Figures

Similar articles

Cited by

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources

Abstract

Figures

Similar articles

Cited by

References

Publication types

MeSH terms

Related information

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources