Learning generative models for protein fold families

Sivaraman Balakrishnan¹, Hetunandan Kamisetty, Jaime G Carbonell, Su-In Lee, Christopher James Langmead


PMID: 21268112
DOI: 10.1002/prot.22934


Sivaraman Balakrishnan et al. Proteins. 2011 Apr.

Proteins. 2011 Apr;79(4):1061-78.

doi: 10.1002/prot.22934. Epub 2011 Jan 25.


Affiliation

¹ Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.


Abstract

We introduce a new approach to learning statistical models from multiple sequence alignments (MSAs) of proteins. Our method, called GREMLIN (Generative REgularized ModeLs of proteINs), learns an undirected probabilistic graphical model of the amino acid composition within the MSA. The resulting model encodes both the position-specific conservation statistics and the correlated mutation statistics between sequential and long-range pairs of residues. Existing techniques for learning graphical models from MSAs either make strong, and often inappropriate, assumptions about the conditional independencies within the MSA (e.g., Hidden Markov Models), or else use suboptimal algorithms to learn the parameters of the model. In contrast, GREMLIN makes no a priori assumptions about the conditional independencies within the MSA. We formulate and solve a convex optimization problem, thus guaranteeing that we find a globally optimal model at convergence. The resulting model is also generative, allowing for the design of new protein sequences that have the same statistical properties as those in the MSA. We perform a detailed analysis of covariation statistics on the extensively studied WW and PDZ domains and show that our method outperforms an existing algorithm for learning undirected probabilistic graphical models from MSAs. We then apply our approach to 71 additional families from the PFAM database and demonstrate that the resulting models significantly outperform Hidden Markov Models in terms of predictive accuracy.
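To make the two kinds of statistics named in the abstract concrete, the following minimal Python sketch computes per-column residue frequencies (conservation) and a pairwise covariation score on a toy alignment. It is an illustration only, not GREMLIN: the alignment `TOY_MSA`, the helper names, and the use of mutual information as the covariation measure are all assumptions made for this example, and the abstract does not specify how the model's parameters are computed.

```python
from collections import Counter
from math import log2

# Hypothetical toy alignment: 6 sequences, 4 columns.
# Columns 0 and 3 co-vary perfectly; column 1 is fully conserved.
TOY_MSA = [
    "AGLK",
    "AGIK",
    "AGLK",
    "DGIE",
    "DGLE",
    "DGIE",
]


def column_frequencies(msa, i):
    """Relative frequency of each residue in column i (conservation)."""
    counts = Counter(seq[i] for seq in msa)
    return {aa: c / len(msa) for aa, c in counts.items()}


def pair_frequencies(msa, i, j):
    """Relative frequency of each residue pair in columns i and j."""
    counts = Counter((seq[i], seq[j]) for seq in msa)
    return {pair: c / len(msa) for pair, c in counts.items()}


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j.

    Used here as a simple stand-in for a covariation statistic;
    it is an assumption of this sketch, not the paper's method.
    """
    fi = column_frequencies(msa, i)
    fj = column_frequencies(msa, j)
    fij = pair_frequencies(msa, i, j)
    return sum(p * log2(p / (fi[a] * fj[b])) for (a, b), p in fij.items())


if __name__ == "__main__":
    for col in range(len(TOY_MSA[0])):
        print("column", col, column_frequencies(TOY_MSA, col))
    print("MI(0, 3) =", mutual_information(TOY_MSA, 0, 3))
    print("MI(0, 1) =", mutual_information(TOY_MSA, 0, 1))
```

In this toy data the fully conserved column carries no covariation signal, while the perfectly coupled pair of columns scores one bit. A pairwise score like this treats each column pair in isolation, whereas the abstract describes a single joint model over all positions.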
