Maximum-Likelihood Phylogenetic Inference with Selection on Protein Folding Stability

Miguel Arenas¹, Agustin Sánchez-Cobos¹, Ugo Bastolla²

Affiliations

¹ Department of Cell Biology and Immunology, Centro de Biología Molecular Severo Ochoa (CSIC-UAM), Universidad Autónoma de Madrid, Madrid, Spain.
² Department of Cell Biology and Immunology, Centro de Biología Molecular Severo Ochoa (CSIC-UAM), Universidad Autónoma de Madrid, Madrid, Spain. ubastolla@cbm.csic.es

PMID: 25837579
PMCID: PMC4833071
DOI: 10.1093/molbev/msv085

Mol Biol Evol. 2015 Aug;32(8):2195-207. doi: 10.1093/molbev/msv085. Epub 2015 Apr 2.

Abstract

Despite intense work, incorporating constraints on protein native structures into mathematical models of molecular evolution remains difficult, because most models and programs assume that protein sites evolve independently, whereas protein stability is maintained by interactions between sites. Here, we address this problem by developing a new mean-field substitution model that generates independent site-specific amino acid distributions with constraints on the stability of the native state against both unfolding and misfolding. The model depends on a background distribution of amino acids and one selection parameter that we fix by maximizing the likelihood of the observed protein sequence. The analytic solution of the model shows that the main determinant of the site-specific distributions is the number of native contacts of the site and that the most variable sites are those with an intermediate number of native contacts. The mean-field models that take misfolded conformations into account yield a larger likelihood than models that consider only the native state, because their average hydrophobicity is more realistic, and on average they produce stable sequences for most proteins. We evaluated the mean-field model against empirical substitution models on 12 test data sets of different protein families. In all cases, the observed site-specific sequence profiles showed a smaller Kullback-Leibler divergence from the mean-field distributions than from the empirical substitution model. Next, we obtained substitution rates by combining the mean-field frequencies with an empirical substitution model. The resulting mean-field substitution model assigns a larger likelihood than the empirical model to all studied families when we consider sequences with identity greater than 0.35, plausibly a condition that enforces conservation of the native structure across the family.
We found that the mean-field model performs better than other structurally constrained models of similar or higher complexity. Compared with the much more complex model recently developed by Bordner and Mittelmann, which takes into account pairwise terms in the amino acid distributions and also optimizes the exchangeability matrix, our model performed worse for data with small sequence divergence but better for data with larger sequence divergence. The mean-field model has been implemented in the computer program Prot_Evol, which is freely available at http://ub.cbm.uam.es/software/Prot_Evol.php.
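
As a rough illustration of the idea of site-specific distributions controlled by a background distribution and a single selection parameter, the sketch below reweights background frequencies by a Boltzmann-like factor and picks the parameter that maximizes the likelihood of one observed sequence. Everything specific here is an assumption made for illustration and is not taken from the paper or from Prot_Evol: the functional form `exp(-lam * energy)`, the two-letter alphabet, the hydrophobicity scale, the contact counts, and the grid search. In particular, this toy energy uses only native contacts and ignores misfolded conformations, which the abstract identifies as important.

```python
import math

def site_distribution(background, energies, lam):
    """Reweight background frequencies by exp(-lam * energy) and normalize."""
    weights = {a: background[a] * math.exp(-lam * energies[a]) for a in background}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}

def log_likelihood(sequence, site_energies, background, lam):
    """Sum of log-probabilities of the observed residue at each site."""
    return sum(
        math.log(site_distribution(background, energies, lam)[residue])
        for residue, energies in zip(sequence, site_energies)
    )

def fit_selection_parameter(sequence, site_energies, background, grid):
    """Return the grid value of lam with the highest log-likelihood."""
    return max(
        grid,
        key=lambda lam: log_likelihood(sequence, site_energies, background, lam),
    )

# Hypothetical toy data: H = hydrophobic, P = polar.
background = {"H": 0.5, "P": 0.5}
hydrophobicity = {"H": 1.0, "P": 0.0}  # assumed scale, for illustration only
contacts = [1, 2, 6, 6]                # assumed native contacts per site
sequence = "HPHH"                      # assumed observed sequence

# Assumed toy energy: more contacts reward hydrophobic residues more strongly.
site_energies = [
    {a: -c * hydrophobicity[a] for a in background} for c in contacts
]

grid = [0.05 * k for k in range(61)]   # candidate values from 0 to 3
lam_hat = fit_selection_parameter(sequence, site_energies, background, grid)
print(f"fitted selection parameter: {lam_hat:.2f}")
```

With `lam = 0` every site reduces to the background distribution; larger values concentrate each site on the residues favored by its toy energy. A grid search is used only to keep the sketch dependency-free.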

Keywords: folding stability; maximum-likelihood estimate; misfolded state; structurally constrained substitution models.

Figures

**Fig. 1.**
Site-specific average hydrophobicity (left) and entropy (right) of the MF distributions as a function of the number of native contacts for the protein with PDB code 153l. As expected, there is a very strong correlation between hydrophobicity and number of contacts and the entropy reaches a maximum at an intermediate number of contacts.

**Fig. 2.**
Left: Log likelihood of various MF models as a function of the log likelihood of the purely mutation model. Each point represents a protein. Right: Mean log likelihood of the five types of MF models. The plotted statistical errors show that differences are significant except for the two rightmost bars.

**Fig. 3.**
Left: Average hydrophobicity of the MF models versus the average hydrophobicity of the PDB sequence. Each point represents a protein. Right: Average folding free energy (native minus misfolded) $\overline{\Delta G}$ of the MF models versus the average folding free energy of the PDB sequence. $\overline{\Delta G} < 0$ means that the MF model describes proteins that are stable on average. Each point represents a protein.

**Fig. 4.**
Mean log-likelihood of the proteins in the test set with respect to the model $P_{a}^{MF 2, i}$ versus the temperature in arbitrary units set by our contact interaction energy function.

**Fig. 5.**
Difference of KL divergence from the observed amino acid profile between the empirical model and the MF model (KLDobs_emp–KLDobs_mf) for the 12 studied protein families, under different conditions on the minimum sequence identity allowed. Positive differences mean that the observed profile agrees better with the MF model than with the empirical model.
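
To make the quantity in this caption concrete, here is a minimal sketch of the Kullback-Leibler divergence between an observed amino acid profile at one site and a model distribution, followed by the difference KLDobs_emp - KLDobs_mf. The three-letter alphabet and all frequencies are made-up illustration values, not data from the paper, and the variable names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats; terms with p[a] == 0 contribute zero."""
    total = 0.0
    for a, pa in p.items():
        if pa > 0.0:
            if q[a] <= 0.0:
                raise ValueError(f"model assigns zero probability to {a!r}")
            total += pa * math.log(pa / q[a])
    return total

# Hypothetical site profile and two hypothetical model distributions.
observed = {"A": 0.7, "L": 0.2, "K": 0.1}
mf_model = {"A": 0.6, "L": 0.3, "K": 0.1}
emp_model = {"A": 1 / 3, "L": 1 / 3, "K": 1 / 3}

kld_obs_emp = kl_divergence(observed, emp_model)
kld_obs_mf = kl_divergence(observed, mf_model)
difference = kld_obs_emp - kld_obs_mf
print(f"KLDobs_emp - KLDobs_mf = {difference:.3f}")
```

A positive difference means the observed profile is closer to the second model than to the first, which is how the sign is read in the caption above.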
