Phylogenetic mixture models for proteins

Si Quang Le¹, Nicolas Lartillot, Olivier Gascuel

PMID: 18852096
PMCID: PMC2607422
DOI: 10.1098/rstb.2008.0180

Si Quang Le et al. Philos Trans R Soc Lond B Biol Sci. 2008 Dec 27;363(1512):3965-76. doi: 10.1098/rstb.2008.0180.

Affiliation

¹ Méthodes et Algorithmes pour Bioinformatique, LIRMM, CNRS - Université Montpellier II, 161 rue Ada, 34392 Montpellier Cedex 5, France.

Abstract

Standard protein substitution models use a single amino acid replacement rate matrix that summarizes the biological, chemical and physical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors: genetic code, solvent exposure, secondary and tertiary structure, protein function, etc. These factors affect the substitution pattern and, in most cases, a single replacement matrix is not enough to represent the full complexity of the evolutionary processes. This paper explores, in a maximum-likelihood framework, phylogenetic mixture models that combine several amino acid replacement matrices to better fit protein evolution. We learn these mixture models from a large alignment database extracted from HSSP, and test their performance using independent alignments from TREEBASE. We compare unsupervised learning approaches, where the site categories are unknown, to supervised ones, where estimation uses the known category of each site, based on its exposure or its secondary structure. All our models are combined with gamma-distributed rates across sites. Results show that highly significant likelihood gains are obtained when using mixture models compared with the best available single replacement matrices. Mixtures of matrices also improve over mixtures of profiles in the manner of the CAT model. The unsupervised approach tends to be better than the supervised one, but it appears difficult to implement and highly sensitive to the starting values of the parameters, meaning that the supervised approach is still of interest for initialization and model comparison. Using an unsupervised model involving three matrices, the average AIC gain per site with TREEBASE test alignments is 0.31, 0.49 and 0.61 compared with LG (named after Le & Gascuel 2008 Mol. Biol. Evol. 25, 1307-1320), WAG and JTT, respectively. This three-matrix model is significantly better than LG for 34 alignments (among 57), and significantly worse for only 1 alignment. Moreover, tree topologies inferred with our mixture models frequently differ from those obtained with single matrices, indicating that using these mixtures affects not only the likelihood value but also the output tree. All our models and a PhyML implementation are available from http://atgc.lirmm.fr/mixtures.
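To make the idea of a mixture of replacement matrices concrete, the following minimal Python sketch shows how per-site likelihoods computed under each matrix could be combined into a mixture likelihood. It is an illustration under stated assumptions, not the authors' PhyML implementation: the function names, the weights and the numeric values are hypothetical, and each input likelihood is assumed to have been computed elsewhere (for a fixed tree, already averaged over the gamma rate categories).

```python
import math


def mixture_log_likelihood(site_likelihoods, weights):
    """Log-likelihood of an alignment under a mixture of replacement matrices.

    site_likelihoods[i][k] is the likelihood of site i under matrix k
    (assumed precomputed, e.g. by a pruning algorithm on a fixed tree).
    weights[k] is the mixture weight of matrix k; the weights must sum to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    total = 0.0
    for per_matrix in site_likelihoods:
        # The site category is unknown, so sum over all matrices.
        site_likelihood = sum(w * l for w, l in zip(weights, per_matrix))
        total += math.log(site_likelihood)
    return total


def matrix_posteriors(per_matrix, weights):
    """Posterior probability that one site evolved under each matrix."""
    joint = [w * l for w, l in zip(weights, per_matrix)]
    norm = sum(joint)
    return [j / norm for j in joint]


# Hypothetical values: two sites, three matrices.
example_sites = [[0.02, 0.01, 0.005], [0.001, 0.004, 0.002]]
example_weights = [0.5, 0.3, 0.2]
print(mixture_log_likelihood(example_sites, example_weights))
print(matrix_posteriors(example_sites[0], example_weights))
```

In a supervised setting, where the category of each site is known, the sum over matrices would reduce to the single likelihood of the matrix assigned to that site.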

Figures

**Figure 1**
AIC/site gain compared to LG. Note. All models are compared to LG. Negative gains (JTT and WAG) mean that the models are worse than LG, while positive gains correspond to (mixture) models that improve on LG. The gains are provided for all 57 TreeBase test alignments (white bars) and the 8 alignments with saturation index per site larger than 2 (black bars).
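As a rough illustration of the quantity plotted in Figure 1, the sketch below computes an AIC gain per site from two log-likelihoods, using the standard definition AIC = 2k - 2 ln L. The function names and the numbers are hypothetical, and which parameters count as free for each model is an assumption left to the caller; the abstract does not specify it.

```python
def aic(log_likelihood, n_free_params):
    """Akaike information criterion (lower is better)."""
    return 2.0 * n_free_params - 2.0 * log_likelihood


def aic_gain_per_site(lnl_model, k_model, lnl_reference, k_reference, n_sites):
    """AIC improvement of a model over a reference model, divided by the
    number of alignment sites. Positive values mean the model is preferred
    to the reference; negative values mean it is worse."""
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    return (aic(lnl_reference, k_reference) - aic(lnl_model, k_model)) / n_sites


# Hypothetical numbers: a 300-site alignment, reference versus mixture model.
gain = aic_gain_per_site(
    lnl_model=-9950.0, k_model=44,
    lnl_reference=-10000.0, k_reference=40,
    n_sites=300,
)
print(gain)
```

Dividing by the number of sites makes gains comparable across alignments of different lengths, which is what allows them to be averaged over a test set.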

**Figure 2**
Number of alignments with better/worse likelihood values than LG. Note. Number of alignments (among the 57 TreeBase test alignments) where each model provides a better (positive side) and a worse (negative side) likelihood value than LG. The black bars correspond to the numbers of significant differences using the Kishino–Hasegawa test with p<0.01. White bars correspond to non-significant differences.
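The significance call behind the black bars in Figure 2 can be sketched with the normal-approximation form of the Kishino–Hasegawa test, which compares two models through their per-site log-likelihood differences. This is an illustrative sketch only: the function name and input values are hypothetical, and the exact variant of the test used in the paper is not stated in this record.

```python
import math


def kh_test_normal(site_lnl_a, site_lnl_b):
    """Normal-approximation Kishino-Hasegawa test on per-site log-likelihoods.

    Returns (delta, z, p) where delta is the total log-likelihood difference
    (model A minus model B), z is the standardized difference and p is the
    two-sided p-value under a standard normal approximation.
    """
    if len(site_lnl_a) != len(site_lnl_b) or len(site_lnl_a) < 2:
        raise ValueError("need two per-site vectors of equal length >= 2")
    diffs = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    n = len(diffs)
    delta = sum(diffs)
    mean = delta / n
    # Variance of the summed difference, estimated from the site differences.
    var_delta = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    if var_delta == 0.0:
        raise ValueError("per-site differences have zero variance")
    z = delta / math.sqrt(var_delta)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return delta, z, p


# Hypothetical per-site log-likelihoods for a mixture model and for LG.
lnl_mixture = [-1.0, -2.0, -1.5, -3.0]
lnl_lg = [-1.2, -2.1, -1.9, -3.1]
delta, z, p = kh_test_normal(lnl_mixture, lnl_lg)
print(delta, z, p, p < 0.01)
```

Counting, over a set of test alignments, how many comparisons give a positive or negative delta and how many of those reach p<0.01 yields the kind of tally summarized in the figure.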
