Phylogenetic mixture models for proteins
- PMID: 18852096
- PMCID: PMC2607422
- DOI: 10.1098/rstb.2008.0180
Abstract
Standard protein substitution models use a single amino acid replacement rate matrix that summarizes the biological, chemical and physical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors: genetic code; solvent exposure; secondary and tertiary structure; protein function; etc. These factors affect the substitution pattern and, in most cases, a single replacement matrix is not enough to represent all the complexity of the evolutionary processes. This paper explores, in a maximum-likelihood framework, phylogenetic mixture models that combine several amino acid replacement matrices to better fit protein evolution. We learn these mixture models from a large alignment database extracted from HSSP, and test their performance using independent alignments from TREEBASE. We compare unsupervised learning approaches, where the site categories are unknown, to supervised ones, where estimation uses the known category of each site, based on its exposure or its secondary structure. All our models are combined with gamma-distributed rates across sites. Results show that highly significant likelihood gains are obtained when using mixture models compared with the best available single replacement matrices. Mixtures of matrices also improve over mixtures of profiles in the manner of the CAT model. The unsupervised approach tends to be better than the supervised one, but it appears difficult to implement and highly sensitive to the starting values of the parameters, meaning that the supervised approach is still of interest for initialization and model comparison. Using an unsupervised model involving three matrices, the average AIC gain per site with TREEBASE test alignments is 0.31, 0.49 and 0.61 compared with LG (named after Le & Gascuel 2008 Mol. Biol. Evol. 25, 1307-1320), WAG and JTT, respectively. This three-matrix model is significantly better than LG for 34 alignments (among 57), and significantly worse for 1 alignment only.
Moreover, tree topologies inferred with our mixture models frequently differ from those obtained with single matrices, indicating that using these mixtures impacts not only the likelihood value but also the output tree. All our models and a PhyML implementation are available from http://atgc.lirmm.fr/mixtures.
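As an illustration of the two quantities the abstract relies on, the following is a minimal sketch, not the authors' PhyML implementation. It assumes that the likelihood of a site under a mixture is the weighted sum of its likelihoods under each replacement matrix, and that the "AIC gain per site" is the difference in AIC between a reference model and the mixture, divided by the alignment length. The function names, the example numbers and the parameter counts are hypothetical; the per-matrix site likelihoods are taken as given (in practice they would come from a tree-based likelihood calculation, and gamma-distributed rates are left out here).

```python
import math


def mixture_site_log_likelihood(component_likelihoods, weights):
    """Log-likelihood of one site under a mixture of replacement matrices.

    component_likelihoods[k] is the likelihood of the site under matrix k
    (assumed to be precomputed elsewhere); weights[k] is the mixture weight.
    """
    if len(component_likelihoods) != len(weights):
        raise ValueError("one weight is needed per component")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    # Site likelihood = sum over components of weight * component likelihood.
    site_likelihood = sum(w * l for w, l in zip(weights, component_likelihoods))
    return math.log(site_likelihood)


def aic(log_likelihood, n_free_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_free_params - 2 * log_likelihood


def aic_gain_per_site(lnl_ref, k_ref, lnl_new, k_new, n_sites):
    """AIC improvement of the new model over the reference, per alignment site.

    Positive values mean the new model is preferred. This per-site
    normalisation is an assumption about how the figure is defined.
    """
    return (aic(lnl_ref, k_ref) - aic(lnl_new, k_new)) / n_sites


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(mixture_site_log_likelihood([0.2, 0.4], [0.5, 0.5]))
    print(aic_gain_per_site(-1000.0, 10, -950.0, 12, 100))
```

Under these assumptions, a mixture is preferred only when its likelihood gain outweighs the penalty for its extra free parameters (here, two additional parameters in the hypothetical example).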