Clustering of protein families into functional subtypes using Relative Complexity Measure with reduced amino acid alphabets
- PMID: 20718947
- PMCID: PMC2936399
- DOI: 10.1186/1471-2105-11-428
Abstract
Background: Phylogenetic analysis can be used to divide a protein family into subfamilies in the absence of experimental information. Most phylogenetic analysis methods utilize multiple alignment of sequences and are based on an evolutionary model. However, multiple alignment is not an automated procedure and requires human intervention to maintain alignment integrity and to produce phylogenies consistent with the functional splits in underlying sequences. To address this problem, we propose to use the alignment-free Relative Complexity Measure (RCM) combined with reduced amino acid alphabets to cluster protein families into functional subtypes purely on sequence criteria. Comparison with an alignment-based approach was also carried out to test the quality of the clustering.
Results: We demonstrate the robustness of RCM with reduced alphabets in clustering of protein sequences into families in a simulated dataset and seven well-characterized protein datasets. On protein datasets, crotonases, mandelate racemases, nucleotidyl cyclases and glycoside hydrolase family 2 were clustered into subfamilies with 100% accuracy whereas acyl transferase domains, haloacid dehalogenases, and vicinal oxygen chelates could be assigned to subfamilies with 97.2%, 96.9% and 92.2% accuracies, respectively.
Conclusions: The combination of methods presented in this paper is useful for clustering protein families into subtypes based solely on protein sequence information. The method is also flexible and computationally fast because it does not require multiple alignment of sequences.
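The abstract does not give the RCM formula or the reduced alphabets used, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a Lempel-Ziv (1976-style) complexity count and a normalized distance of the form max(c(SQ) - c(S), c(QS) - c(Q)) / max(c(S), c(Q)), which is one common alignment-free, complexity-based distance. The six-group alphabet in GROUPS and all function names are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from itertools import combinations

# Hypothetical reduced alphabet: each key is the representative letter for
# a group of amino acids (illustrative grouping, not the paper's).
GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # glycine / proline
}
REDUCE = {aa: rep for rep, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Map a protein sequence onto the reduced alphabet ('X' for unknowns)."""
    return "".join(REDUCE.get(aa, "X") for aa in seq.upper())


def lz_complexity(s):
    """Count phrases in a Lempel-Ziv (1976-style) parsing of s.

    Each phrase is the shortest substring starting at position i that has
    not already occurred in the text preceding its final character.
    """
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the phrase while it can still be copied from earlier text.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def complexity_distance(s, q):
    """Assumed normalized complexity distance between two sequences."""
    cs, cq = lz_complexity(s), lz_complexity(q)
    if max(cs, cq) == 0:
        return 0.0
    numerator = max(lz_complexity(s + q) - cs, lz_complexity(q + s) - cq)
    return numerator / max(cs, cq)


def pairwise_distances(named_seqs):
    """Return {(name_a, name_b): distance} for all pairs, after reduction."""
    reduced = {name: reduce_sequence(seq) for name, seq in named_seqs.items()}
    return {
        (a, b): complexity_distance(reduced[a], reduced[b])
        for a, b in combinations(sorted(reduced), 2)
    }
```

The resulting distances could be passed to any standard hierarchical clustering or tree-building routine to propose subfamilies. Because no alignment is computed, the cost is dominated by the complexity counts on concatenated sequence pairs.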