BMC Bioinformatics. 2010 Aug 18;11:428.

doi: 10.1186/1471-2105-11-428.

Clustering of protein families into functional subtypes using Relative Complexity Measure with reduced amino acid alphabets

Aydin Albayrak¹, Hasan H Otu, Ugur O Sezerman

PMID: 20718947
PMCID: PMC2936399
DOI: 10.1186/1471-2105-11-428

Aydin Albayrak et al. BMC Bioinformatics. 2010.

Affiliation

¹ Biological Sciences and Bioengineering, Sabanci University, Orhanli, Tuzla, Istanbul, Turkey.

Abstract

Background: Phylogenetic analysis can be used to divide a protein family into subfamilies in the absence of experimental information. Most phylogenetic analysis methods rely on multiple sequence alignment and are based on an evolutionary model. However, multiple alignment is not an automated procedure: it requires human intervention to maintain alignment integrity and to produce phylogenies consistent with the functional splits in the underlying sequences. To address this problem, we propose using the alignment-free Relative Complexity Measure (RCM) combined with reduced amino acid alphabets to cluster protein families into functional subtypes purely on sequence criteria. We also compared the results with an alignment-based approach to test the quality of the clustering.
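The two ingredients named above, recoding with a reduced alphabet and an alignment-free complexity distance, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the distance formula, so the sketch assumes a Lempel-Ziv (1976) complexity count and one normalized distance in the style of Otu and Sayood; the six-group alphabet `REDUCED_GROUPS` and the two toy sequences are hypothetical and are not the alphabets or data used in the paper.

```python
# Hypothetical 6-letter reduced alphabet (illustrative grouping, NOT the paper's ML15).
REDUCED_GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # glycine / proline
}
RESIDUE_TO_GROUP = {
    aa: rep for rep, members in REDUCED_GROUPS.items() for aa in members
}


def recode(seq):
    """Replace each residue by its group representative; unknown symbols are kept."""
    return "".join(RESIDUE_TO_GROUP.get(aa, aa) for aa in seq.upper())


def lz_complexity(s):
    """Count the components of the Lempel-Ziv (1976) exhaustive history of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the component while s[i:i+length] can be copied from a start
        # position before i (the copy may overlap the component itself).
        while i + length <= n and s.find(s[i:i + length], 0, i + length - 1) != -1:
            length += 1
        count += 1
        i += length
    return count


def rcm_distance(s, q):
    """Assumed distance: max(c(sq)-c(s), c(qs)-c(q)) / max(c(s), c(q)).

    Both sequences must be non-empty.
    """
    cs, cq = lz_complexity(s), lz_complexity(q)
    return max(lz_complexity(s + q) - cs, lz_complexity(q + s) - cq) / max(cs, cq)


# Toy sequences, made up for illustration only.
seq_a = "MKVLDAGTWYRRE"
seq_b = "MKILEAGSFYKKD"
print(rcm_distance(seq_a, seq_b))                  # distance on the original alphabet
print(rcm_distance(recode(seq_a), recode(seq_b)))  # distance after recoding
```

Recoding happens before the distance is computed, so the same distance function serves every alphabet; only the mapping table changes.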

Results: We demonstrate the robustness of RCM with reduced alphabets in clustering protein sequences into families on a simulated dataset and seven well-characterized protein datasets. On the protein datasets, crotonases, mandelate racemases, nucleotidyl cyclases, and glycoside hydrolase family 2 were clustered into subfamilies with 100% accuracy, whereas acyl transferase domains, haloacid dehalogenases, and vicinal oxygen chelates were assigned to subfamilies with accuracies of 97.2%, 96.9%, and 92.2%, respectively.

Conclusions: The combination of methods presented in this paper is useful for clustering protein families into subtypes based solely on protein sequence information. The method is also flexible and computationally fast because it does not require multiple alignment of sequences.

Figures

**Figure 1**
**Protocol Overview**. For RCM, the original sequences and the sequences recoded with reduced alphabets are used to calculate RCM-based distances, which are then passed sequentially to the Neighbor-Joining and Retree programs of the PHYLIP v3.68 package. For MSA, alignments are first carried out using ClustalW2 with the substitution matrix corresponding to each amino acid alphabet. Following bootstrap analysis with ClustalW2, the Retree program is used to root the trees with midpoint rooting and to discard branch lengths. Each phylogenetic tree is then passed to the TBC algorithm, along with an attribute file giving the expert assignment of each sequence to a family, to calculate the TBC error.
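The hand-off from pairwise distances to a PHYLIP distance program can be sketched as below. This is a hedged illustration, not the authors' pipeline: it only builds a square distance matrix in the PHYLIP text layout (taxon count on the first line, then one row per taxon with the name padded to 10 characters). The helper names `pairwise_matrix`, `to_phylip`, and `toy_distance` are hypothetical, and `toy_distance` is a simple mismatch fraction standing in for an RCM-based distance.

```python
def pairwise_matrix(seqs, distance):
    """Build a symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = distance(seqs[i], seqs[j])
    return matrix


def to_phylip(names, matrix):
    """Render a square distance matrix in PHYLIP layout.

    Names are truncated or padded to exactly 10 characters, so names that
    share their first 10 characters would collide and should be renamed first.
    """
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)
        lines.append(label + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


def toy_distance(a, b):
    """Stand-in distance: fraction of mismatched positions over the longer length."""
    longest = max(len(a), len(b))
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / longest


# Toy sequences and names, made up for illustration only.
names = ["seqA", "seqB", "seqC"]
seqs = ["MKVLDAGT", "MKILEAGS", "MRVLDAGT"]
print(to_phylip(names, pairwise_matrix(seqs, toy_distance)), end="")
```

The resulting text could be saved to a file and supplied as the input of a PHYLIP distance program; rooting and branch-length removal with Retree remain separate steps, as described in the caption.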

**Figure 2**
**Tree topology of the simulated dataset**. The identical topology of the three phylogenetic trees (i.e., RCM tree, bootstrap tree and true tree) for the simulated dataset is shown.

**Figure 3**
**Phylogenetic trees of protein families**. RCM trees were drawn using the ML15 alphabet. For each family, the taxa corresponding to different subfamilies are colored differently. (A) Crotonases. (B) Mandelate racemases. (C) Vicinal oxygen chelates. (D) Haloacid dehalogenases. (E) Nucleotidyl cyclases. (F) Acyl transferases. (G) GH2 hydrolases.

Cited by

Unearthing the root of amino acid similarity.
Stephenson JD, Freeland SJ. J Mol Evol. 2013 Oct;77(4):159-69. doi: 10.1007/s00239-013-9565-0. Epub 2013 Jun 7. PMID: 23743923. Free PMC article.

ProDis-ContSHC: learning protein dissimilarity measures and hierarchical context coherently for protein-protein comparison in protein database retrieval.
Wang J, Gao X, Wang Q, Li Y. BMC Bioinformatics. 2012 May 8;13 Suppl 7(Suppl 7):S2. doi: 10.1186/1471-2105-13-S7-S2. PMID: 22594999. Free PMC article.

Alignment-free sequence comparison: benefits, applications, and tools.
Zielezinski A, Vinga S, Almeida J, Karlowski WM. Genome Biol. 2017 Oct 3;18(1):186. doi: 10.1186/s13059-017-1319-7. PMID: 28974235. Free PMC article. Review.

Testing robustness of relative complexity measure method constructing robust phylogenetic trees for Galanthus L. using the relative complexity measure.
Bakış Y, Otu HH, Taşçı N, Meydan C, Bilgin N, Yüzbaşıoğlu S, Sezerman OU. BMC Bioinformatics. 2013 Jan 17;14:20. doi: 10.1186/1471-2105-14-20. PMID: 23323678. Free PMC article.

Novel hydrophobins from Trichoderma define a new hydrophobin subclass: protein properties, evolution, regulation and processing.
Seidl-Seiboth V, Gruber S, Sezerman U, Schwecke T, Albayrak A, Neuhof T, von Döhren H, Baker SE, Kubicek CP. J Mol Evol. 2011 Apr;72(4):339-51. doi: 10.1007/s00239-011-9438-3. Epub 2011 Mar 22. PMID: 21424760.
