BMC Bioinformatics. 2010 Aug 18;11:428.
doi: 10.1186/1471-2105-11-428.

Clustering of protein families into functional subtypes using Relative Complexity Measure with reduced amino acid alphabets

Aydin Albayrak et al. BMC Bioinformatics. 2010.

Abstract

Background: Phylogenetic analysis can be used to divide a protein family into subfamilies in the absence of experimental information. Most phylogenetic analysis methods utilize multiple alignment of sequences and are based on an evolutionary model. However, multiple alignment is not an automated procedure and requires human intervention to maintain alignment integrity and to produce phylogenies consistent with the functional splits in underlying sequences. To address this problem, we propose to use the alignment-free Relative Complexity Measure (RCM) combined with reduced amino acid alphabets to cluster protein families into functional subtypes purely on sequence criteria. Comparison with an alignment-based approach was also carried out to test the quality of the clustering.

Results: We demonstrate the robustness of RCM with reduced alphabets in clustering of protein sequences into families in a simulated dataset and seven well-characterized protein datasets. On protein datasets, crotonases, mandelate racemases, nucleotidyl cyclases and glycoside hydrolase family 2 were clustered into subfamilies with 100% accuracy whereas acyl transferase domains, haloacid dehalogenases, and vicinal oxygen chelates could be assigned to subfamilies with 97.2%, 96.9% and 92.2% accuracies, respectively.

Conclusions: The combination of methods in this paper is useful for clustering protein families into subtypes based solely on protein sequence information. The method is also flexible and computationally fast because it does not require multiple alignment of sequences.
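The recoding step mentioned in the abstract can be illustrated with a minimal Python sketch. The four-group alphabet below is a hypothetical grouping chosen only for illustration; it is not the ML15 alphabet or any other alphabet from the paper, and the names `TOY_GROUPS`, `build_recoding_table` and `recode` are likewise assumptions.

```python
# Hypothetical reduced alphabet: 20 amino acids mapped onto 4 group symbols.
# This grouping is illustrative only and is NOT one of the paper's alphabets.
TOY_GROUPS = {
    "1": "AVLIMCFWY",  # broadly hydrophobic (assumed grouping)
    "2": "STNQGP",     # polar or small (assumed grouping)
    "3": "KRH",        # basic (assumed grouping)
    "4": "DE",         # acidic (assumed grouping)
}


def build_recoding_table(groups):
    """Invert {group symbol: member residues} into {residue: group symbol}."""
    table = {}
    for symbol, members in groups.items():
        for residue in members:
            if residue in table:
                raise ValueError(f"residue {residue!r} appears in two groups")
            table[residue] = symbol
    return table


def recode(sequence, table, unknown="X"):
    """Rewrite a protein sequence in the reduced alphabet.

    Residues missing from the table (e.g. ambiguity codes) become `unknown`.
    """
    return "".join(table.get(residue, unknown) for residue in sequence.upper())


if __name__ == "__main__":
    table = build_recoding_table(TOY_GROUPS)
    print(recode("MKDES", table))
```

Recoding shortens the effective alphabet, so sequences that differ only by substitutions within a group become identical strings before any distance is computed.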

Figures

Figure 1. Protocol Overview. For RCM, the original sequences and the sequences recoded with reduced alphabets are used to calculate RCM-based distances, which are then passed sequentially to the Neighbor-Joining and Retree programs of the PHYLIP v3.68 package. For MSA, alignments are first carried out using ClustalW2 with the substitution matrix corresponding to each amino acid alphabet. Following bootstrap analysis with ClustalW2, the Retree program is used to root the trees with midpoint rooting and to discard branch lengths. Each phylogenetic tree is then passed to the TBC algorithm, along with an attribute file giving the expert assignment of each sequence to its family, to calculate the TBC error.
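The RCM branch of this protocol (sequences to distances to a PHYLIP input file) might be sketched as follows. This is a rough sketch under stated assumptions: the caption does not give the RCM formula, so the code assumes a Lempel-Ziv (1976) style complexity count and one common normalized complexity distance, and the function names are hypothetical. The output follows the square distance-matrix layout that PHYLIP distance programs read (taxon count on the first line, then a 10-character name followed by the row of distances).

```python
def lz_complexity(s):
    """Number of components in the Lempel-Ziv (1976) exhaustive parsing of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the component while it can still be copied from earlier text.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def complexity_distance(s, q):
    """Assumed RCM-style distance: extra complexity needed to append one
    sequence to the other, normalized by the larger individual complexity."""
    cs, cq = lz_complexity(s), lz_complexity(q)
    gain = max(lz_complexity(s + q) - cs, lz_complexity(q + s) - cq)
    return gain / max(cs, cq)


def phylip_distance_matrix(names, sequences):
    """Format pairwise distances as a square PHYLIP distance matrix."""
    lines = [f"{len(names):5d}"]
    for i, name in enumerate(names):
        row = [
            0.0 if i == j else complexity_distance(sequences[i], sequences[j])
            for j in range(len(sequences))
        ]
        # PHYLIP expects the taxon name in the first 10 characters of the line.
        lines.append(f"{name[:10]:<10} " + " ".join(f"{d:.4f}" for d in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy sequences, not data from the paper.
    names = ["seqA", "seqB", "seqC"]
    seqs = ["MKVLAAGIVG", "MKVLSAGIVG", "DDEEKRKRDE"]
    print(phylip_distance_matrix(names, seqs))
```

The diagonal is set to zero explicitly because a complexity-based distance between a sequence and itself is not guaranteed to be zero. The resulting text would be saved as the input file for the neighbor-joining step; tree rooting and TBC scoring are outside this sketch.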
Figure 2. Tree topology of the simulated dataset. The identical topology of the three phylogenetic trees (i.e., RCM tree, bootstrap tree and true tree) for the simulated dataset is shown.
Figure 3. Phylogenetic trees of protein families. RCM trees were drawn using the ML15 alphabet. For each family, the taxa corresponding to different subfamilies are colored differently. (A) Crotonases; (B) Mandelate racemases; (C) Vicinal oxygen chelates; (D) Haloacid dehalogenases; (E) Nucleotidyl cyclases; (F) Acyl transferases; (G) GH2 hydrolases.
