BMC Bioinformatics. 2015 Feb 5:16:34.
doi: 10.1186/s12859-014-0445-4.

Evaluation and improvements of clustering algorithms for detecting remote homologous protein families

Juliana S Bernardes et al. BMC Bioinformatics.

Abstract

Background: An important problem in computational biology is the automatic detection of protein families (groups of homologous sequences). Clustering sequences into families is at the heart of most comparative studies dealing with protein evolution, structure, and function. Many methods have been developed for this task, and they perform reasonably well (F-measure above 0.88) when grouping proteins with high sequence identity. However, for highly diverged proteins the performance of these methods can be much lower, mainly because a common evolutionary origin cannot be deduced directly from sequence similarity. To the best of our knowledge, a systematic evaluation of clustering methods on distantly homologous proteins is still lacking.

Results: We performed a comparative assessment of four clustering algorithms: Markov Clustering (MCL), Transitive Clustering (TransClust), Spectral Clustering of Protein Sequences (SCPS), and High-Fidelity clustering of protein sequences (HiFix), considering several datasets with different levels of sequence similarity. Two types of similarity measures, required by the sequence clustering methods, were used to evaluate the performance of the algorithms: the standard measure obtained from sequence-sequence comparisons, and a novel measure based on profile-profile comparisons, used here for the first time.
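To make the role of the similarity measure concrete, here is a minimal pure-Python sketch of the Markov Clustering idea (alternating expansion and inflation on a column-stochastic matrix). It is an illustration only, not the MCL implementation evaluated in the paper. The function names, the -log10 transform of e-values into edge weights, the self-loop rule, and the default inflation value of 2.0 are all assumptions made for this example.

```python
import math


def evalue_to_weight(evalue, floor=1e-200):
    """Assumed transform: map an e-value to a non-negative weight via -log10."""
    return max(0.0, -math.log10(max(evalue, floor)))


def normalize_columns(m):
    """Scale each column in place so that it sums to 1 (column-stochastic)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m


def mcl(weights, inflation=2.0, iterations=50, tol=1e-9):
    """Toy Markov Clustering on a square matrix of pairwise weights."""
    n = len(weights)
    m = [list(row) for row in weights]
    for i in range(n):
        # Self-loops (an assumed rule here) keep isolated nodes well defined.
        m[i][i] = max(max(weights[i]), 1.0)
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power, then renormalize the columns.
        inflated = normalize_columns([[x ** inflation for x in row]
                                      for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that keeps non-zero flow defines a cluster of its columns.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

The same routine can be fed weights derived from either sequence-sequence or profile-profile e-values, which is the only part of the pipeline that changes between the two similarity measures.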

Conclusions: The results reveal low clustering performance for the highly divergent datasets when the standard measure was used. However, the novel measure based on profile-profile comparisons substantially improved the performance of the four methods, especially when very low sequence identity datasets were evaluated. We also performed a parameter optimization step to determine the best configuration for each clustering method. We found that TransClust clearly outperformed the other methods for most datasets. This work also provides guidelines for the practical application of sequence clustering methods aimed at accurately detecting groups of related protein sequences.
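The abstract reports performance as an F-measure but does not define it. The sketch below uses one common clustering formulation as an assumption: each reference family is matched to the predicted cluster that gives the highest harmonic mean of precision and recall, and the per-family scores are averaged with weights proportional to family size. The function names and the dictionary-based input format are hypothetical.

```python
from collections import defaultdict


def group_by_label(assignment):
    """Turn {protein_id: label} into {label: set of protein_ids}."""
    groups = defaultdict(set)
    for protein, label in assignment.items():
        groups[label].add(protein)
    return groups


def clustering_f_measure(reference, predicted):
    """Size-weighted best-match F-measure (one common definition, assumed here).

    reference: {protein_id: family label}
    predicted: {protein_id: cluster label}
    """
    families = group_by_label(reference)
    clusters = group_by_label(predicted)
    total = len(reference)
    score = 0.0
    for family in families.values():
        best = 0.0
        for cluster in clusters.values():
            overlap = len(family & cluster)
            if overlap == 0:
                continue
            precision = overlap / len(cluster)
            recall = overlap / len(family)
            best = max(best, 2 * precision * recall / (precision + recall))
        score += len(family) / total * best
    return score
```

For example, merging two reference families of two proteins each into a single cluster gives precision 0.5 and recall 1.0 for both families, so the overall score is 2/3.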

Figures

Figure 1
Distribution of minimum BLAST e-values for GOLD and ASTRAL A-10 datasets. The GOLD dataset (a) is a collection of enzymes that were manually assigned to protein families/super-families. A-10 (b) is an ASTRAL subset of the SCOP database that contains only sequences with identities less than 10%. For each protein in both datasets, we considered the e-value to the nearest neighbor from its own family/superfamily (intra curves) and the e-value to the nearest neighbor from any other family/superfamily (inter curves).
Figure 2
F-measure improvement when using profile-profile comparisons. Ratio between profile-profile comparison and sequence-sequence comparison F-measures for families (a) and super-families (b).
Figure 3
Distribution of minimum intra-family and inter-family e-values across all datasets. Curves for the ASTRAL subsets A-10, A-20, A-30, A-50, A-70, A-90 and A-95 are shown in panels a, b, c, d, e, f and g respectively, and curves for the GOLD dataset are shown in panel h. E-values associated with sequence-sequence comparisons (SSCs) were computed by BLAST, while e-values related to profile-profile comparisons (PPCs) were obtained by combining HHBlits and HHsearch. For each protein in the datasets, we considered the e-value to the nearest neighbor from its own family (intra curves) and the e-value to the nearest neighbor from any other family (inter curves). Solid lines indicate BLAST e-values and dashed lines indicate HHsearch e-values.
Figure 4
Distribution of minimum intra-super-family and inter-super-family e-values across all datasets. Curves for the ASTRAL subsets A-10, A-20, A-30, A-50, A-70, A-90 and A-95 are shown in panels a, b, c, d, e, f and g respectively, and curves for the GOLD dataset are shown in panel h. E-values associated with sequence-sequence comparisons (SSCs) were computed by BLAST, while e-values related to profile-profile comparisons (PPCs) were obtained by combining HHBlits and HHsearch. For each protein in the datasets, we considered the e-value to the nearest neighbor from its own super-family (intra curves) and the e-value to the nearest neighbor from any other super-family (inter curves). Solid lines indicate BLAST e-values and dashed lines indicate HHsearch e-values.
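The intra and inter curves described in the captions above rest on a simple per-protein computation: the smallest e-value to a member of the same group and the smallest e-value to a member of any other group. A minimal sketch follows, assuming the pairwise e-values have already been parsed into a dictionary keyed by (query, hit) pairs. The function name and the input format are hypothetical and are not taken from the paper.

```python
def nearest_neighbor_evalues(evalues, group):
    """Return per-protein minimum intra-group and inter-group e-values.

    evalues: {(query_id, hit_id): e-value}, assumed already parsed from
             search output (for example BLAST or HHsearch results)
    group:   {protein_id: family or super-family label}
    """
    intra, inter = {}, {}
    for (query, hit), evalue in evalues.items():
        if query == hit:
            continue  # a protein is not its own neighbor
        target = intra if group[query] == group[hit] else inter
        if query not in target or evalue < target[query]:
            target[query] = evalue
    return intra, inter
```

A protein whose inter-group minimum is lower than its intra-group minimum has a nearest neighbor outside its own family, which is the situation in which similarity-based clustering is most likely to misplace it.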

