BMC Bioinformatics. 2005 Feb 9;6:24.

doi: 10.1186/1471-2105-6-24.

Clustering the annotation space of proteins

Victor Kunin¹, Christos A Ouzounis

PMID: 15703069
PMCID: PMC552314
DOI: 10.1186/1471-2105-6-24

Victor Kunin et al. BMC Bioinformatics. 2005.

Affiliation

¹ Computational Genomics Group, EMBL-EBI, Cambridge, CB10 1SD, UK. kunin@ebi.ac.uk


Abstract

Background: Current protein clustering methods rely on either sequence or functional similarities between proteins, thereby limiting inferences to one of these areas.

Results: Here we report a new approach, named CLAN, which clusters proteins according to both annotation and sequence similarity. This approach is extremely fast, clustering the complete SwissProt database within minutes. It is also accurate, recovering consistent protein families that agree, on average, more than 97% with sequence-based protein families from Pfam. Discrepancies between sequence- and annotation-based clusters were scrutinized and the reasons for them are reported. We give examples for each of these cases and discuss in detail one example of a propagated error in SwissProt: a vacuolar ATPase subunit M9.2 erroneously annotated as vacuolar ATP synthase subunit H. The CLAN algorithm is available from the authors, and the CLAN database is accessible at http://maine.ebi.ac.uk:8000/cgi-bin/clan/ClanSearch.pl

Conclusions: CLAN creates refined function- and sequence-specific protein families that can be used to identify and annotate unknown family members. It also allows easy identification of erroneous annotations by spotting inconsistencies between similarities at the annotation and sequence levels.


Figures

**Figure 1**
**Score calculation.** An example of the score calculation performed by CLAN. A. In the pre-processing stage, a dictionary is constructed from the occurrences of terms in SwissProt. Multiple occurrences of a term in a single entry are counted once. Frequency is calculated by dividing the term's occurrence count by the number of entries in the database. Numbers are rounded to the third decimal digit. B. An example of the score calculation for an alignment of two actual annotations.
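The pre-processing stage described in panel A can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `term_frequencies`, the lowercase whitespace tokenization, and the four toy entries are all assumptions made for the example. The scoring step of panel B is not reproduced, because the caption does not specify how the score is derived from the frequencies.

```python
from collections import Counter


def term_frequencies(annotations):
    """Map each term to the fraction of entries that contain it.

    Hypothetical helper: tokenization (lowercase, whitespace split)
    is an assumption, not something stated in the source.
    """
    n = len(annotations)
    if n == 0:
        return {}
    counts = Counter()
    for text in annotations:
        # Using a set means a term repeated within one entry counts once.
        counts.update(set(text.lower().split()))
    # Frequency = entries containing the term / total entries,
    # rounded to the third decimal digit as in the caption.
    return {term: round(c / n, 3) for term, c in counts.items()}


# Toy stand-in for a database of annotation lines (illustrative only).
entries = [
    "vacuolar ATP synthase subunit H",
    "vacuolar ATPase subunit M9.2",
    "ATP synthase subunit alpha",
    "actin binding protein actin",
]
freqs = term_frequencies(entries)
```

With these toy entries, `freqs["subunit"]` is 0.75 because the term appears in three of four entries, and `freqs["actin"]` is 0.25 even though the word occurs twice in the last entry.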


Cited by

BLANNOTATOR: enhanced homology-based function prediction of bacterial proteins.
Kankainen M, Ojala T, Holm L. BMC Bioinformatics. 2012 Feb 15;13:33. doi: 10.1186/1471-2105-13-33. PMID: 22335941.
Annotation inconsistencies beyond sequence similarity-based function prediction - phylogeny and genome structure.
Promponas VJ, Iliopoulos I, Ouzounis CA. Stand Genomic Sci. 2015 Nov 19;10:108. doi: 10.1186/s40793-015-0101-2. PMID: 26594309.
Cluster analysis of protein array results via similarity of Gene Ontology annotation.
Wolting C, McGlade CJ, Tritchler D. BMC Bioinformatics. 2006 Jul 12;7:338. doi: 10.1186/1471-2105-7-338. PMID: 16836750.
Novel knowledge-based mean force potential at the profile level.
Dong Q, Wang X, Lin L. BMC Bioinformatics. 2006 Jun 27;7:324. doi: 10.1186/1471-2105-7-324. PMID: 16803615.
A bioinformatician's guide to metagenomics.
Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P. Microbiol Mol Biol Rev. 2008 Dec;72(4):557-78. doi: 10.1128/MMBR.00009-08. PMID: 19052320. Review.


