BMC Bioinformatics. 2005 Feb 9;6:24.
doi: 10.1186/1471-2105-6-24.

Clustering the annotation space of proteins


Victor Kunin et al. BMC Bioinformatics. 2005.

Abstract

Background: Current protein clustering methods rely on either sequence or functional similarities between proteins, thereby limiting inferences to one of these areas.

Results: Here we report a new approach, named CLAN, which clusters proteins according to both annotation and sequence similarity. This approach is extremely fast, clustering the complete SwissProt database within minutes. It is also accurate, recovering consistent protein families that agree, on average, more than 97% with sequence-based protein families from Pfam. Discrepancies between sequence- and annotation-based clusters were scrutinized and their causes are reported. We present examples of each case and discuss in detail an example of a propagated error in SwissProt: a vacuolar ATPase subunit M9.2 erroneously annotated as vacuolar ATP synthase subunit H. The CLAN algorithm is available from the authors, and the CLAN database is accessible at http://maine.ebi.ac.uk:8000/cgi-bin/clan/ClanSearch.pl

Conclusions: CLAN creates refined function- and sequence-specific protein families that can be used to identify and annotate unknown family members. It also allows easy identification of erroneous annotations by spotting inconsistencies between similarities at the annotation and sequence levels.

Figure 1
Score calculation. An example of how CLAN calculates the score. A. In the pre-processing stage, a dictionary is constructed from the occurrences of terms in SwissProt. Multiple occurrences of a term in a single entry are counted once. The frequency of a term is calculated by dividing its number of occurrences by the number of entries in the database. Numbers are rounded to three decimal places. B. An example of the score calculation for an alignment of two actual annotations.
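
The pre-processing step in panel A can be illustrated with a minimal Python sketch. It assumes that terms are obtained by lowercasing and splitting on whitespace, which the caption does not specify. The function name `term_frequencies` and the toy annotation strings are hypothetical. The scoring step of panel B is not reproduced, since its formula is not given here.

```python
from collections import Counter


def term_frequencies(annotations):
    """Return {term: frequency} for a list of annotation strings.

    Assumption: terms are whitespace-separated, case-insensitive tokens.
    A term repeated within one entry is counted once, and the frequency is
    the number of entries containing the term divided by the total number
    of entries, rounded to three decimal places.
    """
    n_entries = len(annotations)
    if n_entries == 0:
        return {}
    counts = Counter()
    for text in annotations:
        # set() ensures each term contributes at most once per entry
        counts.update(set(text.lower().split()))
    return {term: round(c / n_entries, 3) for term, c in counts.items()}


# Toy entries (hypothetical, not taken from SwissProt)
entries = [
    "vacuolar ATP synthase subunit H",
    "vacuolar ATPase subunit M9.2",
    "actin actin binding protein",
    "ATP synthase subunit alpha",
]
freqs = term_frequencies(entries)
```

In this toy set, `freqs["actin"]` is 0.25 even though the term appears twice in the third entry, because each entry contributes at most one occurrence per term.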
