Nucleic Acids Res. 2003 Aug 1;31(15):4632-8.
doi: 10.1093/nar/gkg495.

Protein families and TRIBES in genome sequence space

Anton J Enright et al. Nucleic Acids Res.

Abstract

Accurate detection of protein families allows assignment of protein function and the analysis of functional diversity in complete genomes. Recently, we presented a novel algorithm called TribeMCL for the detection of protein families that is both accurate and efficient. This method allows family analysis to be carried out on a very large scale. Using TribeMCL, we have generated a resource called TRIBES that contains protein family information, comprising annotations, protein sequence alignments and phylogenetic distributions describing 311 257 proteins from 83 completely sequenced genomes. The analysis of at least 60 934 detected protein families reveals that, with the essential families excluded, paralogy levels are similar between prokaryotes, irrespective of genome size. The number of essential families is estimated to be between 366 and 426. We also show that the currently known space of protein families is scale free and discuss the implications of this distribution. In addition, we show that smaller families are often formed by shorter proteins and discuss the reasons for this intriguing pattern. Finally, we analyse the functional diversity of protein families in entire genome sequences. The TRIBES protein family resource is accessible at http://www.ebi.ac.uk/research/cgg/tribes/.
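To make the scale-free claim concrete, the following is a minimal sketch of how a family-size distribution could be tabulated and checked for power-law behaviour. It is not the authors' method: the function names, the toy protein-to-family mapping, and the use of an ordinary least-squares fit on log-log axes are all assumptions made for illustration. A straight line with negative slope on log-log axes is the usual informal signature of a power law.

```python
import math
from collections import Counter


def family_size_counts(family_of):
    """Return a Counter mapping family size -> number of families of that size.

    `family_of` is a hypothetical dict of protein id -> family id.
    """
    members_per_family = Counter(family_of.values())
    return Counter(members_per_family.values())


def loglog_slope(size_counts):
    """Least-squares slope of log(count) against log(family size)."""
    xs = [math.log(size) for size in size_counts]
    ys = [math.log(count) for count in size_counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data (invented): four singleton families, two pairs, one family of four.
toy = {
    "p1": "f1", "p2": "f2", "p3": "f3", "p4": "f4",
    "p5": "f5", "p6": "f5",
    "p7": "f6", "p8": "f6",
    "p9": "f7", "p10": "f7", "p11": "f7", "p12": "f7",
}

counts = family_size_counts(toy)
slope = loglog_slope(counts)  # close to -1 for this toy data
```

On real data, a log-log regression of this kind is only a rough diagnostic; it is used here because it is the simplest way to show what a power-law distribution of family sizes means.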

Figures

Figure 1
Phylogenetic distribution of protein families in the Tribes database. The numbers show relative abundance of protein families unique to each domain as well as shared ones across the three domains of life.
Figure 2
Correspondence between number of genes and number of Tribes families in available genomes. Colours show correspondence to the domains of life, and eukaryotic genomes are named.
Figure 3
The power law distribution of the Tribes family sizes. Counts of families for each family size are shown.
Figure 4
Distribution of protein lengths in the Tribes families of various sizes. Note that smaller families are composed of shorter proteins (see text for discussion).
Figure 5
Annotation categories for protein families in the three domains of life. The analysis was performed with GeneQuiz (29). (Top) Families are divided according to annotation quality (current knowledge) and (bottom) according to functional class membership.
