Comparative Study
Genome Res. 2001 Oct;11(10):1632-40.
doi: 10.1101/gr.183801.

Annotation transfer for genomics: measuring functional divergence in multi-domain proteins

H Hegyi et al. Genome Res. 2001 Oct.

Abstract

Annotation transfer is a principal process in genome annotation. It involves "transferring" structural and functional annotation to uncharacterized open reading frames (ORFs) in a newly completed genome from experimentally characterized proteins similar in sequence. To prevent errors in genome annotation, it is important that this process be robust and statistically well-characterized, especially with regard to how it depends on the degree of sequence similarity. Previously, we and others have analyzed annotation transfer in single-domain proteins. Multi-domain proteins, which make up the bulk of the ORFs in eukaryotic genomes, present more complex issues in functional conservation. Here we present a large-scale survey of annotation transfer in these proteins, using scop superfamilies to define domain folds and a thesaurus based on SWISS-PROT keywords to define functional categories. Our survey reveals that multi-domain proteins have significantly less functional conservation than single-domain ones, except when they share the exact same combination of domain folds. In particular, we find that for multi-domain proteins, approximate function can be accurately transferred with only 35% certainty for pairs of proteins sharing one structural superfamily. In contrast, this value is 67% for pairs of single-domain proteins sharing the same structural superfamily. On the other hand, if two multi-domain proteins contain the same combination of two structural superfamilies, the probability of their sharing the same function increases to 80%; in the case of complete coverage along the full length of both proteins, this value increases further to >90%. Moreover, we found that only 70 of the current total of 455 structural superfamilies are found in both single- and multi-domain proteins, and only 14 of these were associated with the same function in both categories of proteins.
We also investigated the degree to which function could be transferred between pairs of multi-domain proteins with respect to the degree of sequence similarity between them, finding that functional divergence at a given amount of sequence similarity is always about two-fold greater for pairs of multi-domain proteins (sharing similarity over a single domain) in comparison to pairs of single-domain ones, though the overall shape of the relationship is quite similar. Further information is available at http://partslist.org/func or http://bioinfo.mbb.yale.edu/partslist/func.
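
The percentages quoted above can be read as the fraction of protein pairs in a given category that fall into the same functional category. The following is a minimal sketch of that kind of tally, not the authors' pipeline: the function name `conservation_rate`, the category labels, and the toy pairs are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def conservation_rate(pairs):
    """Return, per pair category, the fraction of pairs whose two
    functional labels are identical.

    `pairs` is an iterable of (category, function_a, function_b) tuples.
    """
    totals = defaultdict(int)
    same = defaultdict(int)
    for category, func_a, func_b in pairs:
        totals[category] += 1
        if func_a == func_b:
            same[category] += 1
    return {category: same[category] / totals[category] for category in totals}


# Hypothetical toy data (NOT taken from the paper).
toy_pairs = [
    ("single-domain", "hydrolase", "hydrolase"),
    ("single-domain", "hydrolase", "transferase"),
    ("multi-domain", "kinase", "kinase"),
    ("multi-domain", "kinase", "isomerase"),
    ("multi-domain", "kinase", "transporter"),
    ("multi-domain", "ligase", "oxidoreductase"),
]

rates = conservation_rate(toy_pairs)
for category, rate in rates.items():
    print(f"{category}: {rate:.0%} of pairs share a functional category")
```

In the survey itself, functional categories come from a thesaurus of SWISS-PROT keywords and pairs are grouped by shared scop superfamilies; the sketch simply assumes those labels have already been assigned.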

Figures

Figure 1
Schematic illustrating annotation transfer. This figure illustrates the process of annotation transfer for a group of hypothetical TIM barrel proteins. The leftmost panel represents sequence comparisons between idealized barrel domains from a number of organisms. The next panel shows analogous results for structural comparison, and the panel after that, functional comparison. The rightmost panel represents sequence comparisons between idealized multi-domain proteins that match over a single domain, the subject of much of this paper.
Figure 2
Distribution of multi-domain combinations amongst the genomes. The figure shows the occurrence of multi-domain fold combinations in a number of genomes, indicating its great variability. Each row indicates a particular combination of scop fold pairs (using scop 1.39), where a fold pair is defined as two distinct folds occurring in tandem in a protein. Each column represents a different genome, using the four-letter codes in the PartsList system (Qian et al. 2001): Aaeo, Aquifex aeolicus; Aful, Archaeoglobus fulgidus; Bbur, Borrelia burgdorferi; Bsub, Bacillus subtilis; Cele, Caenorhabditis elegans; Cpne, Chlamydia pneumoniae; Ctra, Chlamydia trachomatis; Ecol, Escherichia coli; Hinf, Haemophilus influenzae Rd; Hpyl, Helicobacter pylori; Mthe, Methanobacterium thermoautotrophicum; Mjan, Methanococcus jannaschii; Mtub, Mycobacterium tuberculosis; Mgen, Mycoplasma genitalium; Mpne, Mycoplasma pneumoniae; Phor, Pyrococcus horikoshii; Rpro, Rickettsia prowazekii; Scer, Saccharomyces cerevisiae; Syne, Synechocystis sp.; Tpal, Treponema pallidum. The numbers in each intersection cell indicate the number of times the fold pairs occur in a genome. Only the 20 most common fold pair combinations are shown here; the remainder are shown on the Web site (http://partslist.org/func). If a cell is greater than 6, it is shaded black; between 3 and 6, gray; and below 3, white. The blank spaces show instances in which one of the pairs does not occur in the organism at all (indicated by a value of -1 in the data table on the Web site). The fold assignments are done in a fashion consistent with those in PartsList and associated systems (Gerstein 1997; Lin et al. 2000; Drawid et al. 2001; Harrison et al. 2001; Qian et al. 2001).
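
The counting and shading rules in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions rather than the PartsList implementation: "in tandem" is taken to mean adjacent in sequence order, the range "between 3 and 6" is treated as inclusive, and the helper names `tandem_fold_pairs` and `shade` and the toy fold assignments are hypothetical.

```python
from collections import Counter


def tandem_fold_pairs(folds):
    """Return adjacent pairs of distinct folds from the ordered list of
    folds assigned to one protein (assumption: tandem = adjacent)."""
    return [(a, b) for a, b in zip(folds, folds[1:]) if a != b]


def shade(count):
    """Map a cell value to the shading described in the caption."""
    if count == -1:   # one fold of the pair is absent from the organism
        return "blank"
    if count > 6:
        return "black"
    if count >= 3:    # assumption: 3 and 6 are both included in "gray"
        return "gray"
    return "white"


# Hypothetical fold assignments for three proteins of one toy genome.
toy_genome = [
    ["TIM barrel", "Rossmann fold"],
    ["TIM barrel", "Rossmann fold", "Rossmann fold"],
    ["Ferredoxin-like"],
]

pair_counts = Counter(
    pair for folds in toy_genome for pair in tandem_fold_pairs(folds)
)
for pair, count in pair_counts.items():
    print(pair, count, shade(count))
```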
Figure 3
Distribution of proteins amongst broad structural and functional classes. The figure shows the distribution of the matches among the seven structural and two functional classes in single- and multi-domain proteins. The single-domain and multi-domain matches each total 100%, independently of each other. The horizontal axis indicates the seven scop classes, which are (from 1 to 7): all-alpha, all-beta, alpha/beta, alpha + beta, multi-domain, membrane, and small protein.
Figure 4
Divergence in function with respect to sequence similarity. Relative number of matching domains with multiple functions, as a function of the e-value threshold. Diamonds represent single-domain proteins; squares represent multi-domain ones (matching over just a single domain). The first value on the X-axis is 4 (corresponding to an e-value of 10⁻⁴).
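
One way to read the Figure 4 axes is that the X value k stands for an e-value threshold of 10^-k, and the Y value is the share of matches passing that threshold whose functions differ. The sketch below follows that reading, which is an assumption about the plot rather than a description of the authors' code; `divergence_by_threshold` and the toy matches are hypothetical.

```python
def divergence_by_threshold(matches, exponents):
    """For each exponent k, keep matches with e-value <= 10**-k and return
    the fraction of kept matches whose functions differ.

    `matches` is a list of (e_value, same_function) tuples.
    Returns None for a threshold that no match passes.
    """
    result = {}
    for k in exponents:
        threshold = 10.0 ** -k
        kept = [same for e_value, same in matches if e_value <= threshold]
        if kept:
            result[k] = sum(1 for same in kept if not same) / len(kept)
        else:
            result[k] = None
    return result


# Hypothetical toy matches (NOT taken from the paper).
toy_matches = [
    (1e-5, False),
    (1e-5, True),
    (1e-20, True),
    (1e-50, True),
]

curve = divergence_by_threshold(toy_matches, [4, 10, 60])
print(curve)
```

Computing one such curve for single-domain pairs and another for multi-domain pairs would give the two series that the caption describes as diamonds and squares.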
