BMC Bioinformatics. 2007 Aug 3;8:284.
doi: 10.1186/1471-2105-8-284.

Exploring inconsistencies in genome-wide protein function annotations: a machine learning approach

Carson Andorf et al. BMC Bioinformatics. 2007.

Abstract

Background: Incorrectly annotated sequence data are becoming more commonplace as databases increasingly rely on automated techniques for annotation. Hence, there is an urgent need for computational methods for checking consistency of such annotations against independent sources of evidence and detecting potential annotation errors. We show how a machine learning approach designed to automatically predict a protein's Gene Ontology (GO) functional class can be employed to identify potential gene annotation errors.

Results: In a set of 211 previously annotated mouse protein kinases, we found that 201 of the GO annotations returned by AmiGO appear to be inconsistent with the UniProt functions assigned to their human counterparts. In contrast, 97% of the predicted annotations generated using a machine learning approach were consistent with the UniProt annotations of the human counterparts, as well as with available annotations for these mouse protein kinases in the Mouse Kinome database.

Conclusion: We conjecture that most of our predicted annotations are, therefore, correct and suggest that the machine learning approach developed here could be routinely used to detect potential errors in GO annotations generated by high-throughput gene annotation projects.

Editor's Note: Authors from the original publication (Okazaki et al.: Nature 2002, 420:563-73) have provided their response to Andorf et al., directly following the correspondence.

Figures

Figure 1
Distribution of Ser/Thr, Tyr, and dual specificity kinases among annotated protein kinases in human versus mouse genomes [see Additional file 9]. Pie charts illustrate the functional family distribution of protein kinases in human (top) versus mouse (bottom), based on:
a. AmiGO functional classifications: Ser/Thr (GO:0004674) [Blue]; Tyr (GO:0004713) [Red]; or "dual specificity" (proteins with both GO classifications) [Yellow].
b. UniProt annotations: classification based on UniProt records containing the keywords Ser/Thr [Blue], Tyr [Red], or dual specificity [Yellow] [see Additional file 2].
c. Predicted annotations by the HDTree classifier: the classifier was built on human proteins with functional labels Ser/Thr (GO:0004674) [Blue], Tyr (GO:0004713) [Red], or "dual specificity" [Yellow], derived from AmiGO and verified by UniProt [see Additional file 4].
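The three-way classification described in panel a can be sketched in a few lines of Python. This is only an illustration of the labelling rule stated in the caption (one term, the other term, or both), not the authors' pipeline. The function names and the input layout (a mapping from protein ID to GO term IDs) are assumptions, and the GO identifiers are written in the canonical colon form.

```python
from collections import Counter

# GO molecular function terms named in the Figure 1 caption.
SER_THR = "GO:0004674"  # protein serine/threonine kinase activity
TYR = "GO:0004713"      # protein tyrosine kinase activity


def kinase_class(go_terms):
    """Map a protein's GO term IDs to one of the three Figure 1 classes.

    Hypothetical helper: returns None when neither kinase term is present.
    """
    terms = set(go_terms)
    has_ser_thr = SER_THR in terms
    has_tyr = TYR in terms
    if has_ser_thr and has_tyr:
        return "dual specificity"  # both GO classifications assigned
    if has_ser_thr:
        return "Ser/Thr"
    if has_tyr:
        return "Tyr"
    return None


def class_distribution(annotations):
    """Count proteins per class; `annotations` maps protein ID -> GO IDs."""
    counts = Counter()
    for go_terms in annotations.values():
        label = kinase_class(go_terms)
        if label is not None:
            counts[label] += 1
    return counts
```

Calling `class_distribution` on an AmiGO-style export would give the slice sizes that a pie chart like panel a summarizes; the protein IDs passed in are whatever the export provides.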
Figure 2
Comparison of UniProt annotations of mouse protein kinase sequences with annotations from AmiGO or predicted by HDTree. The bar charts show the number of proteins whose annotation agrees (blue) or disagrees (red) with the annotation found in UniProt. Each of the three functional classes found in the UniProt records is represented by two bars: the blue bar counts proteins for which UniProt and the given method assign the same annotation for that function, and the red bar counts proteins for which they assign different annotations. a. AmiGO vs. UniProt annotations. b. HDTree predictions vs. UniProt annotations [see Additional files 3 and 4].
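The agreement and disagreement counts plotted in Figure 2 amount to a per-class tally against a reference labelling. A minimal sketch follows, assuming each source has already been reduced to one class label per protein. The function names, the dictionary layout, and the choice to skip proteins missing from the compared source are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict


def agreement_by_class(reference, candidate):
    """Tally agreement per reference class.

    `reference` and `candidate` map protein ID -> class label
    (for example UniProt as reference, AmiGO or HDTree as candidate).
    Proteins absent from `candidate` are skipped.
    """
    tally = defaultdict(lambda: {"agree": 0, "disagree": 0})
    for protein_id, ref_label in reference.items():
        if protein_id not in candidate:
            continue
        outcome = "agree" if candidate[protein_id] == ref_label else "disagree"
        tally[ref_label][outcome] += 1
    return dict(tally)


def agreement_rate(tally):
    """Overall fraction of compared proteins whose labels agree."""
    agree = sum(cell["agree"] for cell in tally.values())
    total = agree + sum(cell["disagree"] for cell in tally.values())
    return agree / total if total else 0.0
```

Each entry of the returned dictionary corresponds to one blue/red bar pair, and `agreement_rate` gives the kind of summary percentage quoted in the Results.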
