Nat Chem Biol. 2010 Jan;6(1):34-40.
doi: 10.1038/nchembio.266. Epub 2009 Nov 22.

Automatic policing of biochemical annotations using genomic correlations

Tzu-Lin Hsiao et al. Nat Chem Biol. 2010 Jan.

Abstract

With the increasing role of computational tools in the analysis of sequenced genomes, there is an urgent need to maintain high accuracy of functional annotations. Misannotations can be easily generated and propagated through databases by functional transfer based on sequence homology. We developed and optimized an automatic policing method to detect biochemical misannotations using context genomic correlations. The method works by finding genes with unusually weak genomic correlations in their assigned network positions. We demonstrate the accuracy of the method using a cross-validated approach. In addition, we show that the method identifies a significant number of potential misannotations in Bacillus subtilis, including metabolic assignments already shown to be incorrect experimentally. The experimental analysis of the mispredicted genes forming the leucine degradation pathway in B. subtilis demonstrates that computational policing tools can generate important biological hypotheses.

Figures

Figure 1
Illustration of the developed approach. In the figure, network nodes represent metabolic genes, and edges represent connections established by shared metabolites. Using sequence homology, genes X and Y from different organisms have been assigned to EC 1.2.3.4. Gene X displays strong context-based correlations (darker blue indicating stronger correlations) with neighboring network genes. Consequently, the annotation of X is likely to be correct. In contrast, gene Y does not fit well in the assigned network position and is likely to be misannotated.
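The idea in Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' scoring method: the function name `position_fit`, the use of a plain mean, and all correlation values below are assumptions chosen only to show how a gene that fits its assigned network position (X) differs from one that does not (Y).

```python
from statistics import mean


def position_fit(gene, neighbors, correlation):
    """Mean context correlation between a gene and the genes adjacent
    to its assigned network position (hypothetical scoring rule)."""
    values = [
        correlation[frozenset((gene, n))]
        for n in neighbors
        if frozenset((gene, n)) in correlation
    ]
    # A gene with no measured correlations gets the weakest possible fit.
    return mean(values) if values else 0.0


# Hypothetical toy data: X and Y come from different organisms, so each
# has its own set of network neighbors around the same EC assignment.
neighbors = {"X": ["A1", "B1"], "Y": ["A2", "B2"]}
correlation = {
    frozenset(("X", "A1")): 0.8,
    frozenset(("X", "B1")): 0.6,
    frozenset(("Y", "A2")): 0.1,
    frozenset(("Y", "B2")): 0.2,
}

fit_x = position_fit("X", neighbors["X"], correlation)
fit_y = position_fit("Y", neighbors["Y"], correlation)

# The gene with the weaker fit is the more likely misannotation.
suspect = min(("X", fit_x), ("Y", fit_y), key=lambda pair: pair[1])[0]
```

In this toy case `suspect` is Y, mirroring the figure. A real implementation would need actual context-based correlation measures and a calibrated classifier rather than a simple average.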
Figure 2
Performance on identifying misannotations. a) The ROC curves for different types of artificially generated misannotations in the yeast network. The True Negative set 1 (TN1) was generated by randomly assigning incorrect metabolic functions to a fraction of network genes. The TN2 set was generated by reassigning network genes to new metabolic activities only if they had at least 30% sequence identity to the newly assigned (incorrect) activities. The TN3 set was generated by assigning genes to new activities only if they had similar (within 10%) or higher sequence identity to the reassigned (incorrect) activities. In all cases the remaining (not reassigned) activities were used as true positive examples. For realistic misannotation models, simulated by the sets TN2 and TN3, the method correctly identifies about 70%–90% of misannotations at a 5%–15% false positive rate. The red dot in the figure approximately corresponds to 70% true positives and 5% false positives. b) The cumulative distributions of the classification confidence scores for B. subtilis metabolic assignments. The B. subtilis annotations made simultaneously by all analyzed databases (KEGG, MetaCyc and Swiss-Prot) are shown in red; annotations unique to KEGG, MetaCyc, or Swiss-Prot are shown in black. For comparison we also show the true negative set TN3 from S. cerevisiae in blue. The cumulative distributions demonstrate that the consensus annotations (red) are, on average, more accurate than the ones unique to individual databases (black; Kolmogorov-Smirnov test P = 2×10⁻¹⁹). However, on average, database-specific annotations still score significantly better than true misannotations (KS P = 2×10⁻⁴).
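The two evaluations described in this caption, an operating point on a ROC curve and a Kolmogorov-Smirnov comparison of score distributions, can be sketched in plain Python. This is an illustration under assumptions, not the authors' code: the function names and the toy scores are hypothetical, lower scores are assumed to indicate likelier misannotations, and only the KS statistic (not its P value) is computed.

```python
from bisect import bisect_right


def rates_at_threshold(correct_scores, misannotated_scores, threshold):
    """Flag every annotation scoring below `threshold`.

    Returns (detection_rate, false_positive_rate): the fraction of
    misannotations flagged and the fraction of correct annotations
    flagged by mistake.
    """
    detected = sum(s < threshold for s in misannotated_scores)
    false_pos = sum(s < threshold for s in correct_scores)
    return (detected / len(misannotated_scores),
            false_pos / len(correct_scores))


def roc_points(correct_scores, misannotated_scores):
    """(false_positive_rate, detection_rate) pairs over all thresholds."""
    thresholds = sorted(set(correct_scores) | set(misannotated_scores))
    thresholds.append(float("inf"))  # final threshold flags everything
    points = []
    for t in thresholds:
        tpr, fpr = rates_at_threshold(correct_scores, misannotated_scores, t)
        points.append((fpr, tpr))
    return points


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical confidence scores (not data from the paper).
correct = [0.9, 0.8, 0.7, 0.6, 0.3]
misannotated = [0.1, 0.2, 0.4, 0.65]

detection_rate, false_positive_rate = rates_at_threshold(
    correct, misannotated, threshold=0.5)
curve = roc_points(correct, misannotated)
ks = ks_statistic(correct, misannotated)
```

Sweeping the threshold traces the ROC curve, and a single threshold corresponds to one point on it, such as the red dot in panel a. Converting the KS statistic into a P value additionally requires the sample sizes and the KS distribution, which a statistics library would normally provide.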
Figure 3
Function of genes forming the yng cluster in B. subtilis. a) The genomic positions of the yng genes are shown in green. The detected misannotations are indicated in red. The predicted functions, forming the degradation pathway, are shown in blue. The expression of all yng genes is controlled by the σE transcription factor; the gene mmgA is also under σE control and is responsible for the last step of leucine catabolism. b) Fractional 13C labeling of acetyl-CoA in wild-type sporulating cells and in sporulating yng mutants. The 13C labeling in the figure indicates the fraction of the acetyl-CoA isotopomer generated from leucine in sporulating cells only (see Methods). The error bars in the figure represent SEM. The background acetyl-CoA isotopomer labeling is shown by the dashed line.
Figure 4
