BMC Bioinformatics. 2007 Jul 21:8:261.
doi: 10.1186/1471-2105-8-261.

Applying negative rule mining to improve genome annotation

Irena I Artamonova et al. BMC Bioinformatics.

Abstract

Background: Unsupervised annotation of proteins by software pipelines suffers from very high error rates. Spurious functional assignments are usually caused by unwarranted homology-based transfer of information from existing database entries to the new target sequences. We have previously demonstrated that data mining in large sequence annotation databanks can help identify annotation items that are strongly associated with each other, and that exceptions from strong positive association rules often point to potential annotation errors. Here we investigate the applicability of negative association rule mining to revealing erroneously assigned annotation items.

Results: Almost all exceptions from strong negative association rules are connected to at least one wrong attribute in the feature combination making up the rule. The fraction of annotation features flagged by this approach as suspicious is strongly enriched in errors and constitutes about 0.6% of the whole body of the similarity-transferred annotation in the PEDANT genome database. Positive rule mining does not identify two thirds of these errors. The approach based on exceptions from negative rules is much more specific than positive rule mining, but its coverage is significantly lower.

Conclusion: Mining of both negative and positive association rules is a potent tool for finding significant trends in protein annotation and flagging doubtful features for further inspection.
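As a minimal sketch of the idea described above, a negative rule can be read as "entries with all left-hand-side features should not carry the right-hand-side feature", and an exception is any entry that carries both. The feature names, the toy entries, and the function `negative_rule_exceptions` are hypothetical illustrations, not the authors' implementation or the PEDANT data model.

```python
from typing import FrozenSet, List, Set


def negative_rule_exceptions(
    entries: List[Set[str]], lhs: FrozenSet[str], forbidden: str
) -> List[int]:
    """Return indices of entries that violate the negative rule lhs => NOT forbidden.

    An entry is an exception when it has every LHS feature and also
    the feature that the rule says should be absent.
    """
    return [
        i
        for i, features in enumerate(entries)
        if lhs <= features and forbidden in features
    ]


# Hypothetical annotation entries, each a set of annotation features.
entries = [
    {"taxon:Bacteria", "kw:Transferase"},
    {"taxon:Bacteria", "kw:Hydrolase"},
    {"taxon:Bacteria", "kw:Nucleus"},    # violates the rule below
    {"taxon:Eukaryota", "kw:Nucleus"},   # rule does not apply (LHS not satisfied)
]

# Hypothetical strong negative rule: bacterial proteins are not nuclear.
rule_lhs = frozenset({"taxon:Bacteria"})
rule_forbidden = "kw:Nucleus"

flagged = negative_rule_exceptions(entries, rule_lhs, rule_forbidden)
print(flagged)  # [2]
```

A flagged entry only indicates that at least one feature in the combination deserves inspection; deciding which feature is wrong needs additional evidence, such as the taxonomic information mentioned in the Figure 2 caption.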

Figures

Figure 1
Distribution of negative association rule strength (the probability that a database entry satisfies the right-hand side of a rule, given that it satisfies the left-hand side). The minimal coverage counts used (the number of database entries that possess all features from the left-hand side of the rule) are 100 (blue), 200 (pink), and 500 (green). The threshold for the minimal leverage count (the difference between the observed rule frequency and the frequency expected by chance, given the frequencies of its RHS and LHS) was set to 100 in all calculations.
Figure 2
Fraction of annotation terms corrected based on the taxonomic information among all rule exceptions. The number of all exceptions found in each strength interval is shown above each bar.
Figure 3
Coverage of the negative and positive rule mining approaches. The numbers represent the percentage of all annotation features identified as potentially erroneous by each individual method and by both of them.
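The three rule measures named in the Figure 1 caption (coverage count, strength, leverage count) can be illustrated with a short sketch. It assumes that the right-hand side of a negative rule is the absence of a feature and that the leverage count is the observed co-occurrence count minus the count expected under independence; that count form is an interpretation of the caption, and the function `negative_rule_metrics` and the toy entries are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set


def negative_rule_metrics(
    entries: List[Set[str]], lhs: FrozenSet[str], absent: str
) -> Dict[str, float]:
    """Compute coverage, strength and leverage for the rule lhs => NOT absent."""
    n = len(entries)
    # Coverage count: entries possessing all LHS features.
    n_lhs = sum(1 for e in entries if lhs <= e)
    # Entries satisfying the RHS, i.e. lacking the feature.
    n_rhs = sum(1 for e in entries if absent not in e)
    # Entries satisfying both sides of the rule.
    n_both = sum(1 for e in entries if lhs <= e and absent not in e)
    # Strength: P(RHS | LHS).
    strength = n_both / n_lhs if n_lhs else 0.0
    # Leverage count (assumed form): observed minus expected by chance.
    leverage = n_both - (n_lhs * n_rhs / n if n else 0.0)
    return {"coverage": n_lhs, "strength": strength, "leverage": leverage}


# Hypothetical toy database of ten entries.
entries = (
    [{"taxon:Bacteria"} for _ in range(4)]
    + [{"taxon:Bacteria", "kw:Nucleus"}]
    + [{"taxon:Eukaryota", "kw:Nucleus"} for _ in range(4)]
    + [{"taxon:Eukaryota"}]
)

metrics = negative_rule_metrics(entries, frozenset({"taxon:Bacteria"}), "kw:Nucleus")
print(metrics)
```

In a real run, rules would be kept only if their coverage and leverage counts exceed the chosen minimums (100, 200, or 500 for coverage and 100 for leverage in Figure 1); the toy counts here are far below those thresholds and serve only to show the arithmetic.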
