Database (Oxford). 2012 Mar 20:2012:bas003.
doi: 10.1093/database/bas003. Print 2012.

AntiFam: a tool to help identify spurious ORFs in protein annotation

Ruth Y Eberhardt et al. Database (Oxford). 2012.

Abstract

As the deluge of genomic DNA sequence grows, the fraction of protein sequences that have been manually curated falls. In turn, as the number of laboratories able to sequence genomes in a high-throughput manner grows, those laboratories may often lack the informatics capability to accurately identify and annotate all genes within a genome. These issues have led to fears that transitive annotation errors are making sequence databases less reliable. During the lifetime of the Pfam protein families database, a number of protein families have been built that were later identified as composed solely of spurious open reading frames (ORFs), either on the opposite strand or in a different, overlapping reading frame with respect to the true protein-coding or non-coding RNA gene. These families were deleted and are no longer available in Pfam. However, we realized that they could serve a useful function in identifying new spurious ORFs. We have collected these families together in AntiFam, along with additional custom-made families of spurious ORFs. This resource currently contains 23 families that identified 1310 spurious proteins in UniProtKB and a further 4119 spurious proteins in a collection of metagenomic sequences. UniProt has adopted AntiFam as part of the UniProtKB quality control process and will investigate these spurious proteins for exclusion.
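
The two situations named in the abstract, an ORF on the opposite strand and an ORF in a different, overlapping reading frame, can be illustrated with a minimal Python sketch. It is not part of AntiFam, which works with protein families rather than coordinates; the `Orf` type, the function names, and all coordinates below are hypothetical and chosen only for illustration. Coordinates are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    """A predicted ORF (hypothetical type); 1-based, inclusive coordinates."""
    start: int
    end: int
    strand: str  # "+" or "-"


def overlap_nt(a: Orf, b: Orf) -> int:
    """Number of nucleotides shared by the two intervals (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def frame(orf: Orf) -> int:
    """Reading frame (0, 1 or 2), taken from the first base of the first codon.

    On the minus strand the first codon starts at the highest coordinate.
    Frames are only comparable between ORFs on the same strand.
    """
    first_base = orf.start if orf.strand == "+" else orf.end
    return first_base % 3


def relation(candidate: Orf, reference: Orf) -> str:
    """Describe how a candidate ORF sits relative to a trusted reference gene."""
    if overlap_nt(candidate, reference) == 0:
        return "no overlap"
    if candidate.strand != reference.strand:
        return "opposite strand"
    if frame(candidate) != frame(reference):
        return "same strand, different frame"
    return "same strand, same frame"


if __name__ == "__main__":
    # Invented coordinates, not taken from any real genome.
    reference = Orf(100, 399, "+")
    for candidate in (Orf(150, 449, "-"), Orf(101, 250, "+"), Orf(500, 799, "+")):
        print(candidate, relation(candidate, reference), overlap_nt(candidate, reference))
```

A coordinate check of this kind only flags candidates that overlap a gene that is already trusted; it says nothing about which of two overlapping ORFs is the real one, which is the question the curated families are meant to help answer.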

Figures

Figure 1.
Seed alignment for the AntiFam family derived from PF10695. Amino acids are colored by average similarity according to the BLOSUM62 amino acid substitution matrix from most similar (light blue) to less similar (gray). ‘S’ and ‘E’ in the first row stand for sequence start and sequence end, respectively. The final row features a consensus sequence. The alignment was displayed using the Belvu software (http://www.sanger.ac.uk/resources/software/seqtools/).
Figure 2.
Graphical representation of exemplar overlapping and spurious proteins. (a) Two proteins from the Corynebacterium efficiens genome that encode components of a restriction system. The C-termini of the two proteins overlap by 97 nt. (b) Two highly overlapping predicted proteins from the Rhodopirellula baltica genome, coded on opposite strands of DNA. The Q7UY10 protein contains two Pfam DUF1596 domains. There is no evidence that these are true expressed proteins. Green boxes represent regions matched by Pfam families, red shaded areas represent transmembrane domains predicted by Phobius (10), and blue shaded areas represent regions of low complexity (11).

References

    1. Magrane M, UniProt Consortium. UniProt Knowledgebase: a hub of integrated protein data. Database. 2011;2011:bar009.
    2. Brenner SE. Errors in genome annotation. Trends Genet. 1999;15:132–133.
    3. Schnoes AM, Brown SD, Dodevski I, Babbitt PC. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput. Biol. 2009;5:e1000605.
    4. Bork P, Bairoch A. Go hunting in sequence databases but watch out for the traps. Trends Genet. 1996;12:425–427.
    5. Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007;23:673–679.
