Missing genes in the annotation of prokaryotic genomes

Andrew S Warren¹, Jeremy Archuleta, Wu-Chun Feng, João Carlos Setubal

PMID: 20230630
PMCID: PMC3098052
DOI: 10.1186/1471-2105-11-131

Andrew S Warren et al. BMC Bioinformatics. 2010 Mar 15;11:131. doi: 10.1186/1471-2105-11-131.

Affiliation

¹ Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA, USA. anwarren@vt.edu

Abstract

Background: Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However, there have been reports that prokaryotic gene-finder programs have problems with small genes (either over-predicting or under-predicting them). The question therefore arises as to whether current genome annotations are systematically missing small genes.

Results: We have developed a high-performance computing methodology to investigate this problem. In this methodology we compare all ORFs larger than or equal to 33 aa from all fully sequenced prokaryotic replicons. Based on that comparison, and using conservative criteria requiring a minimum taxonomic diversity between conserved ORFs in different genomes, we have discovered 1,153 candidate genes that are missing from current genome annotations. These missing genes are similar only to each other and do not have any strong similarity to gene sequences in public databases, with the implication that these ORFs belong to missing gene families. We also uncovered 38,895 intergenic ORFs, readily identified as putative genes by similarity to currently annotated genes (we call these absent annotations). The vast majority of the missing genes found are small (less than 100 aa). A comparison of select examples with GeneMark, EasyGene and Glimmer predictions yields evidence that some of these genes are escaping detection by these programs.
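As a rough illustration of the first step described above (collecting every ORF of at least 33 aa from a replicon), the following Python sketch scans all six reading frames of a DNA string. It is not the authors' pipeline. The ORF definition used here (first start codon after a stop, up to the next in-frame stop, with ATG, GTG and TTG as starts) and the function names `find_orfs` and `reverse_complement` are assumptions made for this example; the abstract does not state how ORFs were delimited.

```python
from typing import List, Tuple

STOPS = {"TAA", "TAG", "TGA"}
# Assumption: the three most common prokaryotic start codons.
STARTS = {"ATG", "GTG", "TTG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_aa: int = 33) -> List[Tuple[str, int, int]]:
    """Return (strand, start, length_aa) for every ORF with >= min_aa codons.

    `start` is the 0-based offset of the start codon on the scanned strand,
    and `length_aa` counts codons from the start codon up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None:
                    if codon in STARTS:
                        start = i  # open an ORF at the first start codon
                elif codon in STOPS:
                    length_aa = (i - start) // 3
                    if length_aa >= min_aa:
                        orfs.append((strand, start, length_aa))
                    start = None  # close the ORF and look for the next start
    return orfs
```

Calling `find_orfs(replicon_sequence)` on each replicon would give the pool of candidate ORFs; the all-against-all comparison and the taxonomic-diversity filter described in the abstract are separate steps that this sketch does not attempt.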

Conclusions: Prokaryotic gene finders and prokaryotic genome annotations require improvement for accurate prediction of small genes. The number of missing gene families found is likely a lower bound on the actual number, due to the conservative criteria used to determine whether an ORF corresponds to a real gene.

Figures

**Figure 1**
**Sequence search setup**. Process of creating subject DB and query sequences.

**Figure 2**
**ORF category breakdown**. All ORFs generated for prokaryotic replicons from RefSeq.

**Figure 3**
**α score distribution**. **Panel a**: Distribution of α scores for missing genes, missing gene groups, and absent annotations. **Panel b**: Distribution of α scores for missing genes from groups that do and do not have a representative alignment to nr-aa. Density refers to kernel density [41,42]. Kernel density graphs were generated using the R sm package [42,43], where the bandwidth (smoothing parameter) is calculated as the mean of the normal optimal values for the different groups. Kernel density plots can be thought of as smooth histograms that place a Gaussian function, instead of a box, at each observation. This explains why the left and right tails extend beyond the defined bounds of the α function (0 and 100).
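The tail behaviour described in the Figure 3 caption can be reproduced with a small Gaussian kernel density estimate. The sketch below is in Python rather than R and does not call the sm package. The score lists are made-up values for illustration, not data from the paper, and the bandwidth formula (sample standard deviation times (4/(3n))^(1/5)) is the usual one-dimensional normal-optimal rule, assumed here to correspond to what the caption calls the normal optimal value.

```python
import math
from statistics import mean, stdev


def normal_optimal_bandwidth(xs):
    """Normal-optimal bandwidth for one-dimensional data (assumed rule)."""
    n = len(xs)
    return stdev(xs) * (4.0 / (3.0 * n)) ** 0.2


def gaussian_kde(xs, h):
    """Return a function estimating the density with Gaussian kernels of width h."""
    n = len(xs)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        # One Gaussian bump centered at each observation, summed and normalized.
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs)

    return density


# Hypothetical scores bounded to [0, 100]; not taken from the paper.
group_a = [0.0, 5.0, 10.0, 20.0, 40.0]
group_b = [60.0, 80.0, 95.0, 100.0, 100.0]

# Shared bandwidth: mean of the per-group normal-optimal values.
h = mean([normal_optimal_bandwidth(group_a), normal_optimal_bandwidth(group_b)])
kde_a = gaussian_kde(group_a, h)
kde_b = gaussian_kde(group_b, h)

# Observations sit at the bounds, so the Gaussian bumps spill past 0 and 100.
print(kde_a(-5.0) > 0.0, kde_b(105.0) > 0.0)
```

Because each Gaussian has unbounded support, any observation near 0 or 100 contributes density outside the interval, which is why the plotted tails cross the bounds even though no score lies beyond them.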

**Figure 4**
**Distribution of taxonomic orders**. The distribution of taxonomic orders among missing genes. This histogram contains more orders than Table 1, hence the category 'other' is not directly comparable.

**Figure 5**
**Missing gene group multiple alignment**. A multiple alignment of missing gene Group 32. The green box shows the upstream RBS site "AGGAG", and the red lines mark the boundaries of the conserved ORF. The multiple alignment includes an additional 30 bp of genomic DNA upstream and downstream.

**Figure 6**
**Length distribution**. Distribution of sequence length for missing genes, missing gene groups, and absent annotations.

References

1. Galperin MY, Koonin EV. 'Conserved hypothetical' proteins: prioritization of targets for experimental study. Nucleic Acids Research. 2004;32(18):5452–63. doi: 10.1093/nar/gkh885.
2. Roberts RJ. Identifying protein function-a call for community action. PLoS Biology. 2004;2(3):E42. doi: 10.1371/journal.pbio.0020042.
3. Frishman D. Protein annotation at genomic scale: the current status. Chemical Reviews. 2007;107(8):3448–66. doi: 10.1021/cr068303k.
4. Larsen TS, Krogh A. EasyGene-a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics. 2003;4:21. doi: 10.1186/1471-2105-4-21.
5. Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007;23(6):673–679. doi: 10.1093/bioinformatics/btm009.
