BMC Bioinformatics. 2002;3:5.
doi: 10.1186/1471-2105-3-5. Epub 2002 Feb 5.

Re-annotation of genome microbial coding-sequences: finding new genes and inaccurately annotated genes

Stéphanie Bocs et al. BMC Bioinformatics. 2002.

Abstract

Background: Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gene prediction remains quite difficult. A major reason for this situation is that genes are tightly packed in prokaryotes, resulting in frequent overlap. Thus, detection of translation initiation sites and/or selection of the correct coding regions remain difficult unless appropriate biological knowledge (about the structure of a gene) is embedded in the approach.

Results: We have developed a new program that automatically identifies biologically significant candidate genes in a bacterial genome. Twenty-six complete prokaryotic genomes were analyzed using this tool, and the accuracy of gene finding was assessed by comparison with existing annotations. This analysis revealed that, despite the enormous effort of genome program annotators, a small but not negligible number of genes annotated within the framework of sequencing projects are likely to be partially inaccurate or plainly wrong. Moreover, the analysis of several putative new genes shows that, as expected, many short genes have escaped annotation. In most cases, these new genes revealed frameshifts that could be either artifacts or genuine frameshifts. Some entirely unexpected new genes have also been identified. This allowed us to get a more complete picture of prokaryotic genomes. The results of this procedure are progressively integrated into the SWISS-PROT reference databank.

Conclusions: The results described in the present study show that our procedure is very satisfactory in terms of gene finding accuracy. Except in a few cases, discrepancies between our results and annotations provided by individual authors can be accounted for by the nature of each annotation process or by specific characteristics of some genomes. This stresses that close cooperation between scientists and regular updating and curation of the findings in databases are clearly required to reduce the level of errors in genome annotation (and hence to reduce the unfortunate spreading of errors through centralized data libraries).

Figures

Figure 2
Assignment of a status to some additional CDSs. A. The annotated Genes Not Found by the AMIGA method (CDSd). B. The potential AMIGA New Genes (CDSa). The procedure takes into account the length of the CDS, its coding probability, the results of a similarity search in the non-redundant protein databank, and overlaps between adjacent CDSs, one being an AMIGA CDS (CDSa) and the other a databank CDS (CDSd) (see text). Although all situations are investigated in the procedure, some paths are clearly preferred (thick arrows): for example, a CDSa of the lst-NG>=Sure-Pc list is often found with no overlap with a CDSd. In this case, the CDSa often has a length below 300 bp and either no similarity (AMBIGUOUS status) or similarity (NEW status) with proteins in the databank. If a CDSa does overlap a CDSd, the latter often has a weak coding probability and no similarity with proteins in the databank (in this case, the CDSa has the NEW status). It is therefore extremely rare to find a CDSa of the lst-NG>=Sure-Pc list overlapping a CDSd that has a strong coding probability, with the overlap between the two CDSs also being substantial (broken arrows). In the case of A. pernix and P. horikoshii, the threshold for the CDSd length was set to 600 bp instead of 300 bp. This choice is motivated by the nature of the annotation procedure used by the authors of the genome sequences (see text). (L) length; (Pc) coding probability; (lst-NG>=Sure-Pc) list of CDSa having a coding probability above 0.4; (lst-GNF<Min-Pc) list of CDSd having a coding probability below 0.2.
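The preferred paths named in this caption can be written down as a small decision function. The following is only a sketch under stated assumptions, not the AMIGA implementation: the function names, the argument layout, and the "UNDETERMINED" fallback label are hypothetical, and "weak coding probability" is assumed here to mean a Pc below the 0.2 cutoff quoted in the caption. Only the AMBIGUOUS and NEW statuses, the 300 bp and 600 bp length thresholds, and the two organism names come from the caption itself.

```python
from typing import Optional

# Organisms for which the caption reports a 600 bp CDSd length threshold.
LONG_THRESHOLD_ORGANISMS = {"A. pernix", "P. horikoshii"}


def cdsd_length_threshold(organism: str) -> int:
    """Return the CDSd length threshold (bp) stated in the caption."""
    return 600 if organism in LONG_THRESHOLD_ORGANISMS else 300


def assign_cdsa_status(
    cdsa_has_similarity: bool,
    overlapping_cdsd_pc: Optional[float],
    cdsd_has_similarity: bool = False,
    min_pc: float = 0.2,  # assumption: "weak" Pc means below this cutoff
) -> str:
    """Status of a CDSa taken from the lst-NG>=Sure-Pc list.

    overlapping_cdsd_pc is None when the CDSa overlaps no databank CDS.
    """
    if overlapping_cdsd_pc is None:
        # No overlap: the status follows the similarity search result.
        return "NEW" if cdsa_has_similarity else "AMBIGUOUS"
    if overlapping_cdsd_pc < min_pc and not cdsd_has_similarity:
        # Overlap with a weakly coding CDSd that has no databank similarity.
        return "NEW"
    # Rare paths (broken arrows) are not detailed in the caption;
    # this label is a hypothetical placeholder.
    return "UNDETERMINED"


print(assign_cdsa_status(cdsa_has_similarity=False, overlapping_cdsd_pc=None))
```

The sketch covers only the thick-arrow cases; the full procedure described in the text also weighs CDS length and the extent of the overlap.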
Figure 1
Overall strategy of the CDS (re-)annotation of the bacterial genomes. The procedure involves four main steps (see text), the last being performed on potential New Genes having a coding probability above 0.4 (list lst-NG>=Sure-Pc), and on annotated Genes Not Found having a coding probability below 0.2 (list lst-GNF<Min-Pc). (WWDDL) World-Wide DNA Data Library (GenBank/EMBL-EBI/DDBJ); (Pc) coding probability.
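To make the two lists concrete, here is a minimal sketch of how candidates might be filtered by coding probability. The `CDS` record, its field names, and the `build_lists` helper are hypothetical and introduced only for illustration; the 0.4 and 0.2 cutoffs are the ones given in the caption.

```python
from dataclasses import dataclass
from typing import List, Tuple

SURE_PC = 0.4  # cutoff for potential New Genes (from the caption)
MIN_PC = 0.2   # cutoff for annotated Genes Not Found (from the caption)


@dataclass
class CDS:
    """Hypothetical record for a coding sequence."""
    name: str
    length_bp: int
    pc: float  # coding probability


def build_lists(
    new_genes: List[CDS], genes_not_found: List[CDS]
) -> Tuple[List[CDS], List[CDS]]:
    """Return (lst-NG>=Sure-Pc, lst-GNF<Min-Pc) as described in the caption."""
    lst_ng = [c for c in new_genes if c.pc > SURE_PC]
    lst_gnf = [c for c in genes_not_found if c.pc < MIN_PC]
    return lst_ng, lst_gnf


# Made-up example values, for illustration only.
candidates = [CDS("cdsa_1", 240, 0.55), CDS("cdsa_2", 180, 0.30)]
unmatched = [CDS("cdsd_1", 270, 0.10), CDS("cdsd_2", 900, 0.65)]
lst_ng, lst_gnf = build_lists(candidates, unmatched)
print([c.name for c in lst_ng], [c.name for c in lst_gnf])
```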
