Tuberculosis (Edinb). 2013 Jan;93(1):18-25.
doi: 10.1016/j.tube.2012.11.012. Epub 2012 Dec 26.

Reannotation of translational start sites in the genome of Mycobacterium tuberculosis

Michael A DeJesus et al. Tuberculosis (Edinb). 2013 Jan.

Abstract

Identification and correction of incorrect ORF start sites is important for a variety of experimental and analytical purposes, ranging from cloning to inference of operon structure. The genome of the H37Rv reference strain of Mycobacterium tuberculosis (Mtb) was annotated when it was first sequenced nearly 15 years ago. While this annotation has served the TB research community well as a standard of reference for over a decade, it has been demonstrated experimentally that the actual start sites for an estimated 5-10% of open reading frames (ORFs) differ from the annotation. In this paper, we present a comprehensive bioinformatic analysis of all 3989 ORFs in the M. tuberculosis H37Rv genome. Our method combines information from comparative analysis (alignment to start sites of orthologs in other Actinobacteria), sequence conservation, "protein likeness", putative ribosome binding sites, and other data to identify translational start sites. The features are combined in a linear model that is trained on a dataset of known start sites verified by mass spectrometry, with a cross-validated accuracy of 94%. The method can be viewed as an augmentation of Hidden Markov Model-based tools such as Glimmer and GeneMark: it incorporates more information than just the raw genomic sequence to decide which position is the legitimate translational start site for each ORF. Using this analysis, we identify 269 genes that most likely need to be re-annotated, and identify the best alternative translational start site for each. These revised ORF definitions could be used in the reannotation of the H37Rv genome, as well as to prioritize genes for experimental start-site validation.
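
As a rough illustration of how several features can be combined in a linear model to rank candidate start sites, the sketch below scores each candidate as a weighted sum and picks the highest-scoring one. The feature names, the weights, and the `Candidate` class are hypothetical placeholders chosen for this example; the abstract does not give the actual features' encoding or the trained coefficients.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate start codon for an ORF (all fields are illustrative)."""
    offset: int                   # codon offset from the annotated start (0 = annotated)
    ortholog_support: float       # fraction of orthologs whose start aligns here
    conservation_jump: float      # change in sequence conservation at this position
    protein_likeness_jump: float  # change in protein-likeness at this position
    has_rbs: bool                 # putative ribosome binding site found upstream


# Hypothetical weights; a real model would learn these from verified start sites.
WEIGHTS = {
    "ortholog_support": 2.0,
    "conservation_jump": 1.0,
    "protein_likeness_jump": 1.0,
    "has_rbs": 0.5,
}


def score(c: Candidate, weights: dict = WEIGHTS, bias: float = 0.0) -> float:
    """Linear score: bias plus the weighted sum of the candidate's features."""
    return (
        bias
        + weights["ortholog_support"] * c.ortholog_support
        + weights["conservation_jump"] * c.conservation_jump
        + weights["protein_likeness_jump"] * c.protein_likeness_jump
        + weights["has_rbs"] * float(c.has_rbs)
    )


def best_start(candidates: list) -> Candidate:
    """Return the candidate with the highest linear score."""
    return max(candidates, key=score)


annotated = Candidate(0, 0.2, 0.1, 0.0, False)
alternative = Candidate(-12, 0.8, 0.6, 0.3, True)
chosen = best_start([annotated, alternative])
print(chosen.offset)
```

In this toy input the alternative site 12 codons upstream outscores the annotated one, which is the kind of case that would be flagged for re-annotation.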

Figures

Figure 1
Histogram of the number of orthologs for genes in H37Rv among 10 related mycobacterial species.
Figure 2
Multiple alignment of orthologs of Rv0557 (PimB). The best ortholog of Rv0557 in 10 other organisms was identified, and the genomic regions within ±150 bp of the annotated start sites were translated into amino acids and aligned with ClustalW. Start codons are underlined. Gaps are indicated by ‘-’. Stop codons are indicated by ‘*’. The hypothetical translation of upstream regions is shown in lower-case letters. The positions where a start codon was observed in any other organism are shown in yellow. The positions where legitimate alternative start codons occur in the Mtb sequence are shown in magenta. The annotated start in Mtb is colored in cyan.
Figure 3
Histogram of the number of different start sites observed in the multiple alignment among orthologs for each gene in H37Rv.
Figure 4
Histogram of the delta score, representing consistency of start sites. The number of orthologs with start sites matching the annotated start site in H37Rv is compared to the number of orthologs agreeing with the start site that has the most orthologs overall. Smaller delta values represent more consistency.
Figure 5
Histogram of the codon distance between the annotated start site in H37Rv and the alternative start site agreed upon by the majority of orthologs.
Figure 6
Average conservation score. The line with filled circles shows the smoothed conservation score, γ̄(i), which characterizes the average sequence conservation of the 10 amino acids downstream of a given position. It was calculated for positions around the annotated start site (‘offset’) and averaged across all genes. Larger values represent higher sequence conservation. The line with empty squares shows an approximation of the derivative, γ′(i), used to distinguish the transition point that is observed at start sites.
Figure 7
Average protein-likeness score. The line with filled circles shows the protein-likeness score, π(i), which characterizes how protein-like the sequence of 10 amino acids downstream of a given start site is. It was calculated for positions around the annotated start site (‘offset’) and averaged across all genes. Smaller values represent sequences that are more protein-like. The line with empty squares shows an approximation of the derivative, π′(i), which was used to distinguish the transition that is observed at start sites.
Figure 8
a) Distribution of nucleotides in the untranslated region upstream of start codons for all ORFs in the Mtb genome. b) Distribution of the positions at which 5 bp purine subsequences start (if multiple instances occur, the one closest to the start codon is chosen).
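
The purine-run search described in the Figure 8b caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the upstream sequence is given 5′→3′ and ends immediately before the start codon, treats "closest" as the qualifying window whose start lies nearest the start codon, and reports positions as negative offsets. The function name `closest_purine_run` is made up for this example.

```python
PURINES = frozenset("AG")


def closest_purine_run(upstream: str, k: int = 5):
    """Find the k-bp all-purine window nearest the start codon.

    `upstream` is the sequence immediately 5' of the start codon, so its
    last base sits at offset -1. Returns the offset (a negative integer)
    of the first base of the window, or None if no window qualifies.
    """
    seq = upstream.upper()
    n = len(seq)
    # Scan from the window nearest the start codon back toward the 5' end.
    for i in range(n - k, -1, -1):
        if all(base in PURINES for base in seq[i:i + k]):
            return i - n
    return None


# Two overlapping purine windows (AGGAG, GGAGG); the one nearer the start wins.
print(closest_purine_run("CCTTAGGAGGTCCTTC"))
```

Collecting these offsets over all ORFs and histogramming them would give a distribution of the kind the caption describes.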
