Comput Methods Programs Biomed. 2009 Mar;93(3):241-56.
doi: 10.1016/j.cmpb.2008.10.014. Epub 2008 Dec 19.

n-Gram characterization of genomic islands in bacterial genomes

Gordana M Pavlović-Lazetić et al. Comput Methods Programs Biomed. 2009 Mar.

Abstract

The paper presents a novel n-gram-based method for the analysis of bacterial genome segments known as genomic islands (GIs). Identifying GIs in bacterial genomes is an important task, since many of them represent inserts that may contribute to bacterial evolution and pathogenesis. To characterize GIs and distinguish them from the rest of the genome, a binary classification of islands based on n-gram frequency distribution has been performed. It consists of testing the agreement of the islands' n-gram frequency distributions with those of the complete genome and the backbone sequence. In addition, a statistic based on the maximal order Markov model is used to identify significantly overrepresented and underrepresented n-grams in islands. The results may be used as a basis for a Zipf-like analysis suggesting that some of the n-grams are overrepresented in a subset of islands and underrepresented in the backbone, or vice versa, thus complementing the binary classification. The method is applied to strain-specific regions in the Escherichia coli O157:H7 EDL933 genome (O-islands), resulting in two groups of O-islands with different n-gram characteristics. It refines a characterization based on other compositional features such as G+C content and codon usage, and may help in the identification of GIs, as well as in the research and development of adequate drugs targeting the virulence genes in them.
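The abstract does not spell out the statistic, so the following is only a minimal sketch of one commonly used formulation of a maximal order Markov z-score. In this formulation the expected count of an n-gram is estimated from the counts of its leading and trailing (n-1)-grams and of their shared (n-2)-gram core. The function names, the variance approximation, and the single-strand, overlapping counting are assumptions made for illustration and may differ from the authors' implementation.

```python
from collections import Counter
from math import sqrt


def ngram_counts(seq, n):
    """Count overlapping n-grams on a single strand (no wrap-around)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def markov_z(seq, word):
    """z-score of `word` under a maximal order Markov model (assumed form).

    expected = N(w1..w_{n-1}) * N(w2..wn) / N(w2..w_{n-1})
    Returns None when the estimate is undefined (zero core count or
    non-positive variance).
    """
    n = len(word)
    if n < 3:
        raise ValueError("word must have at least 3 symbols")
    observed = ngram_counts(seq, n)[word]
    shorter = ngram_counts(seq, n - 1)
    core = ngram_counts(seq, n - 2)
    left = shorter[word[:-1]]    # leading (n-1)-gram
    right = shorter[word[1:]]    # trailing (n-1)-gram
    mid = core[word[1:-1]]       # shared (n-2)-gram
    if mid == 0:
        return None
    expected = left * right / mid
    # Variance approximation commonly paired with this estimator (assumption).
    variance = expected * (mid - left) * (mid - right) / mid ** 2
    if variance <= 0:
        return None
    return (observed - expected) / sqrt(variance)


if __name__ == "__main__":
    toy = "ACGACTTCGACG"  # toy sequence, not real genome data
    print(ngram_counts(toy, 3)["ACG"], markov_z(toy, "ACG"))
```

Under this sketch a positive z-value marks an n-gram seen more often than its sub-words predict, and a negative one marks an n-gram seen less often. Comparing the z-values of an island with those of the backbone is the kind of contrast the abstract describes.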

Figures

Fig. 1
Distribution of 5-gram percentages in E. coli EDL933. Distribution of 5-gram percentages in the E. coli EDL933 complete genome, the backbone sequence, OIs from the AC class (a) and OIs from the 5DC class (b). Labeled coordinate axes are shown only for the complete genome subfigure. The histogram's x-axis shows percentage intervals; the y-axis shows the number of different 5-grams whose percent occurrence falls into each interval (e.g., in the complete genome there are 18 5-grams with percentages in the interval 0.00–0.025%). x-Values range from 0 to the percent occurrence of the most frequent 5-gram in the corresponding sequence (e.g., it is 0.28% for the complete E. coli EDL933 genome and corresponds to the most frequent 5-gram, CCAGC). y-Values range from 0 to the number of different 5-grams with percent occurrence belonging to the modal interval (e.g., in the complete genome, there are 132 5-grams with percentages in the interval 0.08125–0.09375%); y-values sum to 1024 (the number of different 5-grams). Axis scales for all the subfigures are the same as those shown in the complete genome subfigure. Similarity in shape is noticeable among sequences from the AC class represented in (a), as is their dissimilarity to sequences in the 5DC class represented in (b).
Fig. 2
Comparative Zipf analysis. The 20 topmost overrepresented and underrepresented tetragrams in the complete E. coli EDL933 genome – subfigures (1), (4), OI #28 – subfigures (2), (5) and OI #51 – subfigures (3), (6), with the corresponding z-values for the biased (according to the maximal order Markov model) tetragram frequency. The red line marked with triangles (▴) represents z-values of tetragram frequencies in descending order for the chosen sequence (complete genome, OI), while the blue line marked with circles (●) represents z-values of the corresponding tetragram frequencies in the backbone, and all the other lines represent z-values of the corresponding tetragram frequencies in other sequences. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)
Fig. 3
Comparative Zipf analysis. All the tetragrams (x-axis) are sorted according to descending z-value of the zM statistic (y-axis) in the complete genome sequence (thick line), and z-values for the corresponding tetragrams in all the other sequences are presented. Although the most overrepresented tetragrams may deviate highly among the sequences (left part of the figure), the most underrepresented ones tend to coincide (right part of the figure).
Fig. 4
z-Values in 10 kb windows for tetragrams overrepresented in the backbone of E. coli EDL933 and underrepresented in 5DC OIs. z-Plots are presented for the most underrepresented tetragram in each of the 5DC OIs among those overrepresented in the backbone (ACTA is the most underrepresented in OI #8 among the tetragrams overrepresented in the backbone, and similarly TCAT in OI #28, GTAA in OI #30, CTTC in OI #51, ATCC in OI #84, CGTT in OI #108, ATGT in OI #115, TTAT in OI #138, GAGA in OI #148). Tetragrams are ordered upward by the ordering of the corresponding OIs (first the tetragram for OI #8, followed by the tetragrams for OI #28, #30, #51, etc.). Narrow vertical rectangles delimit each of the OIs, and the color of each rectangle is the same as that of the z-plot of the tetragram most underrepresented in it. Horizontal lines represent z-values of −1 and +1 (the boundaries of under- and overrepresentation).
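Two of the captions above describe simple computations: the Fig. 1 histogram, which counts how many distinct n-grams fall into each percentage interval, and the Fig. 4 plots, which evaluate a statistic in fixed-size windows along the genome. The sketch below is a hedged illustration of both. The function names, the bin width parameter, the choice of non-overlapping windows by default, and the dropping of a trailing partial window are assumptions, since the captions do not state these details.

```python
from collections import Counter
from itertools import product


def ngram_percentages(seq, n, alphabet="ACGT"):
    """Percent occurrence of every possible n-gram, including unseen ones."""
    total = len(seq) - n + 1
    counts = Counter(seq[i:i + n] for i in range(total))
    words = ("".join(p) for p in product(alphabet, repeat=n))
    return {w: 100.0 * counts[w] / total for w in words}


def percentage_histogram(percentages, bin_width):
    """Number of distinct n-grams per percentage interval of width bin_width."""
    bins = Counter(int(p // bin_width) for p in percentages.values())
    return [bins[i] for i in range(max(bins) + 1)]


def windowed_values(seq, window, statistic, step=None):
    """Apply `statistic` to consecutive windows; returns (start, value) pairs.

    Windows do not overlap unless a smaller `step` is given (assumption);
    a trailing partial window is dropped.
    """
    step = window if step is None else step
    return [(start, statistic(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


if __name__ == "__main__":
    toy = "ACGTACGTAC"  # toy sequence, not real genome data
    print(percentage_histogram(ngram_percentages(toy, 2), 10.0))
    print(windowed_values("AACCGGTTAA", 4, lambda s: s.count("A")))
```

For a real genome one would call `ngram_percentages(genome, 5)` so that the histogram entries sum to 1024, and pass a z-score function for a chosen tetragram as `statistic` with `window=10_000`.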
