Genome Biol Evol. 2010 Jan 25;2:117-31.
doi: 10.1093/gbe/evq004.

Distinguishing microbial genome fragments based on their composition: evolutionary and comparative genomic perspectives

Scott C Perry et al. Genome Biol Evol.

Abstract

It is well known that patterns of nucleotide composition vary within and among genomes, although the reasons why these variations exist are not completely understood. Between-genome compositional variation has been exploited to assign environmental shotgun sequences to their most likely originating genomes, whereas within-genome variation has been used to identify recently acquired genetic material such as pathogenicity islands. Recent sequence assignment techniques have achieved high levels of accuracy on artificial data sets, but the relative difficulty of distinguishing lineages with varying degrees of relatedness, and different types of genomic sequence, has not been examined in depth. We investigated the compositional differences in a set of 774 sequenced microbial genomes, finding rapid divergence among closely related genomes, but also convergence of compositional patterns among genomes with similar habitats. Support vector machines were then used to distinguish all pairs of genomes based on genome fragments 500 nucleotides in length. The nearly 300,000 accuracy scores obtained from these trials were used to construct general models of distinguishability versus taxonomic and compositional indices of genomic divergence. Unusual genome pairs were evident from their large residuals relative to the fitted model, and we identified several factors including genome reduction, putative lateral genetic transfer, and habitat convergence that influence the distinguishability of genomes. The positional, compositional, and functional context of a fragment within a genome has a strong influence on its likelihood of correct classification, but in a way that depends on the taxonomic and ecological similarity of the comparator genome.
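
The fragment-based setup described above can be illustrated with a minimal sketch. It cuts a genome sequence into 500-nucleotide fragments (the length stated in the abstract) and turns each fragment into a composition vector that a classifier such as a support vector machine could consume. Several details are assumptions made for illustration and are not stated in the abstract: the fragments are taken as non-overlapping, tetranucleotide frequencies are used as the features, windows containing ambiguous bases are skipped, and the names `split_into_fragments` and `tetranucleotide_frequencies` are hypothetical.

```python
from itertools import product

FRAGMENT_LENGTH = 500  # fragment size stated in the abstract
K = 4                  # assumption: tetranucleotide features

# All 256 tetranucleotides in a fixed order, so every vector is comparable.
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def split_into_fragments(sequence, length=FRAGMENT_LENGTH):
    """Cut a sequence into non-overlapping fragments (assumption);
    a trailing piece shorter than `length` is dropped."""
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, length)]


def tetranucleotide_frequencies(fragment):
    """Return relative tetranucleotide frequencies for one fragment.
    Windows containing characters other than A, C, G, T are skipped."""
    fragment = fragment.upper()
    counts = [0] * len(KMERS)
    total = 0
    for i in range(len(fragment) - K + 1):
        idx = KMER_INDEX.get(fragment[i:i + K])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


if __name__ == "__main__":
    toy_genome = "ACGT" * 300  # 1,200 nt toy sequence, not real data
    fragments = split_into_fragments(toy_genome)
    vectors = [tetranucleotide_frequencies(f) for f in fragments]
    print(len(fragments), len(vectors[0]))
```

Each vector has 256 entries, one per tetranucleotide; labelled vectors from two genomes would then form the training and test sets for a pairwise classifier.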

Keywords: genome composition; metagenomics; phylogenetic classification; support vector machines.

Figures

FIG. 1.—
Clustering of 774 prokaryotic genomes from a matrix of PTE distances. Edges in the tree are colored according to the legend if their descendant leaves all belong to the same phylum; internal edges that subtend >1 phylum are black. Numbers and letters indicate sets of genomes that are split or merged in ways that are consistent with genome size or habitat. In the detailed subtrees, individual genomes are identified using genus, species, and NCBI project ID: these identifiers can be cross-referenced with strain and other information at URL http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi. 1: a set of reduced genomes (maximum genome size = 1.1 Mbp) with low genomic G + C content that belong to phyla Bacteroidetes, Tenericutes, and Proteobacteria. 2: dispersal of phylum Aquificae (comprising Aquifex aeolicus, Hydrogenobaculum sp. YO4AAS1, and Sulfurihydrogenibium sp. YO3AOP1) into two distinct groups. Group 2a includes members of phylum Thermotogae including Thermotoga maritima, whereas Group 2b includes mesophilic ε-Proteobacteria. 3: clustering of Salinibacter ruber (highlighted) with haloarchaea and methanogens. 4: splitting of sequenced Prochlorococcus marinus genomes (highlighted) into three groups. Group 4a includes the low–light-adapted strains MIT 9313 and MIT 9303, which have relatively large genomes (>2.5 Mbp) in close association with marine Synechococcus. Group 4b includes four low–light-adapted strains with genome sizes ∼1.8 Mbp and close compositional affinities to lactic acid bacteria and the obligate intracellular endosymbiont Candidatus Protochlamydia amoebophila. Group 4c includes the high–light-adapted strains, with the marine α-Proteobacterium Candidatus Pelagibacter ubique.
FIG. 2.—
CA versus genetic distance between 16S rDNA genes for 210,439 pairs of genomes. The genome pairs listed in supplementary table S2 (Supplementary Material online) are highlighted with large dots and the identifier for each pair; empty circles indicate 16S distances that were computed from ClustalW2 alignments.
FIG. 3.—
CA of all comparisons at each taxonomic level. At each level, CA scores were assigned to 11 bins: a bin for CA 50% and lower, and 10 bins covering intervals of 5% in the range (50%, 100%]. Accuracy levels are shown using a color gradient: the deepest blue bar indicates the proportion of comparisons with CA = 50% or less, whereas the top, deep red bar in each column indicates CA > 95%. The lightest colored bar corresponds to 70% < CA ≤ 75%.
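
The 11-bin scheme in this caption can be written out as a short sketch. It assumes CA is given as a percentage and that each 5% interval is open on the left and closed on the right, which is how the caption describes the lightest bar (70% < CA ≤ 75%). The function names `ca_bin` and `bin_proportions` are hypothetical.

```python
import math


def ca_bin(ca_percent):
    """Map a CA score (in percent) to a bin index 0..10.
    Bin 0 holds CA <= 50; bin i (1..10) holds 50 + 5*(i-1) < CA <= 50 + 5*i."""
    if not 0 <= ca_percent <= 100:
        raise ValueError("CA must lie between 0 and 100 percent")
    if ca_percent <= 50:
        return 0
    return math.ceil((ca_percent - 50) / 5)


def bin_proportions(ca_scores):
    """Return the proportion of comparisons falling in each of the 11 bins."""
    counts = [0] * 11
    for score in ca_scores:
        counts[ca_bin(score)] += 1
    total = len(ca_scores)
    return [c / total for c in counts] if total else [0.0] * 11


if __name__ == "__main__":
    toy_scores = [48.0, 50.0, 72.5, 75.0, 96.0, 100.0]  # illustrative values only
    print(bin_proportions(toy_scores))
```

Under this indexing, bin 0 corresponds to the deepest blue bar, bin 5 to the lightest bar, and bin 10 to the deep red bar at the top of each column.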
FIG. 4.—
Mean and range of cluster accuracies for 16 genome pairs. The CA of each cluster was computed in the strict sense, with any assignment to another cluster deemed a misclassification. Minimum, mean, and maximum accuracy scores are shown using bars and diamonds, whereas gray rectangles indicate the overall CA for that comparison when k = 6.
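
The strict per-cluster accuracy described in this caption can be sketched as follows: a fragment counts as correct only if it is assigned to its own cluster, and assignment to any other cluster, including another cluster from the same genome, counts as an error. The sketch assumes cluster labels such as `a1` or `b3` and takes the mean as an unweighted average over clusters, which the caption does not specify. The function names are hypothetical.

```python
from collections import Counter


def strict_cluster_accuracies(true_labels, predicted_labels):
    """Per-cluster accuracy in the strict sense: a prediction is correct
    only when it matches the fragment's own cluster label exactly."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {cluster: correct[cluster] / totals[cluster] for cluster in totals}


def summarize(accuracies):
    """Return (minimum, unweighted mean, maximum) over the clusters."""
    values = list(accuracies.values())
    return min(values), sum(values) / len(values), max(values)


if __name__ == "__main__":
    # Toy labels for illustration only, not data from the study.
    true = ["a1", "a1", "a2", "a2", "b1", "b1"]
    pred = ["a1", "a2", "a2", "a2", "a1", "b1"]
    print(summarize(strict_cluster_accuracies(true, pred)))
```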
FIG. 5.—
Visualization of cluster misclassification between Sulcia muelleri (b1–b6) and Buchnera aphidicola strain Cc (a1–a6). The thickness of the ribbon emanating from the most counterclockwise (e.g., at the left of cluster b5) position of the cluster indicates the proportion of that cluster that was misclassified. The ribbon connected to the most clockwise position of each cluster indicates the number of other fragments that were mistakenly given this cluster assignment by the SVM.
FIG. 6.—
Visualization of cluster misclassification between Prochlorococcus marinus strains MIT 9303 (a1–a6) and AS9601 (b1–b6). The thickness of the ribbon emanating from the most counterclockwise (e.g., at the left of cluster b2) position of the cluster indicates the proportion of that cluster that was misclassified. The ribbon connected to the most clockwise position of each cluster indicates the number of other fragments that were mistakenly given this cluster assignment by the SVM.
FIG. 7.—
Heatmaps showing the relative frequency of representative tetranucleotides in six unsupervised clusters of fragments from two pairs of genomes. Each individual heatmap corresponds to one numbered cluster of sequences from a given genome, with each row showing the frequency profile for an individual fragment. The color gradient ranges from red (tetranucleotide is absent from a given fragment) through orange and yellow to white (tetranucleotide frequency is maximal given the data set). The mean G + C content for each cluster is indicated in parentheses, whereas colored borders indicate paired clusters that are frequently conflated by the SVM, corresponding to thick connecting edges in figures 5 and 6. (a) Buchnera aphidicola versus Sulcia muelleri, with heatmap columns corresponding to the frequencies of AAAT, TTTC, GCCG, AAAC, TGTA, AACG, GCCA, ATCG, and TTGA. (b) Prochlorococcus marinus MIT 9303 versus P. marinus AS9601, with heatmap columns corresponding to frequencies of AATT, CCAA, CTTC, CGGC, AGCA, CTGG, GTAG, GCAT, GGAT, and ATCA. Although the tetranucleotides with highest loadings on the first ten principal components were chosen to illustrate compositional variation, only nine appear in (a) because the tetranucleotide TGTA had the highest loading on both components 5 and 6.
