Genome Biol Evol. 2010 Jan 25;2:117-31.

doi: 10.1093/gbe/evq004.

Distinguishing microbial genome fragments based on their composition: evolutionary and comparative genomic perspectives

Scott C Perry¹, Robert G Beiko

PMID: 20333228
PMCID: PMC2839357
DOI: 10.1093/gbe/evq004

Scott C Perry et al. Genome Biol Evol. 2010.

Affiliation

¹ Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada.

Abstract

It is well known that patterns of nucleotide composition vary within and among genomes, although the reasons why these variations exist are not completely understood. Between-genome compositional variation has been exploited to assign environmental shotgun sequences to their most likely originating genomes, whereas within-genome variation has been used to identify recently acquired genetic material such as pathogenicity islands. Recent sequence assignment techniques have achieved high levels of accuracy on artificial data sets, but the relative difficulty of distinguishing lineages with varying degrees of relatedness, and different types of genomic sequence, has not been examined in depth. We investigated the compositional differences in a set of 774 sequenced microbial genomes, finding rapid divergence among closely related genomes, but also convergence of compositional patterns among genomes with similar habitats. Support vector machines were then used to distinguish all pairs of genomes based on genome fragments 500 nucleotides in length. The nearly 300,000 accuracy scores obtained from these trials were used to construct general models of distinguishability versus taxonomic and compositional indices of genomic divergence. Unusual genome pairs were evident from their large residuals relative to the fitted model, and we identified several factors including genome reduction, putative lateral genetic transfer, and habitat convergence that influence the distinguishability of genomes. The positional, compositional, and functional context of a fragment within a genome has a strong influence on its likelihood of correct classification, but in a way that depends on the taxonomic and ecological similarity of the comparator genome.

Keywords: genome composition; metagenomics; phylogenetic classification; support vector machines.
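
The fragment-based comparison described in the abstract can be illustrated with a minimal sketch. It assumes that tetranucleotide frequencies serve as the compositional features (the figure captions refer to tetranucleotides, but the abstract does not state the feature set), that fragments are non-overlapping, and that only the forward strand is counted. The function names `fragments`, `tetranucleotide_frequencies`, and `genome_pairs` are hypothetical and are not taken from the paper. The support vector machine step is omitted; the resulting vectors are simply the kind of input such a classifier could be trained on.

```python
from itertools import product
from math import comb

# All 256 tetranucleotides in a fixed order, so that every fragment
# maps to the same feature layout.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def fragments(genome, size=500):
    """Split a sequence into non-overlapping fragments of `size` nt.

    A trailing piece shorter than `size` is dropped (an assumption).
    """
    return [genome[i:i + size] for i in range(0, len(genome) - size + 1, size)]


def tetranucleotide_frequencies(fragment):
    """Return a 256-element vector of relative tetranucleotide frequencies."""
    counts = [0] * len(TETRAMERS)
    seq = fragment.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:  # windows containing N or other symbols are skipped
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def genome_pairs(n_genomes):
    """Number of unordered genome pairs among n_genomes genomes."""
    return comb(n_genomes, 2)


if __name__ == "__main__":
    toy_genome = "ACGT" * 300  # 1,200 nt toy sequence, not real data
    frags = fragments(toy_genome)
    vector = tetranucleotide_frequencies(frags[0])
    print(len(frags), len(vector))
    print(genome_pairs(774))
```

With 774 genomes, `genome_pairs(774)` gives 299,151 unordered pairs, which matches the "nearly 300,000" accuracy scores mentioned in the abstract.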

Figures

**FIG. 1.**
Clustering of 774 prokaryotic genomes from a matrix of PTE distances. Edges in the tree are colored according to the legend if their descendant leaves all belong to the same phylum; internal edges that subtend >1 phylum are black. Numbers and letters indicate sets of genomes that are split or merged in ways that are consistent with genome size or habitat. In the detailed subtrees, individual genomes are identified using genus, species, and NCBI project ID; these identifiers can be cross-referenced with strain and other information at http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi. 1: a set of reduced genomes (maximum genome size = 1.1 Mbp) with low genomic G + C content that belong to phyla Bacteroidetes, Tenericutes, and Proteobacteria. 2: dispersal of phylum Aquificae (comprising *Aquifex aeolicus*, *Hydrogenobaculum* sp. YO4AAS1, and *Sulfurihydrogenibium* sp. YO3AOP1) into two distinct groups. Group 2a includes members of phylum Thermotogae including *Thermotoga maritima*, whereas Group 2b includes mesophilic ε-Proteobacteria. 3: clustering of *Salinibacter ruber* (highlighted) with haloarchaea and methanogens. 4: splitting of sequenced *Prochlorococcus marinus* genomes (highlighted) into three groups. Group 4a includes the low-light-adapted strains MIT 9313 and MIT 9303, which have relatively large genomes (>2.5 Mbp), in close association with marine *Synechococcus*. Group 4b includes four low-light-adapted strains with genome sizes of ∼1.8 Mbp and close compositional affinities to lactic acid bacteria and the obligate intracellular endosymbiont Candidatus *Protochlamydia amoebophila*. Group 4c includes the high-light-adapted strains, with the marine α-Proteobacterium Candidatus *Pelagibacter ubique*.

**FIG. 2.**
Classification accuracy (CA) versus genetic distance between 16S rDNA genes for 210,439 pairs of genomes. The genome pairs listed in supplementary table S2 (Supplementary Material online) are highlighted with large dots and the identifier for each pair; empty circles indicate 16S distances that were computed from ClustalW2 alignments.

**FIG. 3.**
CA of all comparisons at each taxonomic level. At each level, CA scores were assigned to 11 bins: one bin for CA of 50% or lower, and 10 bins covering intervals of 5% in the range (50%, 100%]. Accuracy levels are shown using a color gradient: the deepest blue bar indicates the proportion of comparisons with CA = 50% or less, whereas the top, deep red bar in each column indicates CA > 95%. The lightest colored bar corresponds to 70% < CA ≤ 75%.
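
The binning scheme in this caption can be sketched as follows. The sketch assumes right-closed intervals, as implied by "70% < CA ≤ 75%", and the function names `ca_bin` and `bin_proportions` are hypothetical; the scores in the usage lines are made-up values for illustration, not results from the paper.

```python
import math
from collections import Counter


def ca_bin(ca_percent):
    """Map a CA score (in percent) to one of 11 bins.

    Bin 0 holds CA <= 50. Bins 1 to 10 hold the right-closed intervals
    (50, 55], (55, 60], ..., (95, 100].
    """
    if not 0 <= ca_percent <= 100:
        raise ValueError("CA must lie between 0 and 100 percent")
    if ca_percent <= 50:
        return 0
    return math.ceil((ca_percent - 50) / 5)


def bin_proportions(scores):
    """Return the proportion of scores falling in each of the 11 bins."""
    if not scores:
        raise ValueError("at least one score is required")
    counts = Counter(ca_bin(s) for s in scores)
    return [counts.get(b, 0) / len(scores) for b in range(11)]


if __name__ == "__main__":
    toy_scores = [40, 50, 72, 98]  # illustrative values only
    print(bin_proportions(toy_scores))
```

Under this convention, bin 5 corresponds to the lightest colored bar (70% < CA ≤ 75%) and bin 10 to the deep red bar (CA > 95%).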

**FIG. 4.**
Mean and range of cluster accuracies for 16 genome pairs. The CA of each cluster was computed in the strict sense, with any assignment to another cluster deemed a misclassification. Minimum, mean, and maximum accuracy scores are shown using bars and diamonds, whereas gray rectangles indicate the overall CA for that comparison when k = 6.

**FIG. 5.**
Visualization of cluster misclassification between *Sulcia muelleri* (b1–b6) and *Buchnera aphidicola* strain Cc (a1–a6). The thickness of the ribbon emanating from the most counterclockwise (e.g., at the left of cluster b5) position of the cluster indicates the proportion of that cluster that was misclassified. The ribbon connected to the most clockwise position of each cluster indicates the number of other fragments that were mistakenly given this cluster assignment by the SVM.

**FIG. 6.**
Visualization of cluster misclassification between *Prochlorococcus marinus* strains MIT 9303 (a1–a6) and AS9601 (b1–b6). The thickness of the ribbon emanating from the most counterclockwise (e.g., at the left of cluster b2) position of the cluster indicates the proportion of that cluster that was misclassified. The ribbon connected to the most clockwise position of each cluster indicates the number of other fragments that were mistakenly given this cluster assignment by the SVM.

**FIG. 7.**
Heatmaps showing the relative frequency of representative tetranucleotides in six unsupervised clusters of fragments from two pairs of genomes. Each individual heatmap corresponds to one numbered cluster of sequences from a given genome, with each row showing the frequency profile for an individual fragment. The color gradient ranges from red (tetranucleotide is absent from a given fragment) through orange and yellow to white (tetranucleotide frequency is maximal given the data set). The mean G + C content for each cluster is indicated in parentheses, whereas colored borders indicate paired clusters that are frequently conflated by the SVM, corresponding to thick connecting edges in figures 5 and 6. (a) *Buchnera aphidicola* versus *Sulcia muelleri*, with heatmap columns corresponding to the frequencies of AAAT, TTTC, GCCG, AAAC, TGTA, AACG, GCCA, ATCG, and TTGA. (b) *Prochlorococcus marinus* MIT 9303 versus *P. marinus* AS9601, with heatmap columns corresponding to frequencies of AATT, CCAA, CTTC, CGGC, AGCA, CTGG, GTAG, GCAT, GGAT, and ATCA. Although the tetranucleotides with highest loadings on the first ten principal components were chosen to illustrate compositional variation, only nine appear in (a) because the tetranucleotide TGTA had the highest loading on both components 5 and 6.

Cited by

Genome Signature Difference between Deinococcus radiodurans and Thermus thermophilus.
Nishida H, Abe R, Nagayama T, Yano K. Int J Evol Biol. 2012;2012:205274. doi: 10.1155/2012/205274. Epub 2012 Mar 4. PMID: 22500246. Free PMC article.
A Markovian analysis of bacterial genome sequence constraints.
Skewes AD, Welch RD. PeerJ. 2013 Aug 29;1:e127. doi: 10.7717/peerj.127. eCollection 2013. PMID: 24010012. Free PMC article.
Resolving prokaryotic taxonomy without rRNA: longer oligonucleotide word lengths improve genome and metagenome taxonomic classification.
Alsop EB, Raymond J. PLoS One. 2013 Jul 1;8(7):e67337. doi: 10.1371/journal.pone.0067337. Print 2013. PMID: 23840870. Free PMC article.
Computational tools for viral metagenomics and their application in clinical research.
Fancello L, Raoult D, Desnues C. Virology. 2012 Dec 20;434(2):162-74. doi: 10.1016/j.virol.2012.09.025. Epub 2012 Oct 11. PMID: 23062738. Free PMC article. Review.
Classifying short genomic fragments from novel lineages using composition and homology.
Parks DH, MacDonald NJ, Beiko RG. BMC Bioinformatics. 2011 Aug 9;12:328. doi: 10.1186/1471-2105-12-328. PMID: 21827705. Free PMC article.
