BMC Genomics. 2011 Jan 13;12:32.
doi: 10.1186/1471-2164-12-32.

Genomic fluidity: an integrative view of gene diversity within microbial populations

Andrey O Kislyuk et al. BMC Genomics. 2011.

Abstract

Background: The dual concepts of pan and core genomes have been widely adopted as means to assess the distribution of gene families within microbial species and genera. The core genome is the set of genes shared by a group of organisms; the pan genome is the set of all genes seen in any of these organisms. A variety of methods have provided drastically different estimates of the sizes of pan and core genomes from sequenced representatives of the same groups of bacteria.

Results: We use a combination of mathematical, statistical and computational methods to show that current predictions of pan and core genome sizes may have no correspondence to true values. Pan and core genome size estimates are problematic because they depend on the estimation of the occurrence of rare genes and genomes, respectively, which are difficult to estimate precisely because they are rare. Instead, we introduce and evaluate a robust metric - genomic fluidity - to categorize the gene-level similarity among groups of sequenced isolates. Genomic fluidity is a measure of the dissimilarity of genomes evaluated at the gene level.

Conclusions: The genomic fluidity of a population can be estimated accurately given a small number of sequenced genomes. Further, the genomic fluidity of groups of organisms can be compared robustly despite variation in algorithms used to identify genes and their homologs. As such, we recommend that genomic fluidity be used in place of pan and core genome size estimates when assessing gene diversity within genomes of a species or a group of closely related organisms.
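The abstract describes genomic fluidity only in words. The sketch below shows one way it could be computed, assuming the pairwise form given in the full paper (not stated on this page): for each pair of genomes, the number of gene families found in only one of the two is divided by the sum of the two genomes' family counts, and the ratios are averaged over all pairs. The function name and the toy gene-family labels are illustrative, and every genome is assumed to contain at least one gene family.

```python
from itertools import combinations


def genomic_fluidity(genomes):
    """Average, over all genome pairs, of unique families / total families.

    `genomes` is a list of sets; each set holds the gene-family IDs of one genome.
    Assumed definition (see lead-in): phi = mean of (U_k + U_l) / (M_k + M_l).
    """
    if len(genomes) < 2:
        raise ValueError("need at least two genomes")
    ratios = []
    for a, b in combinations(genomes, 2):
        unique = len(a ^ b)      # families present in exactly one genome of the pair
        total = len(a) + len(b)  # family count of a plus family count of b
        ratios.append(unique / total)
    return sum(ratios) / len(ratios)


# Toy input: three genomes with four gene families each (labels are made up).
toy = [
    {"f1", "f2", "f3", "f4"},
    {"f1", "f2", "f3", "f5"},
    {"f1", "f2", "f6", "f7"},
]
phi = genomic_fluidity(toy)
```

Under this definition, identical genomes give a fluidity of 0 and genomes with no shared families give 1.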

Figures

Figure 1
Radically different pan and core genome sizes cannot be estimated from sampled genomes. (A) Two species with vastly different true gene distributions: (i) Species A (blue) with a pan genome of 10^5 genes and a core genome of 10^3 genes; (ii) Species B (green) with a pan genome of 10^7 genes and a core genome of 10 genes. Each genome has 2000 genes randomly chosen from the true gene distribution, each gene according to its frequency. (B) The number of genes (y-axis) observed as a function of the number of sampled genomes (x-axis). Note that despite differences in the true distribution, the observed gene distributions are statistically indistinguishable given 100 sampled genomes. For example, there were approximately 2200 genes found in just 1 of 100 genomes for both Species A and Species B. (C) Observed pan genome size as a function of the number of sampled genomes. The true pan genome size cannot be extrapolated from the observed pan genome curves. (See Additional file 1, Figure S1 for further details.) (D) Observed core genome size as a function of the number of sampled genomes. The true core genome size cannot be extrapolated from the observed core genome curves.
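The observed pan and core genome curves in panels (C) and (D) can be illustrated with a small sketch: the observed pan genome is the union of the genes seen so far, and the observed core genome is their intersection. The gene model below is a toy assumption (a fixed core plus genes drawn uniformly from a rare pool), not the frequency distribution used in the paper, and all names and sizes are hypothetical.

```python
import random


def accumulation_curves(genomes):
    """Return (pan_sizes, core_sizes) as genomes are added one at a time."""
    pan, core = set(), None
    pan_sizes, core_sizes = [], []
    for g in genomes:
        pan |= g                                    # union of all genes seen so far
        core = set(g) if core is None else core & g # genes present in every genome so far
        pan_sizes.append(len(pan))
        core_sizes.append(len(core))
    return pan_sizes, core_sizes


def sample_genome(rng, core_genes, rare_genes, n_rare):
    """Toy genome: every core gene plus n_rare genes drawn uniformly from the rare pool."""
    return set(core_genes) | set(rng.sample(rare_genes, n_rare))


rng = random.Random(0)            # fixed seed so the run is repeatable
core_genes = range(0, 100)        # hypothetical core of 100 genes
rare_genes = range(100, 10_100)   # hypothetical pool of 10,000 rare genes
genomes = [sample_genome(rng, core_genes, rare_genes, 100) for _ in range(20)]
pan_sizes, core_sizes = accumulation_curves(genomes)
```

By construction the pan curve can only grow and the core curve can only shrink as genomes are added, so the values seen at any finite sample size bound the true sizes on one side only.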
Figure 2
True differences in genomic fluidity φ can be detected from a small number of sampled genomes. (A) Two species with subtle differences in true gene distributions: (i) Species A (blue) as in Figure 1, with a pan genome of 10^5 genes and a core genome of 10^3 genes; (ii) Species C (red) with a pan genome of 10^5 genes and a core genome of 10^3 genes. Each genome has 2000 genes randomly chosen from the true gene distribution, each gene according to its frequency. (B) The number of genes (y-axis) observed as a function of the number of sampled genomes (x-axis). The observed gene distributions are statistically distinguishable. (C) Fluidity as a function of the number of sampled genomes is an unbiased estimator of the true value (dashed lines within red and blue shaded regions). The shaded regions denote the theoretical prediction for mean and standard deviations as inferred from the jackknife estimate (see Methods).
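The caption refers to a jackknife estimate of the spread of fluidity. The sketch below uses the standard delete-one jackknife over genomes; whether the paper's Methods use exactly this variant is an assumption, as is the pairwise fluidity definition in the helper. Function names are illustrative.

```python
from itertools import combinations
from math import sqrt


def fluidity(genomes):
    """Assumed pairwise definition: mean of unique families / total families."""
    pairs = list(combinations(genomes, 2))
    return sum(len(a ^ b) / (len(a) + len(b)) for a, b in pairs) / len(pairs)


def jackknife_fluidity(genomes):
    """Return (fluidity, jackknife standard error) using leave-one-genome-out."""
    n = len(genomes)
    if n < 3:
        raise ValueError("leave-one-out needs at least three genomes")
    # Recompute fluidity n times, each time dropping one genome.
    loo = [fluidity(genomes[:i] + genomes[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    # Standard delete-one jackknife variance: (n - 1) / n * sum of squared deviations.
    variance = (n - 1) / n * sum((x - mean_loo) ** 2 for x in loo)
    return fluidity(genomes), sqrt(variance)


# Toy input with made-up gene-family labels.
toy = [
    {"f1", "f2", "f3", "f4"},
    {"f1", "f2", "f3", "f5"},
    {"f1", "f2", "f6", "f7"},
]
phi, phi_se = jackknife_fluidity(toy)
```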
Figure 3
Schematic of bioinformatics fluidity pipeline. (A) Genomes are annotated automatically to minimize curation bias [39]; (B) For a given pair of genomes, all genes are compared using an all vs. all protein alignment; (C) Shared genes are identified based on whether alignment identity and coverage exceed the thresholds i and c, respectively; (D) Gene families are calculated based on a maximal clustering rule; (E) The number of shared genes is found for each pair of genomes, G_i and G_j, from which the number of unique genes can be calculated. Refer to the Methods for complete details of the pipeline and Additional file 1, Table S1 for a complete list of genomes analyzed.
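Steps (C) through (E) can be sketched as follows. This is a minimal illustration, assuming that the "maximal clustering rule" means single-linkage grouping (any two genes connected by a chain of passing alignments fall in one family); the caption does not define the rule, so that reading is a guess. The hit tuples, gene names, and function names are hypothetical and do not correspond to any particular aligner's output format.

```python
def cluster_gene_families(genes, hits, min_identity, min_coverage):
    """Group genes into families: connected components over passing hits.

    `hits` is an iterable of (gene_a, gene_b, identity, coverage) tuples.
    """
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving keeps trees shallow
            g = parent[g]
        return g

    for a, b, identity, coverage in hits:
        # A hit links two genes only if both values exceed their thresholds.
        if identity > min_identity and coverage > min_coverage:
            parent[find(a)] = find(b)

    families = {}
    for g in genes:
        families.setdefault(find(g), set()).add(g)
    return list(families.values())


def shared_and_unique(genome_a, genome_b, families):
    """Return (shared, unique_to_a, unique_to_b) counted in gene families."""
    family_of = {g: idx for idx, fam in enumerate(families) for g in fam}
    fams_a = {family_of[g] for g in genome_a}
    fams_b = {family_of[g] for g in genome_b}
    shared = len(fams_a & fams_b)
    return shared, len(fams_a) - shared, len(fams_b) - shared


# Made-up genes and alignment hits for two genomes.
genome_1 = ["g1_a", "g1_b"]
genome_2 = ["g2_a", "g2_b", "g2_c"]
hits = [
    ("g1_a", "g2_a", 0.90, 0.95),
    ("g1_b", "g2_b", 0.60, 0.99),  # identity too low, so no link
    ("g2_a", "g2_c", 0.80, 0.80),
]
families = cluster_gene_families(genome_1 + genome_2, hits, 0.74, 0.74)
shared, unique_1, unique_2 = shared_and_unique(genome_1, genome_2, families)
```

Counting in families rather than genes means that paralogs within one genome (here `g2_a` and `g2_c`) contribute a single family.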
Figure 4
Estimates of mean fluidity converge with increases in sampled genomes. Fluidity was calculated as described in the text given alignment parameters i = 0.74 and c = 0.74. The variance of fluidity is estimated as a total variance, containing both the variance due to subsampling within the sample of genomes, and the variance due to the limited number of sampled genomes. For dependence of fluidity on genomes sampled for the two other sets of alignment parameters in Figure 5, see Additional file 1, Figures S3-S4.
Figure 5
Estimates of mean and standard deviation of fluidity for 7 multiply-sequenced species. Mean and standard deviation of φ are calculated for B. anthracis (Ba), E. coli (Ec), N. meningitidis (Nm), Staph. aureus (Sa), Strep. agalactiae (Sag), Strep. pneumoniae (Spn), and Strep. pyogenes (Spy) as a function of alignment parameters. Although fluidity increases with higher values of identity (i) and coverage (c) (see Additional file 1, Figure S5), only three rank-orderings of fluidity (of 5040 possible orderings) are found in 224/225 combinations of alignment parameters.
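The figure of 5040 possible orderings is 7!, the number of ways to rank seven species. The sketch below shows how rank-orderings could be tallied across parameter combinations. The fluidity values are made-up placeholders for three unnamed species, not results from the paper, and the function names are illustrative.

```python
from math import factorial


def rank_ordering(fluidity_by_species):
    """Species labels sorted from lowest to highest fluidity."""
    return tuple(sorted(fluidity_by_species, key=fluidity_by_species.get))


def count_orderings(results_by_params):
    """Map each observed rank-ordering to the number of parameter sets producing it."""
    counts = {}
    for values in results_by_params.values():
        order = rank_ordering(values)
        counts[order] = counts.get(order, 0) + 1
    return counts


possible = factorial(7)  # number of ways to order 7 species

# Placeholder values keyed by (identity, coverage); NOT data from the paper.
toy = {
    (0.50, 0.50): {"A": 0.10, "B": 0.20, "C": 0.30},
    (0.74, 0.74): {"A": 0.12, "B": 0.22, "C": 0.31},
    (0.90, 0.90): {"A": 0.25, "B": 0.24, "C": 0.40},
}
counts = count_orderings(toy)
```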
Figure 6
Fluidity increases with phylogenetic scale. Fluidity of multiply-resequenced species is in the range of 0.1 - 0.3 and the fluidity of all genomes included in the analysis approaches 1. Each circle represents the relative fluidity at a species (with multiple sequenced genomes) or internal node (the fluidity of all the genomes in the tree below it). Open circles are φ = 1 and black circles are φ = 0. The phylogenetic tree of 29 bacterial species was assembled using AMPHORA [2]. Branch lengths correspond to the average number of amino acid substitutions per position in well-conserved marker genes.
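The fluidity at an internal node, taken over all genomes in the tree below it, can be sketched with a recursive walk. The tree encoding (nested lists with genome names as leaves), the node labels, and the toy gene sets are assumptions for illustration, as is the pairwise fluidity definition in the helper. Each internal node is assumed to have at least two genomes beneath it.

```python
from itertools import combinations


def fluidity(genomes):
    """Assumed pairwise definition: mean of unique families / total families."""
    pairs = list(combinations(genomes, 2))
    return sum(len(a ^ b) / (len(a) + len(b)) for a, b in pairs) / len(pairs)


def annotate(tree, gene_sets, results, label="root"):
    """Record fluidity for every internal node; return the leaf names below `tree`."""
    if isinstance(tree, str):  # a leaf is a genome name
        return [tree]
    leaves = []
    for i, child in enumerate(tree):
        leaves += annotate(child, gene_sets, results, f"{label}.{i}")
    results[label] = fluidity([gene_sets[name] for name in leaves])
    return leaves


# Toy tree: two groups of two genomes each (all values are made up).
tree = [["g1", "g2"], ["g3", "g4"]]
gene_sets = {
    "g1": {1, 2, 3, 4},
    "g2": {1, 2, 3, 5},
    "g3": {1, 6, 7, 8},
    "g4": {1, 6, 7, 9},
}
results = {}
annotate(tree, gene_sets, results)
```

In this toy input the root value is larger than either group's value, which mirrors the caption's statement that fluidity grows with phylogenetic scale.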

References

    1. Lander ES, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860–921. doi: 10.1038/35057062.
    2. Wu M, Eisen JA. A simple, fast, and accurate method of phylogenomic inference. Genome Biol. 2008;9(10):R151. doi: 10.1186/gb-2008-9-10-r151.
    3. Mardis ER. The impact of next-generation sequencing technology on genetics. Trends Genet. 2008;24(3):133–141.
    4. Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol. 2008;26(10):1135–1145. doi: 10.1038/nbt1486.
    5. Tettelin H, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial "pan-genome". Proc Natl Acad Sci USA. 2005;102(39):13950–13955. doi: 10.1073/pnas.0506758102.
