Front Microbiol. 2019 Apr 24;10:834.
doi: 10.3389/fmicb.2019.00834. eCollection 2019.

Large-Scale Genomics Reveals the Genetic Characteristics of Seven Species and Importance of Phylogenetic Distance for Estimating Pan-Genome Size

Sang-Cheol Park et al. Front Microbiol. 2019.

Abstract

For more than a decade, pan-genome analysis has been applied as an effective method for explaining variation in the genetic content of prokaryotic species. However, the genomic characteristics and detailed structures of gene pools have not been fully clarified, because most studies have used a small number of genomes. Here, we constructed pan-genomes of seven species in order to elucidate variations in the genetic contents of >27,000 genomes belonging to Streptococcus pneumoniae, Staphylococcus aureus subsp. aureus, Salmonella enterica subsp. enterica, Escherichia coli and Shigella spp., Mycobacterium tuberculosis complex, Pseudomonas aeruginosa, and Acinetobacter baumannii. This work showed that the pan-genomes of all seven species are open. Additionally, systematic evaluation of the characteristics of their pan-genomes revealed that phylogenetic distance provided valuable information for estimating the parameters for pan-genome size among several models, including Heaps' law. Our results provide a better understanding of the species and a solution to minimize sampling biases associated with genome-sequencing preferences for pathogenic strains.

Keywords: Heaps’ law; core-genome; estimation model; gene pool; large-scale genomics; pan-genome; seven species.

Figures

Figure 1
Growth curves of the pan- and core genomes of all seven species. Each color denotes one of the seven species. Each value represents the average core- or pan-genome size over 100 randomly generated strain orders. In the core-genome graph, the fluctuation at the beginning occurs because the core-genome cut-off value was fixed as a ratio of the number of genomes used.
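The averaging over random strain orders described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each strain is already reduced to a set of gene-family identifiers, and it borrows the 99% sharing cut-off from the Figure 2 caption as the core-genome ratio. The function name `growth_curves` and its parameters are hypothetical.

```python
import random


def growth_curves(genomes, n_orders=100, core_ratio=0.99, seed=0):
    """Average pan- and core-genome sizes over random strain orders.

    genomes: list of sets of gene-family identifiers, one set per strain
             (an assumed input format, not specified in the source).
    Returns two lists of length len(genomes): the mean pan-genome size and
    the mean core-genome size after adding 1, 2, ..., N strains.
    """
    rng = random.Random(seed)
    n = len(genomes)
    pan_sum = [0.0] * n
    core_sum = [0.0] * n
    for _ in range(n_orders):
        order = list(genomes)
        rng.shuffle(order)
        counts = {}  # gene -> number of strains seen so far that carry it
        for i, genes in enumerate(order, start=1):
            for gene in genes:
                counts[gene] = counts.get(gene, 0) + 1
            # Pan-genome: every gene seen in at least one strain so far.
            pan_sum[i - 1] += len(counts)
            # Core genome: genes carried by at least core_ratio of the
            # i strains added so far (the cut-off scales with i).
            core_sum[i - 1] += sum(1 for c in counts.values() if c >= core_ratio * i)
    pan = [total / n_orders for total in pan_sum]
    core = [total / n_orders for total in core_sum]
    return pan, core
```

Because the cut-off is a ratio of the number of strains added so far, it amounts to requiring presence in every strain while few genomes have been added, and only relaxes once there are enough genomes for 1% to correspond to a whole strain.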
Figure 2
Gene-sharing ratio within species. The number of genes in one strain corresponds to the radius of the circle. For each gene, the number of strains carrying the gene was divided by the total number of strains of the species, and this value was defined as the gene-sharing ratio. The ratio was mapped to one of 11 colors depending on its value. The lighter the color (closer to white), the more commonly the gene is shared among the strains. Genes observed in ≥ 99% of strains were categorized as core genome (white). The ratio reveals the composition of the gene-pool structure.
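The gene-sharing ratio defined in this caption is a simple per-gene frequency. A minimal sketch, again assuming each strain is given as a set of gene-family identifiers (the function names are hypothetical):

```python
from collections import Counter


def gene_sharing_ratios(genomes):
    """Return {gene: fraction of strains carrying that gene}.

    genomes: list of sets of gene-family identifiers, one per strain
             (an assumed input format).
    """
    n_strains = len(genomes)
    counts = Counter(gene for genes in genomes for gene in set(genes))
    return {gene: count / n_strains for gene, count in counts.items()}


def core_genes(ratios, cutoff=0.99):
    """Genes whose sharing ratio meets the core cut-off (>= 99% in the caption)."""
    return {gene for gene, ratio in ratios.items() if ratio >= cutoff}
```

For example, with four strains `[{"a", "b"}, {"a"}, {"a", "c"}, {"a", "b"}]`, gene `a` is present in all strains and is the only one that passes the 99% cut-off.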
Figure 3
Comparison of the results of each estimation method according to cross-validation. The boxplots illustrate the root-mean-square errors (RMSEs) of four different models among the seven species. Using a randomly selected sample order, 10-fold cross-validation was conducted for Heaps’ law and the proposed models (M1, M2, and M3). The genomes are randomly divided into 10 equal subgroups, and the estimation model is built with the data from nine of the subgroups. The accuracy of each model was evaluated by measuring the RMSE between the pan-genome size estimated by the candidate model and the observed pan-genome size. In all species, the M3 model outperformed the others upon comparison of the median RMSEs.
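The cross-validation procedure can be illustrated for the Heaps' law baseline alone. The sketch below assumes the usual power-law form P(N) = κ·N^γ, fitted by least squares in log-log space; the symbol κ, the fitting method, and the choice to split growth-curve points (rather than genomes) into folds are simplifying assumptions, not details taken from the source. The proposed models M1 to M3 are not reproduced because their functional forms are not given here.

```python
import math
import random


def fit_heaps(ns, ps):
    """Fit log P = log(kappa) + gamma * log N by ordinary least squares.

    ns: numbers of genomes (N >= 1); ps: observed pan-genome sizes (P > 0).
    Returns (kappa, gamma).
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(p) for p in ps]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    kappa = math.exp(mean_y - gamma * mean_x)
    return kappa, gamma


def cross_validated_rmse(ns, ps, k=10, seed=0):
    """k-fold cross-validation: fit on k-1 folds, report RMSE on the held-out fold.

    Requires len(ns) >= k so that every fold is non-empty.
    """
    indices = list(range(len(ns)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subgroups
    rmses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        kappa, gamma = fit_heaps([ns[i] for i in train], [ps[i] for i in train])
        squared = [(kappa * ns[i] ** gamma - ps[i]) ** 2 for i in held_out]
        rmses.append(math.sqrt(sum(squared) / len(squared)))
    return rmses
```

The ten RMSE values returned for one model correspond to one box in a boxplot like the one described; comparing medians across models then follows the comparison in the caption.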
Figure 4
Relationship between genome size and model parameters. Heaps’ law and model M3, which showed the best results in 10-fold cross-validation, were used for comparison. The Y-axis shows the model parameters: (A) the Heaps’ law parameter (γ); (B) the sum of β1, the coefficient of the tree distance, and β2, the genome coefficient, in the proposed model (M3). Each color indicates one of the seven species. In Heaps’ law, regardless of the species, the parameters showed only two patterns, simply increasing or decreasing; in the M3 model, by contrast, the parameters showed different patterns depending on the species.
