Front Microbiol. 2017 Jul 31:8:1345.
doi: 10.3389/fmicb.2017.01345. eCollection 2017.

Pan-genome Analyses of the Species Salmonella enterica, and Identification of Genomic Markers Predictive for Species, Subspecies, and Serovar

Chad R Laing et al. Front Microbiol.

Abstract

Food safety is a global concern, with upward of 2.2 million deaths due to enteric disease every year. Current whole-genome sequencing platforms allow routine sequencing of enteric pathogens for surveillance and during outbreaks; however, a remaining challenge is identifying genomic markers that are predictive of the strain groups that pose the most significant threats to human health or that can persist in specific environments. We previously developed the software program Panseq, which identifies the pan-genome among a group of sequences, and the SuperPhy platform, which uses this pan-genome information to identify biomarkers that are predictive of groups of bacterial strains. In this study, we examined the pan-genome of 4893 genomes of Salmonella enterica, an enteric pathogen responsible for the loss of more disability-adjusted life years than any other enteric pathogen. We identified a pan-genome of 25.3 Mbp, a strict core of 1.5 Mbp present in all genomes, and a conserved core of 3.2 Mbp found in at least 96% of these genomes. We also identified 404 genomic regions of 1000 bp that were specific to the species S. enterica. These species-specific regions were found to encode mostly hypothetical proteins, effectors, and other proteins related to virulence. Markers unique to each of the six S. enterica subspecies were identified. No serovar had pan-genome regions that were present in all of its genomes and absent from all other serovars; however, each serovar did have genomic regions that were universally present among its constituent members and statistically predictive of the serovar. The phylogeny based on SNPs within the conserved core genome was highly concordant with the phylogeny based on the presence/absence of 1000 bp regions of the entire pan-genome. Future studies could use these predictive regions as components of a vaccine to prevent salmonellosis, as well as in simple and rapid diagnostic tests for both in silico and wet-lab applications, with uses ranging from food safety to public health. Lastly, the tools and methods described in this study could be applied as a pan-genomics framework in other population genomic studies seeking to identify markers for other bacterial species and their sub-groups.
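The core definitions and the serovar-marker criterion in the abstract can be illustrated with a minimal Python sketch. It is not the Panseq or SuperPhy implementation. The function names, the toy data, and the layout of `presence` (region name mapped to the set of genomes carrying it) are hypothetical, and the one-sided Fisher's exact test is an assumed stand-in for "statistically predictive", since the abstract does not name the statistic.

```python
from math import comb
from typing import Dict, List, Set


def classify_regions(presence: Dict[str, Set[str]], genomes: List[str],
                     conserved_fraction: float = 0.96) -> Dict[str, str]:
    """Label each region as strict core, conserved core, or accessory."""
    all_genomes = set(genomes)
    labels = {}
    for region, carriers in presence.items():
        n_present = len(carriers & all_genomes)
        if n_present == len(all_genomes):
            labels[region] = "strict_core"      # present in every genome
        elif n_present / len(all_genomes) >= conserved_fraction:
            labels[region] = "conserved_core"   # present in at least 96%
        else:
            labels[region] = "accessory"
    return labels


def enrichment_p_value(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for a 2x2 table.

    a/b: group genomes with/without the region;
    c/d: other genomes with/without the region.
    """
    in_group, out_group, with_region = a + b, c + d, a + c
    total_tables = comb(in_group + out_group, with_region)
    upper = min(in_group, with_region)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    tail = sum(comb(in_group, k) * comb(out_group, with_region - k)
               for k in range(a, upper + 1))
    return tail / total_tables


def group_markers(presence: Dict[str, Set[str]], genomes: List[str],
                  group: List[str], alpha: float = 0.05) -> List[str]:
    """Regions carried by every group member and enriched in the group."""
    group_set = set(group)
    others = set(genomes) - group_set
    markers = []
    for region, carriers in presence.items():
        if not group_set <= carriers:
            continue  # not universally present in the group
        c = len(carriers & others)
        if enrichment_p_value(len(group_set), 0, c, len(others) - c) < alpha:
            markers.append(region)
    return sorted(markers)


# Toy data (assumed): 50 genomes, the first five forming one serovar.
genomes = [f"g{i}" for i in range(50)]
serovar = genomes[:5]
presence = {
    "r_core": set(genomes),       # carried by all genomes
    "r_cons": set(genomes[:49]),  # carried by 98% of genomes
    "r_x": set(genomes[:6]),      # all serovar members plus one outsider
    "r_rare": {"g0", "g1"},       # only part of the serovar
}
print(classify_regions(presence, genomes))
print(group_markers(presence, genomes, serovar))
```

In this toy set, `r_x` is carried by one genome outside the serovar, so it is not unique to the group, yet it still qualifies as a marker. That mirrors the abstract's distinction between regions exclusive to a serovar and regions that are universal within it and statistically predictive of it.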

Keywords: Salmonella; food safety; genomics; pan-genome; predictive markers.

Figures

FIGURE 1
The distribution of the Salmonella enterica pan-genome, as 1000 bp fragments, among 4939 whole-genome sequences (WGSs).
FIGURE 2
The carriage of the 404 S. enterica species-specific regions in each of the 4939 genomes of this study. Each dot represents a single S. enterica genome; genomes are arranged in order from those that contain the fewest species-specific regions to those that contain the most.
FIGURE 3
The carriage of the 404 S. enterica species-specific regions versus the number of contigs for each of the 4936 genomes. Colors indicate the subspecies within S. enterica as follows: red: arizonae, lime: diarizonae, teal: enterica, blue: houtenae, lavender: indica, magenta: salamae, and yellow: sample with Citrobacter contamination.
FIGURE 4
The phylogeny of the 4893 S. enterica genomes after quality filtering, with the number of genomes from each serovar limited to five. The name of each serovar is presented as text, and the six subspecies are shown as colored circles as follows: teal: arizonae, blue: diarizonae, dark orange: enterica, peach: houtenae, dark green: indica, light orange: salamae.
FIGURE 5
The phylogeny of the 4893 S. enterica genomes after quality filtering, based on SNPs found within the conserved core genome. The 10 most abundant serovars of subspecies enterica in the current study (Agona, Bareilly, Enteritidis, Heidelberg, Kentucky, Newport, Paratyphi, Typhi, Typhimurium, Weltevreden) are labeled on the tree. The matrix to the right of the phylogeny shows the 404 species-specific regions for each genome of the study, with blue indicating the absence of a region and green its presence.
FIGURE 6
The phylogeny of the 4893 S. enterica genomes after quality filtering, based on the presence/absence of the entire pan-genome as 1000 bp fragments. The 10 most abundant serovars of subspecies enterica in the current study (Agona, Bareilly, Enteritidis, Heidelberg, Kentucky, Newport, Paratyphi, Typhi, Typhimurium, Weltevreden) are labeled on the tree. The matrix to the right of the phylogeny shows the 404 species-specific regions for each genome of the study, with blue indicating the absence of a region and green its presence.
FIGURE 7
The number of predictive markers from the GenBank dataset found within the EnteroBase dataset for nine serovars of S. enterica, which encompassed a test set of 3948 genomes. The number of genomes for each serovar was the same in the GenBank and EnteroBase datasets, as shown in Table 6. The size of each circle is proportional to the number of predictive markers from the GenBank dataset found in the EnteroBase dataset. The number of genomes for each serovar is given in the horizontal axis label. Using serovar Agona as an example, there were 136 genomes in both the GenBank and EnteroBase datasets; 129 of the 161 predictive markers from the GenBank dataset were found in all of the EnteroBase genomes, whereas 21 of the GenBank predictive markers were found in all but one (135) of the EnteroBase genomes examined.
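The tally summarized in Figure 7 (how many training-set markers are carried by all, all but one, and so on, of the test-set genomes) can be sketched as follows. This is an illustrative sketch only: `carriage_tally`, the marker names, and the toy carriage data are hypothetical and are not taken from the study.

```python
from collections import Counter
from typing import Dict, List, Set


def carriage_tally(markers: List[str], carriers: Dict[str, Set[str]],
                   test_genomes: List[str]) -> Counter:
    """Map 'number of test genomes carrying a marker' to 'number of markers'."""
    test_set = set(test_genomes)
    # A marker missing from `carriers` is treated as carried by no genome.
    return Counter(len(carriers.get(marker, set()) & test_set)
                   for marker in markers)


# Toy data (assumed): four markers checked against three test genomes.
test_genomes = ["t1", "t2", "t3"]
markers = ["m1", "m2", "m3", "m4"]
carriers = {
    "m1": {"t1", "t2", "t3"},
    "m2": {"t1", "t2", "t3"},
    "m3": {"t1", "t3"},
}
tally = carriage_tally(markers, carriers, test_genomes)
for n_genomes in sorted(tally, reverse=True):
    print(f"{tally[n_genomes]} marker(s) found in {n_genomes} of {len(test_genomes)} genomes")
```

Each key of the resulting counter corresponds to a vertical position in a plot like Figure 7, and each value to a circle size.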
