Wellcome Open Res. 2021 Feb 1;5:223.
doi: 10.12688/wellcomeopenres.16291.2. eCollection 2020.

Genomic diversity of Salmonella enterica - The UoWUCC 10K genomes project

Mark Achtman et al. Wellcome Open Res.

Abstract

Background: Most publicly available genomes of Salmonella enterica are from human disease in the US and the UK, or from domesticated animals in the US.

Methods: Here we describe a historical collection of 10,000 strains isolated between 1891 and 2010 in 73 different countries. They encompass a broad range of sources, ranging from rivers through reptiles to the diversity of all S. enterica isolated on the island of Ireland between 2000 and 2005. Genomic DNA was isolated and sequenced by Illumina short read sequencing.

Results: The short reads are publicly available in the Short Reads Archive. They were also uploaded to EnteroBase, which assembled and annotated draft genomes. The 9769 draft genomes that passed quality control were genotyped with multiple levels of multilocus sequence typing and used to predict serovars. Genomes were assigned to hierarchical clusters on the basis of numbers of pair-wise allelic differences in core genes, and these clusters were mapped to genetic Lineages within phylogenetic trees.

Conclusions: The University of Warwick/University College Cork (UoWUCC) project greatly extends the geographic sources, dates and core genomic diversity of publicly available S. enterica genomes. We illustrate these features by an overview of core genomic Lineages within 33,000 publicly available Salmonella genomes whose strains were isolated before 2011. We also present detailed examinations of HC400, HC900 and HC2000 hierarchical clusters within exemplar Lineages, including serovars Typhimurium, Enteritidis and Mbandaka. These analyses confirm the polyphyletic nature of multiple serovars while showing that discrete clusters with geographical specificity can be reliably recognized by hierarchical clustering approaches. The results also demonstrate that the genomes sequenced here provide an important counterbalance to the sampling bias that dominates current genomic sequencing.
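The assignment of genomes to hierarchical clusters from pair-wise allelic differences can be illustrated with a small sketch. The abstract does not state the linkage rule, so single linkage is an assumption here, as is the reading that a label such as HC900 refers to a threshold of 900 allelic differences. The function names and the five-locus toy profiles are hypothetical and are not taken from EnteroBase.

```python
from itertools import combinations


def allelic_distance(profile_a, profile_b):
    """Count loci at which two allelic profiles differ.

    Loci with a missing call (None) in either profile are skipped.
    """
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        if a is not None and b is not None and a != b
    )


def single_linkage_clusters(profiles, threshold):
    """Group genomes joined by chains of links of at most `threshold` differing alleles.

    Assumption: single linkage, implemented with a union-find structure.
    """
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Walk to the root, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if allelic_distance(profiles[a], profiles[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical five-locus profiles (real core genome schemes have thousands of loci).
toy_profiles = {
    "g1": [1, 1, 1, 1, 1],
    "g2": [1, 1, 1, 1, 2],
    "g3": [1, 1, 1, 2, 2],
    "g4": [5, 5, 5, 5, 5],
}

for threshold in (0, 1, 5):
    print(threshold, single_linkage_clusters(toy_profiles, threshold))
```

Raising the threshold merges clusters, which is why one genome can belong to nested clusters at several levels at once.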

Keywords: High throughput sequencing; Large scale genomic database; Population genomics; Salmonella.

Conflict of interest statement

No competing interests were disclosed.

Figures

Figure 1. Sources of bacterial isolates for the 10K UoWUCC Salmonella Genomes Project.
A) Semi-logarithmic histogram of numbers of genomes in EnteroBase by year of isolation. Genomes from the 10K project with known dates of isolation are shown in blue and other Salmonella genomes in yellow. Inset: Genomes which were isolated between 1990 and 2010. B) Geographic distribution of sources of isolation. Circle sizes are proportional to numbers of strains, as indicated in the key at the lower right. Inset: Expanded map of the region near the English Channel.
Figure 2. Quality control of 10K genomes.
Default EnteroBase criteria are indicated by vertical dashed lines. Numbers of genomes in the 10K project which passed these cut-off criteria are indicated in blue and failures in yellow, with the total numbers of failures shown in yellow near the top of each panel. The quality criteria consisted of N50 ≥20,000, genomic assembly size between 4 Mb and 5.8 Mb, a maximum of 600 contigs and a low fraction of uncalled, low quality bases (N’s).
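The quality criteria in this caption can be written as a simple check. This is a minimal sketch and not EnteroBase code: the function names are hypothetical, the caption gives no number for the "low fraction" of N's, so that limit is left as a required argument, and the size bounds are assumed to be inclusive.

```python
def n50(contig_lengths):
    """Return the N50: the length of the contig at which the running total of
    lengths, taken from longest to shortest, first reaches half the assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def passes_qc(contig_lengths, n_bases, max_n_fraction):
    """Apply the four criteria from the caption to one assembly.

    contig_lengths: contig lengths in bases.
    n_bases: number of uncalled bases (N's) in the assembly.
    max_n_fraction: caller-supplied limit; the caption gives no value.
    """
    assembly_size = sum(contig_lengths)
    if assembly_size == 0:
        return False
    return (
        n50(contig_lengths) >= 20_000
        and 4_000_000 <= assembly_size <= 5_800_000  # inclusive bounds assumed
        and len(contig_lengths) <= 600
        and n_bases / assembly_size <= max_n_fraction
    )


# Hypothetical assemblies, for illustration only.
good = [3_000_000, 1_500_000, 500_000]
fragmented = [10_000] * 450
print(passes_qc(good, n_bases=1_000, max_n_fraction=0.01))
print(passes_qc(fragmented, n_bases=0, max_n_fraction=0.01))
```

The second assembly has an acceptable total size and contig count but fails on N50, showing that each criterion is applied independently.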
Figure 3. Genomic diversity of 33,052 pre-2011 genomes in EnteroBase, including 9206 from the 10K genome project (red perimeters).
The figure shows a NINJA NJ (Wheeler, 2009) tree of the numbers of different alleles between cgSTs as generated within EnteroBase using GrapeTree (Zhou et al., 2018a). Nodes from 41 common HC900 clusters are indicated by distinct colors, HC900 designations and predominant serovars. Lineages of HC900 clusters are indicated in yellow. The Enteritidis and Typhimurium Lineages are explored in greater detail in Figure 4 and the Mbandaka Lineage in Figure 5. Node sizes are proportional to the numbers of genomes they include. Nodes that include genomes from the 10K genomes project are highlighted by a red perimeter. An interactive version can be found at http://enterobase.warwick.ac.uk/a/46053, in which the user can use other metadata for coloring genomes. Scale bar: 300 alleles.
Figure 4.
Detailed representations of HC2000 and HC900 clusters in the Typhimurium Lineage (A) and the Enteritidis Lineage (B). Each consists of a NINJA NJ tree of the subset of nodes encompassed by the corresponding Lineage from the tree in Figure 3. HC2000 clusters are indicated in larger font and gray shading. Designations for individual HC900 clusters and their predominant serovar include, in parentheses, the total number of isolates (black) and the number from the 10K genomes project (red). In part B, Clade A and C designations from the literature (Graham et al., 2018; Luo et al., 2020) are indicated for HC900_3589 and HC2000_1570, respectively. Interactive versions can be found at http://enterobase.warwick.ac.uk/a/46227 (A) and http://enterobase.warwick.ac.uk/a/46226 (B), in which the user can use other metadata for coloring genomes. Black arrowheads: tree root. Scale bar: 200 alleles.
Figure 5. Genomic diversity of 601 pre-2011 genomes from HC100_4 of which 208 were from the 10K genomes project (red perimeters).
The figure shows a NINJA NJ (Wheeler, 2009) tree of the numbers of different alleles between cgSTs as generated within EnteroBase using GrapeTree (Zhou et al., 2018a). The geographical sources of some of the isolates from the 10K genomes project are indicated to demonstrate that multiple micro-clades were present in individual countries. An interactive version can be found at http://enterobase.warwick.ac.uk/a/46139, in which the user can use other metadata for coloring genomes. The same tree colored by general source can be found in Figure 6 and a tree showing all modern Mbandaka and Lubbock genomes can be found in Figure 7. Scale bar: 10 alleles. Color Key at right.
Figure 6. As Figure 5, except that the nodes are colored by general source.
Figure 7. Genomic diversity of 2955 genomes from HC100_4 from EnteroBase (18/08/2020) of which 208 were from the 10K genomes project (red perimeters).
The figure shows a NINJA NJ (Wheeler, 2009) tree of the numbers of different alleles between cgSTs as generated within EnteroBase using GrapeTree (Zhou et al., 2018a). The geographical sources of all isolates are color-coded (Key at lower left) and the location of serovar Lubbock is shaded. Unshaded isolates are serovar Mbandaka. An interactive version can be found at http://enterobase.warwick.ac.uk/a/46122, in which the user can use other metadata for coloring genomes. Scale bar: 10 alleles.
