Genome Res. 2020 Jan;30(1):138-152.
doi: 10.1101/gr.251678.119. Epub 2019 Dec 6.

The EnteroBase user's guide, with case studies on Salmonella transmissions, Yersinia pestis phylogeny, and Escherichia core genomic diversity

Zhemin Zhou et al. Genome Res. 2020 Jan.

Abstract

EnteroBase is an integrated software environment that supports the identification of global population structures within several bacterial genera that include pathogens. Here, we provide an overview of how EnteroBase works, what it can do, and its future prospects. EnteroBase has currently assembled more than 300,000 genomes from Illumina short reads from Salmonella, Escherichia, Yersinia, Clostridioides, Helicobacter, Vibrio, and Moraxella and genotyped those assemblies by core genome multilocus sequence typing (cgMLST). Hierarchical clustering of cgMLST sequence types allows mapping a new bacterial strain to predefined population structures at multiple levels of resolution within a few hours after uploading its short reads. Case Study 1 illustrates this process for local transmissions of Salmonella enterica serovar Agama between neighboring social groups of badgers and humans. EnteroBase also supports single nucleotide polymorphism (SNP) calls from both genomic assemblies and after extraction from metagenomic sequences, as illustrated by Case Study 2 which summarizes the microevolution of Yersinia pestis over the last 5000 years of pandemic plague. EnteroBase can also provide a global overview of the genomic diversity within an entire genus, as illustrated by Case Study 3, which presents a novel, global overview of the population structure of all of the species, subspecies, and clades within Escherichia.

Figures

Figure 1.
Overview of EnteroBase Features. (A) Data uploads. Data are imported from public databases, user uploads, and existing legacy MLST and rMLST databases at PubMLST (https://pubmlst.org/). (B) Spreadsheet Interface. The browser-based interface visualizes sets of strains (one Uberstrain plus any number of substrains) each containing metadata, and their associated experimental data and custom views. Post-release data can be exported (downloaded) as genome assemblies or tab-delimited text files containing metadata and experimental data. Metadata can be imported to entries for which the user has editing rights by uploading tab-delimited text files. (C) Search Strains supports flexible (AND/OR) combinations of metadata and experimental data for identifying entries to load into the spreadsheet. Find ST(s) retrieves STs that differ from a given ST by no more than a maximal number of differing alleles. Locus Search uses BLASTN (Altschul et al. 1990) and UBlastP in USEARCH (Edgar 2010) to identify the MLST locus designations corresponding to an input sequence. Get at this level: menu item after right clicking on experimental MLST ST or cluster numbers. (D) UserSpace OS. A file explorer–like interface for manipulations of workspaces, trees, SNP projects, and custom views. These objects are initially private to their creator but can be shared with buddies or rendered globally accessible. (E) Processes and analyses. EnteroBase uses EToKi and external programs as described in Supplemental Figure S1. (F) Visualization. MLST trees are visualized with the EnteroBase tools GrapeTree (Zhou et al. 2018a) and Dendrogram, which in turn can transfer data to external websites such as Microreact (Argimón et al. 2016).
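
The Find ST(s) query in panel C can be pictured as a filter on allelic distance. The following is a minimal sketch and not EnteroBase's implementation or API: the function names, the dictionary layout of an allelic profile, and the convention that 0 marks a missing allele are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical layout: locus name -> allele number, with 0 meaning "missing".
Profile = Dict[str, int]


def allelic_distance(a: Profile, b: Profile) -> int:
    """Count loci called in both profiles whose allele numbers differ."""
    shared = a.keys() & b.keys()
    return sum(1 for locus in shared if a[locus] and b[locus] and a[locus] != b[locus])


def find_sts(query: Profile, profiles: Dict[int, Profile], max_diff: int) -> List[int]:
    """Return the ST numbers whose profiles differ from the query by at most max_diff alleles."""
    return sorted(
        st for st, profile in profiles.items()
        if allelic_distance(query, profile) <= max_diff
    )
```

Under these assumptions, calling `find_sts(profiles[1], profiles, 1)` would return ST 1 itself together with every ST that differs from it at no more than one called locus.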
Figure 2.
The hierarchical cgMLST clustering (HierCC) scheme in EnteroBase. (A) A screenshot of Salmonella cgMLST V2 plus HierCC V1 data for five randomly selected genomes. The numbers in the columns are the HierCC cluster numbers. Cluster numbers are the smallest cgMLST ST number in single-linkage clusters of pairs of STs that are joined by up to the specified maximum number of allelic differences. These maximum differences are indicated by the suffix of each HC column, starting with HC0 for 0 cgMLST allelic differences, other than missing data, through to HC2850 for 2850 allelic differences. The cluster assignments are greedy because individual nodes which are equidistant from multiple clusters are assigned to the cluster with the smallest cluster number. (B) Interpretation of HierCC numbers. The assignments of genomic cgMLST STs to HC levels can be used to assess their genomic relatedness. The top two genomes are both assigned to HC10_306, which indicates a very close relationship, and may represent a transmission chain. The top three genomes are all assigned to HC900_2, which corresponds to a legacy MLST eBG. HC2000 marks superlineages (Zhou et al. 2018c), and HC2850 marks subspecies. This figure illustrates these interpretations in the form of a cladogram drawn by hand.
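
The numbering rule in panel A (each cluster is named after the smallest ST number in a single-linkage cluster at a given maximum number of allelic differences) can be sketched in a few lines. This is a batch single-linkage illustration using union-find, not the HierCC code itself, which assigns new genomes incrementally and greedily. The profile layout, the use of 0 for missing alleles, and all function names are assumptions.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

Profile = Dict[str, int]  # hypothetical: locus name -> allele number, 0 = missing


def allelic_distance(a: Profile, b: Profile) -> int:
    """Count differing alleles, ignoring loci that are missing in either profile."""
    return sum(1 for k in a.keys() & b.keys() if a[k] and b[k] and a[k] != b[k])


def single_linkage_labels(sts: List[int],
                          dist: Dict[Tuple[int, int], int],
                          max_diff: int) -> Dict[int, int]:
    """Label each ST with the smallest ST number in its single-linkage cluster."""
    parent = {st: st for st in sts}

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), d in dist.items():
        if d <= max_diff:
            ri, rj = find(i), find(j)
            if ri != rj:
                # The smaller ST number always becomes the root,
                # so the root doubles as the cluster number.
                parent[max(ri, rj)] = min(ri, rj)
    return {st: find(st) for st in sts}


def hiercc_like(profiles: Dict[int, Profile],
                levels: Iterable[int]) -> Dict[int, Dict[str, int]]:
    """Return, for every ST, its cluster number at each level (e.g. 'HC10')."""
    sts = sorted(profiles)
    dist = {(i, j): allelic_distance(profiles[i], profiles[j])
            for i, j in combinations(sts, 2)}
    table: Dict[int, Dict[str, int]] = {st: {} for st in sts}
    for level in levels:
        labels = single_linkage_labels(sts, dist, level)
        for st in sts:
            table[st][f"HC{level}"] = labels[st]
    return table
```

Two STs that share a cluster number in a low-level column of the resulting table are linked by a chain of small allelic differences, which matches the reading of HC10_306 in panel B.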
Figure 3.
Serovar versus HierCC clustering in serovar Agama. GrapeTree (Zhou et al. 2018a) depiction of a RapidNJ tree (Simonsen et al. 2011) of cgMLST allelic distances between genomic entries whose metadata Serovar field contained Agama or SISTR1 (Robertson et al. 2018) Serovar predictions contained Agama. (A) Color coding by Predicted Serovar (SISTR1). Arrows indicate isolates whose serovar was not predicted. Orange shading emphasizes 1,4,[5],12:i:- isolates that were monophasic Agama. Gray shading indicates isolates with incorrect Serovar metadata, including 1,4,[5],12:i:- isolates that were monophasic Typhimurium (arrow). (B) Color coding by HC2000 cluster. All Agama entries are HC2000_299, as were the genetically related entries marked with arrows or emphasized by orange shading. Entries from other serovars (gray shading) were in other diverse HC2000 clusters. The dashed box indicates a subset of Agama strains within HC400_299, including all isolates from badgers, which were chosen for deeper analyses in Figure 4. (Scale bar) Number of cgMLST allelic differences.
Figure 4.
Effects of sample bias on inferred transmission chains within HC400_299 Agama isolates. (A, left) Map of hosts in the British Isles of 149 Agama isolates in EnteroBase in August, 2018. (Right) Maximum-likelihood radial phylogeny (http://enterobase.warwick.ac.uk/a/21773/d) based on RAxML (Stamatakis 2014) of 8791 nonrepetitive core SNPs as calculated by EnteroBase Dendrogram against reference genome 283179. Color coding is according to a user-defined field (Location & Source). HC100 cluster designations for three microclades are indicated. HC100_2433 contained all Agama from badgers. (B, right) Summary of hosts and countries from which 64 additional Agama isolates had been sequenced by March 2019. (Left) Maximum-likelihood radial dendrogram (http://enterobase.warwick.ac.uk/a/23882/d) based on 9701 SNPs from 213 isolates. Multiple isolates of Agama in HC100_2433 were now from humans and food in France and Austria. HC100_299 and HC100_67355 now contained multiple isolates from badgers, livestock, companion animals, and mussels, demonstrating that the prior strong association of Agama with humans and badgers in A reflected sample bias. Stars indicate multiple MRCAs of Agama in English badgers, whereas the pink arrow indicates a potential transmission from badgers to a human in Bath/North East Somerset, which is close to Woodchester Park. The green arrow indicates a potential food-borne transmission chain consisting of four closely related Agama isolates in HC5_140035 from Austria (chives × 2; human blood culture × 1) and France (human × 1) that were isolated in 2018. The geographical locations of the badger isolates are shown in Supplemental Figure S5.
Figure 5.
Maximum-likelihood tree of modern and ancient genomes of Y. pestis. EnteroBase contained 1368 ancient and modern Y. pestis genomes in October 2019, of which several hundred genomes that had been isolated in Madagascar and Brazil over short time periods showed very low levels of genomic diversity. To reduce this sample bias, the data set used for analysis included only one random representative from each HC0 group from those two countries, leaving a total of 622 modern Y. pestis genomes. Fifty-six ancient genomes of Y. pestis from existing publications were assembled with EToKi (Methods), resulting in a total of 678 Y. pestis genomes plus Yersinia pseudotuberculosis IP32953 as an outgroup (http://enterobase.warwick.ac.uk/a/21975). The EnteroBase pipelines (Supplemental Fig. S2D) were used to create a SNP project in which all genomes were aligned against CO92 (2001) using LASTAL. The SNP project identified 23,134 nonrepetitive SNPs plus 7534 short inserts/deletions over 3.8 Mbps of core genomic sites which had been called in ≥95% of the genomes. In this figure, nodes are color coded by population designations for Y. pestis according to published sources (Morelli et al. 2010; Cui et al. 2013; Achtman 2016), except for 0.PE8 which was assigned to a genome from 1918 to 1754 BCE (Spyrou et al. 2018). The designation 0.ANT4 was applied by Achtman (2016) to Y. pestis from the Justinianic plague described by Wagner et al. (2014), and that designation was also used for a genome associated with the Justinianic plague (DA101) that was later described by Damgaard et al. (2018) as 0.PE5.
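
The thinning step described in this caption (one random representative per HC0 group for the two oversampled countries) might look as follows. This is a sketch under stated assumptions: the record fields `country` and `hc0`, the function name, and the choice to group by HC0 number alone are hypothetical and are not taken from the EnteroBase pipelines.

```python
import random
from typing import Dict, List, Optional, Set


def thin_by_hc0(records: List[dict],
                countries: Set[str],
                seed: Optional[int] = None) -> List[dict]:
    """Keep one random genome per HC0 group for the listed countries; keep all others."""
    rng = random.Random(seed)
    kept: List[dict] = []
    groups: Dict[int, List[dict]] = {}
    for rec in records:
        if rec["country"] in countries:
            # Assumed grouping key: the HC0 cluster number only.
            groups.setdefault(rec["hc0"], []).append(rec)
        else:
            kept.append(rec)
    for members in groups.values():
        kept.append(rng.choice(members))
    return kept
```

Passing a fixed `seed` makes the random choice repeatable, which helps when the same reduced data set has to be regenerated later.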
Figure 6.
Neighbor-joining (RapidNJ) tree of core genome allelic distances in the EcoRPlus Collection of 9479 genomes. EcoRPlus includes the draft genome with the greatest N50 value from each of the 9479 rSTs among 52,876 genomes of Escherichia within EnteroBase (August 2018) (http://enterobase.warwick.ac.uk/a/15931). The nodes in this tree are color coded by HC1100 clusters, as indicated in the key at the bottom left. Common HC1100 clusters (plus the corresponding ST Complexes) are indicated at the circumference of the tree. These are largely congruent, except that HC1100_13 corresponds to ST10 Complex plus ST168 Complex, and other discrepancies exist among the smaller, unlabeled populations. See Supplemental Figures S7, S8, respectively, for color coding by ST Complex and Clermont typing. An interactive version in which the nodes can be freely color coded by all available metadata is available at http://enterobase.warwick.ac.uk/a/15981. A maximum-likelihood tree based on SNP differences can be found in Supplemental Figure S9.
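
The selection rule behind EcoRPlus (for each rST, keep the draft genome with the greatest N50) is a simple group-wise maximum. The sketch below assumes hypothetical record fields `rst` and `n50` and an illustrative function name. How ties were broken in the actual collection is not stated in the caption, so this version keeps the first genome encountered.

```python
from typing import Dict, List


def best_assembly_per_rst(genomes: List[dict]) -> Dict[int, dict]:
    """Map each rST to the genome record with the greatest N50 value."""
    best: Dict[int, dict] = {}
    for genome in genomes:
        current = best.get(genome["rst"])
        # Strict '>' keeps the first record seen when N50 values tie (an assumption).
        if current is None or genome["n50"] > current["n50"]:
            best[genome["rst"]] = genome
    return best
```

Applied to all Escherichia records, the result would hold one entry per rST, which is how 52,876 genomes reduce to 9479 representatives.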

References

    1. Achtman M. 2016. How old are bacterial pathogens? Proc Biol Sci 283: 20160990. doi:10.1098/rspb.2016.0990
    2. Achtman M, Zhou Z. 2014. Distinct genealogies for plasmids and chromosome. PLoS Genet 10: e1004874. doi:10.1371/journal.pgen.1004874
    3. Achtman M, Zhou Z. 2019. Analysis of the human oral microbiome from modern and historical samples with SPARSE and EToKi. bioRxiv. doi:10.1101/842542
    4. Achtman M, Wain J, Weill FX, Nair S, Zhou Z, Sangal V, Krauland MG, Hale JL, Harbottle H, Uesbeck A, et al. 2012. Multilocus sequence typing as a replacement for serotyping in Salmonella enterica. PLoS Pathog 8: e1002776. doi:10.1371/journal.ppat.1002776
    5. Ahlstrom CA, Bonnedahl J, Woksepp H, Hernandez J, Olsen B, Ramey AM. 2018. Acquisition and dissemination of cephalosporin-resistant E. coli in migratory birds sampled at an Alaska landfill as inferred through genomic analysis. Sci Rep 8: 7361. doi:10.1038/s41598-018-25474-w
