Microb Genom. 2021 Sep;7(9):000651.
doi: 10.1099/mgen.0.000651.

Rapid and accurate SNP genotyping of clonal bacterial pathogens with BioHansel

Geneviève Labbé et al. Microb Genom. 2021 Sep.

Abstract

Hierarchical genotyping approaches can provide insights into the source, geography and temporal distribution of bacterial pathogens. Multiple hierarchical SNP genotyping schemes have previously been developed so that new isolates can rapidly be placed within pre-computed population structures, without the need to rebuild phylogenetic trees for the entire dataset. This classification approach has, however, seen limited uptake in routine public health settings due to analytical complexity and the lack of standardized tools that provide clear and easy ways to interpret results. The BioHansel tool was developed to provide an organism-agnostic tool for hierarchical SNP-based genotyping. The tool identifies split k-mers that distinguish predefined lineages in whole genome sequencing (WGS) data using SNP-based genotyping schemes. BioHansel uses the Aho-Corasick algorithm to type isolates from assembled genomes or raw read sequence data in a matter of seconds, with limited computational resources. This makes BioHansel ideal for use by public health agencies that rely on WGS methods for surveillance of bacterial pathogens. Genotyping results are evaluated using a quality assurance module which identifies problematic samples, such as low-quality or contaminated datasets. Using existing hierarchical SNP schemes for Mycobacterium tuberculosis and Salmonella Typhi, we compare the genotyping results obtained with the k-mer-based tools BioHansel and SKA, with those of the organism-specific tools TBProfiler and genotyphi, which use gold-standard reference-mapping approaches. We show that the genotyping results are fully concordant across these different methods, and that the k-mer-based tools are significantly faster. We also test the ability of the BioHansel quality assurance module to detect intra-lineage contamination and demonstrate that it is effective, even in populations with low genetic diversity. We demonstrate the scalability of the tool using a dataset of ~8100 S. Typhi public genomes and provide the aggregated results of geographical distributions as part of the tool's output. BioHansel is an open source Python 3 application available on PyPI and Conda repositories and as a Galaxy tool from the public Galaxy Toolshed. In a public health context, BioHansel enables rapid and high-resolution classification of bacterial pathogens with low genetic diversity.
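
To illustrate the idea of looking up scheme k-mers in sequence data, here is a minimal sketch. It is not BioHansel's implementation: the abstract states that BioHansel uses the Aho-Corasick algorithm, whereas this sketch swaps in a plain sliding window with a dictionary lookup and assumes all scheme k-mers have the same length. The function names, the 7-base k-mers and the scheme layout are hypothetical.

```python
def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def find_kmers(sequence, scheme):
    """Report every scheme k-mer found on either strand of `sequence`.

    `scheme` maps k-mer -> (position, genotype, is_positive); this layout
    is an assumption made for illustration only.
    """
    k = len(next(iter(scheme)))  # assumes all k-mers share one length
    hits = []
    for strand in (sequence, revcomp(sequence)):
        for i in range(len(strand) - k + 1):
            window = strand[i:i + k]
            if window in scheme:
                hits.append(scheme[window])
    return hits


# Hypothetical split k-mer pair for one variant position.
scheme = {
    "GATTACA": (50, "1", True),   # positive k-mer
    "GATCACA": (50, "1", False),  # negative k-mer
}
hits = find_kmers("CCGATTACATT", scheme)
```

Searching both strands matters because reads and contigs can come from either orientation of the genome.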

Keywords: SNP; bacterial typing; contamination detection; genotyping; k-mer; software.

Conflict of interest statement

The authors declare that there are no conflicts of interest.

Figures

Fig. 1.
Phylogenetic representation of a BioHansel-compatible hierarchical SNP genotyping scheme based on genome-wide variant positions. Samples A and B belong to the same parent genotype 1 so they contain the same defining SNP at position 50. The other genotyping SNPs are exclusive to their corresponding type. Genotyping split k-mers for BioHansel are derived by extracting the sequence from the reference sequence around the variant position. The positive k-mer is used to define the presence of a genotype level and should only be found in members of a genotype. The negative k-mer would be present in members which are not part of the genotype. A BioHansel scheme uses the genome position of the variant as the unique ID for the split k-mer pair combined with the pair’s corresponding genotype.
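
The derivation of a split k-mer pair described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the tool's own code: the function name, the 1-based position convention, the flank length of 3 and the toy reference are all hypothetical, and the caption does not say which allele the reference carries, so both bases are passed in explicitly.

```python
def split_kmer_pair(reference, position, positive_base, negative_base, flank=3):
    """Build (positive, negative) k-mers centred on a variant position.

    `position` is 1-based. The flanking sequence is copied from the
    reference; only the middle base differs between the two k-mers.
    """
    i = position - 1
    if i - flank < 0 or i + flank >= len(reference):
        raise ValueError("variant too close to the end of the reference")
    left = reference[i - flank:i]
    right = reference[i + 1:i + 1 + flank]
    return left + positive_base + right, left + negative_base + right


# Toy reference and variant, invented for illustration.
positive, negative = split_kmer_pair("AACCGGTTAACC", 6, "A", "G")
```

Here `positive` is `CCGATTA` and `negative` is `CCGGTTA`: the two k-mers share their flanks and differ only at the central variant base.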
Fig. 2.
BioHansel genotyping workflow. Query sequence data in fasta or fastq format are provided to the tool with a corresponding scheme and optional scheme metadata table. The scheme provides a set of k-mers and places them in a hierarchy. BioHansel searches for the specified k-mers in the query data. Fastq data are filtered based on coverage to remove low-abundance k-mers. BioHansel examines all the identified positive k-mers to find the most resolved genotype, the deepest in the hierarchy, which will be the overall genotyping call. The genotyping results are then evaluated through the QA/QC module to determine if a sample has an adequate number of scheme k-mers to consider the result as reliable. The identified k-mers are examined for consistency with the scheme hierarchy. In the current example, where the dataset possesses the DNA bases A, G, T, G and A in the five target positions defined in the genotyping scheme (see also Fig. 1), a genotype designation of 1.1 is consistent with the hierarchy, since the positive k-mers for both genotype 1 and genotype 1.1 would be present. A sample would be inconsistent if any of the parent genotype k-mers were missing or if the negative version of the k-mer were present. The QA/QC module in BioHansel can also identify intra-strain contamination by looking for the presence of both positive and negative versions of the same k-mer, or the presence of positive k-mers from other genotypes. In the current example, if the positive k-mer for genotype 2 was also identified, it would indicate a contaminated sample.
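
The calling and consistency logic in this caption can be sketched roughly as below. This is a simplified illustration, not BioHansel's QA/QC module: it assumes one k-mer pair per genotype, dot-separated genotype names whose prefixes are the parent genotypes, and hypothetical function names and messages. Coverage filtering and the check for an adequate number of scheme k-mers are left out.

```python
def parents(genotype):
    """Return the ancestors of a dotted genotype, e.g. '1.1.2' -> ['1', '1.1']."""
    parts = genotype.split(".")
    return [".".join(parts[:n]) for n in range(1, len(parts))]


def call_genotype(positive, negative):
    """Pick the deepest genotype with a positive k-mer and list any problems.

    `positive` / `negative` are sets of genotypes whose positive / negative
    k-mer was observed in the sample (an assumed, simplified input format).
    """
    if not positive:
        return None, ["no positive k-mers found"]
    # Deepest level wins; ties are broken alphabetically in this sketch.
    call = max(positive, key=lambda g: (g.count("."), g))
    lineage = set(parents(call)) | {call}
    problems = []
    for g in sorted(lineage):
        if g in positive and g in negative:
            problems.append(f"{g}: positive and negative k-mers both present (possible contamination)")
        elif g not in positive:
            problems.append(f"{g}: parent positive k-mer missing (inconsistent)")
    for g in sorted(positive - lineage):
        problems.append(f"{g}: unexpected positive k-mer (possible contamination)")
    return call, problems


# Mirrors the caption's example: positive k-mers for genotypes 1 and 1.1.
clean = call_genotype({"1", "1.1"}, {"1.2", "2"})
# Same sample with the positive k-mer for genotype 2 also present.
mixed = call_genotype({"1", "1.1", "2"}, {"1.2"})
```

In this sketch `clean` yields the call `1.1` with no problems, while `mixed` yields the same call with one contamination warning for genotype 2.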
Fig. 3.
Maximum-likelihood phylogenetic trees of benchmarking isolates representing the diversity of genotypes defined in the MTB (a) and Typhi (b) schemes. Each isolate is labelled with its genotype and a colour representing the first level of the corresponding scheme. Bars, approximately 120 SNPs (a) and 38 SNPs (b), indicating that the genetic diversity of the S. Typhi population represented by these genotypes is approx. 3× lower than that of MTB.
Fig. 4.
Boxplots comparing the runtime (a) and peak memory usage (b) of four tools, BioHansel, genotyphi, SKA and TBProfiler, on synthetic Illumina fastq data with a fixed coverage of 50×. BioHansel and SKA results are based on datasets representing both MTB and Typhi schemes (N=129) while genotyphi and TBProfiler results are only based on datasets representing either the Typhi (N=67) or the MTB (N=62) scheme, respectively.
Fig. 5.
Bar plot of contamination detection of BioHansel and ConFindr using datasets with different levels of contamination (1–25× coverage depth of contaminant genotype) in a fixed level of 50× Illumina genome coverage depth. Results are aggregated for 852 MTB and 168 Typhi pair-wise combinations where both ConFindr and BioHansel could detect contamination at the 25× coverage depth (50% contamination).
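
The relationship between contaminant depth and contamination percentage in this caption is simple arithmetic, sketched below. The function name is hypothetical, and the sketch assumes, as the caption's "25× ... (50% contamination)" implies, that the contaminant depth is counted within the fixed 50× total.

```python
def contamination_percent(contaminant_depth, total_depth=50.0):
    """Share of the total coverage depth contributed by the contaminant."""
    if not 0 <= contaminant_depth <= total_depth:
        raise ValueError("contaminant depth must lie between 0 and the total depth")
    return 100.0 * contaminant_depth / total_depth


# The two ends of the range given in the caption (1x and 25x of 50x).
lowest = contamination_percent(1)    # 2.0
highest = contamination_percent(25)  # 50.0
```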
Fig. 6.
Balloon plot of 7943 global S. Typhi isolates showing the associations between genotypes and geography at the level of continents. The size of a point indicates the number of samples and the colour of a point indicates the number of discrete countries contained within it.
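
The two quantities encoded in the balloon plot (sample count and number of distinct countries per genotype and continent) can be computed with an aggregation like the sketch below. The record layout, function name and example rows are invented for illustration and are not taken from the study's data or from BioHansel's output format.

```python
from collections import defaultdict


def aggregate(records):
    """Summarise (genotype, continent, country) records.

    Returns {(genotype, continent): (n_samples, n_distinct_countries)}.
    """
    counts = defaultdict(int)
    countries = defaultdict(set)
    for genotype, continent, country in records:
        key = (genotype, continent)
        counts[key] += 1
        countries[key].add(country)
    return {key: (counts[key], len(countries[key])) for key in counts}


# Made-up records with placeholder names, not real isolates.
records = [
    ("genotype_x", "continent_1", "country_a"),
    ("genotype_x", "continent_1", "country_a"),
    ("genotype_x", "continent_1", "country_b"),
    ("genotype_y", "continent_2", "country_c"),
]
summary = aggregate(records)
```

Each entry of `summary` corresponds to one balloon: the first number would set its size and the second its colour.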
