BMC Bioinformatics. 2018 Jun 11;19(1):222.
doi: 10.1186/s12859-018-2225-z.

Variant site strain typer (VaST): efficient strain typing using a minimal number of variant genomic sites

Tara N Furstenau et al. BMC Bioinformatics. 2018.

Abstract

Background: Targeted PCR amplicon sequencing (TAS) techniques provide a sensitive, scalable, and cost-effective way to query and identify closely related bacterial species and strains. Typically, this is accomplished by targeting housekeeping genes that provide resolution down to the family, genus, and sometimes species level. Unfortunately, this level of resolution is not sufficient in many applications where strain-level identification of bacteria is required (biodefense, forensics, clinical diagnostics, and outbreak investigations). Adding more genomic targets will increase the resolution, but the challenge is identifying the appropriate targets. VaST was developed to address this challenge by finding the minimum number of targets that, in combination, achieve maximum strain-level resolution for any strain complex. The final combination of target regions identified by the algorithm produces a unique haplotype for each strain, which can be used as a fingerprint for identifying unknown samples in a TAS assay. VaST ensures that the targets have conserved primer regions so that they can be amplified in all of the known strains. It also favors targets with basal variants, which makes the set more robust when identifying previously unseen strains.
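The idea of a small set of sites that jointly give every strain a unique haplotype can be illustrated with a greedy selection. This is only a minimal sketch under stated assumptions, not VaST's actual implementation: the function names, the toy matrix, and the choice to score a candidate site by the number of haplotype groups it produces are all hypothetical.

```python
def partition(strains, sites, chosen):
    """Group strains by their haplotype across the chosen sites."""
    groups = {}
    for name in strains:
        hap = tuple(sites[s][name] for s in chosen)
        groups.setdefault(hap, []).append(name)
    return list(groups.values())


def greedy_spanning_set(strains, sites):
    """Add one site at a time, always taking the site that yields the most
    distinct haplotype groups; stop when no site improves resolution."""
    chosen = []
    n_groups = 1
    while n_groups < len(strains):
        best_site, best_count = None, n_groups
        for s in sites:
            if s in chosen:
                continue
            count = len(partition(strains, sites, chosen + [s]))
            if count > best_count:
                best_site, best_count = s, count
        if best_site is None:  # remaining strains cannot be told apart
            break
        chosen.append(best_site)
        n_groups = best_count
    return chosen


# Hypothetical toy matrix: site position -> {strain: base call}
strains = ["A", "B", "C", "D"]
sites = {
    115: {"A": "A", "B": "A", "C": "G", "D": "G"},
    120: {"A": "C", "B": "C", "C": "C", "D": "T"},
    121: {"A": "T", "B": "G", "C": "T", "D": "G"},
}
panel = greedy_spanning_set(strains, sites)
haplotypes = {n: "".join(sites[s][n] for s in panel) for n in strains}
```

In this toy case two sites are enough to give each of the four strains its own two-base fingerprint, and the unbalanced site 120 is never selected.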

Results: We analyzed VaST's performance using a number of pathogenic species that are relevant to human disease outbreaks and biodefense. The number of targets required to achieve full resolution was 20 to 88% lower than the number required in the worst case, and most of the resolution was achieved within the first 20 targets. We computationally and experimentally validated one of the VaST panels and found that the targets led to accurate phylogenetic placement of strains, even when the strains were not part of the original panel design.
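For orientation, the worst case referred to here can be written next to the best-case bounds given in the Fig. 1 caption. The symbol m (panel size) and the reading of "percent fewer" as a reduction relative to N−1 are assumptions made for this sketch, and the numbers in the worked line are hypothetical rather than taken from the paper.

```latex
% N = number of strains to resolve, m = number of SNP sites in the panel (m is an assumed symbol)
\[
  \lceil \log_4 N \rceil \;\le\; m \;\le\; N - 1 ,
  \qquad
  m \;\ge\; \lceil \log_2 N \rceil \ \text{ if every SNP is bi-allelic.}
\]
% Assumed reading of "x% fewer sites than the worst case":
\[
  \text{reduction} \;=\; 1 - \frac{m}{N - 1}
\]
% Hypothetical worked example: N = 101, m = 20
\[
  \lceil \log_4 101 \rceil = 4, \qquad
  \lceil \log_2 101 \rceil = 7, \qquad
  N - 1 = 100, \qquad
  1 - \tfrac{20}{100} = 80\%.
\]
```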

Conclusions: VaST is open source software that, when provided with a set of variant sites, finds the minimum number of sites that will provide maximum resolution of a strain complex. It has many run-time options that accommodate a wide range of applications. VaST can be an effective tool in the design of strain identification panels that, when combined with TAS technologies, offer an efficient and inexpensive strain typing protocol.

Keywords: Bacterial strain typing; Single nucleotide polymorphisms; Targeted PCR Amplicon sequencing.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Competing interests

TNF, JWS, and VYF declare that they have applied for a patent for the truncated Y. pestis primer panel.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
VaST pipeline schematic. (a) Overview of the VaST pipeline. (b) The window (gray box) starts at the first site (115) and captures two additional sites (120 and 121). The amplicon (black box) extends from the first to the last variant site in the window, and the primer zones (arrows) extend outward in opposite directions. (c) The primer zone region is extracted from the full genome matrix and, for each position, the strains that are missing data (X) or have a base call that differs from the reference are counted. (d) A position in the primer zone is flagged (!) when the number of poorly conserved strains is greater than or equal to the strain cutoff value. (e) To pass the filter in this example, 20% of the primer zone positions must belong to a conserved segment that is longer than three positions. (f) The table shows the variant sequence features of the amplicons. (g) The resolution pattern of each amplicon is determined, and amplicons that contain redundant information are combined (e.g., Amplicons 3 and 4 into Pattern 3). For ambiguous (N) or missing (X) calls, all of the possibilities are enumerated and the strain simultaneously belongs to all of the feature categories that overlap with those of the other strains. The bottom row is the resolution score, r, for each pattern. The minimum spanning set algorithm favors patterns that evenly split up groups of strains. Using SNPs as an example, (h) is the best-case scenario, where N strains can be resolved with log₄(N) SNPs; however, (i) log₂(N) is more likely with bi-allelic SNPs. (j) In the worst case, highly unbalanced splitting can occur, which can require at most N−1 SNPs to resolve N strains. (k) The associated haplotypes for each of the minimum spanning sets in (h–j).
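The primer zone filter described in panels c to e can be sketched as follows. This is an illustrative reading of the caption, not VaST's code: the function and parameter names are hypothetical, and "longer than three positions" is modeled here as a minimum run length of four.

```python
def flag_positions(reference, calls, strain_cutoff):
    """One boolean per primer-zone position: True when the number of strains
    that are missing ('X') or differ from the reference reaches the cutoff."""
    flags = []
    for i, ref_base in enumerate(reference):
        poor = sum(1 for seq in calls.values()
                   if seq[i] == "X" or seq[i] != ref_base)
        flags.append(poor >= strain_cutoff)
    return flags


def conserved_fraction(flags, min_segment_len):
    """Fraction of positions inside an unflagged run of at least min_segment_len."""
    kept, run = 0, 0
    for flagged in flags + [True]:  # sentinel closes the final run
        if flagged:
            if run >= min_segment_len:
                kept += run
            run = 0
        else:
            run += 1
    return kept / len(flags)


def passes_filter(reference, calls, strain_cutoff, min_segment_len, min_fraction):
    """Apply both steps: flag poorly conserved positions, then test coverage."""
    flags = flag_positions(reference, calls, strain_cutoff)
    return conserved_fraction(flags, min_segment_len) >= min_fraction


# Hypothetical 10-position primer zone for three strains
reference = "ACGTACGTAC"
calls = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTXCGTAC",  # missing call at position index 4
    "s3": "ACGTTCGTAC",  # mismatch at position index 4
}
flags = flag_positions(reference, calls, strain_cutoff=2)
fraction = conserved_fraction(flags, min_segment_len=4)
ok = passes_filter(reference, calls, strain_cutoff=2,
                   min_segment_len=4, min_fraction=0.2)
```

Only the fifth position is flagged, so the zone splits into conserved runs of four and five positions, and nine of the ten positions count toward the conserved fraction.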
Fig. 2
Most of the resolution is achieved within the first few targets. Minimum spanning sets were generated for strains of Bacillus anthracis, Burkholderia pseudomallei, Escherichia coli, Francisella tularensis, Staphylococcus aureus, and Yersinia pestis. The plot shows how the resolution index (N_strains − average group size ± SD) increases with each additional site. The number of differentiable strains included in the panel design and the size of the minimum spanning set are indicated next to each plot. The dashed vertical lines indicate the number of sites expected in the worst case (N−1 sites).
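The resolution index named in this caption can be computed from a table of haplotypes. The sketch below assumes that "average group size" is the mean size of the groups of strains sharing a haplotype; the caption does not say whether the average is taken per group or per strain, so treat that choice, the function name, and the toy data as assumptions.

```python
from statistics import mean


def resolution_index(haplotypes):
    """Number of strains minus the mean size of the groups of strains that
    share an identical haplotype (assumed per-group average)."""
    groups = {}
    for strain, hap in haplotypes.items():
        groups.setdefault(hap, []).append(strain)
    sizes = [len(members) for members in groups.values()]
    return len(haplotypes) - mean(sizes)


# Hypothetical haplotypes at two sites, listed in the order the sites were added
calls = {"A": "AT", "B": "AG", "C": "GT", "D": "GG"}

# Resolution index after 0, 1, and 2 sites, analogous to one curve in the plot
curve = [resolution_index({n: h[:k] for n, h in calls.items()}) for k in range(3)]
```

Under this definition the index is 0 when no strains are separated and N−1 when every strain is in its own group.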
Fig. 3
The redundancy built into the minimum spanning set design makes it tolerant to missing sites. The plot shows how well the Yersinia pestis minimum spanning set tolerates missing sites. The x-axis is the number of missing sites and the y-axis is the expected resolution index. Each box plot shows the distribution of resolution values for different panels (N=50) with 1 to 20 sites randomly removed. The resolution index of the full panel is 269 and the median resolution when 20 sites are missing is 267.9, a difference of only 1.1.
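A missing-site experiment of this kind can be sketched by repeatedly dropping random sites from a panel and recomputing the resolution index. The code below is a toy illustration, not the authors' procedure: the function names, the seed, the per-group definition of average group size, and the three-site panel are all assumptions.

```python
import random
from statistics import mean, median


def resolution_index(haplotypes):
    """Strain count minus mean haplotype-group size (assumed definition)."""
    groups = {}
    for strain, hap in haplotypes.items():
        groups.setdefault(hap, []).append(strain)
    return len(haplotypes) - mean(len(m) for m in groups.values())


def drop_sites(haplotypes, n_missing, rng):
    """Return the haplotypes with n_missing randomly chosen sites removed."""
    n_sites = len(next(iter(haplotypes.values())))
    keep = sorted(rng.sample(range(n_sites), n_sites - n_missing))
    return {name: "".join(hap[i] for i in keep)
            for name, hap in haplotypes.items()}


def simulate(haplotypes, n_missing, n_panels=50, seed=0):
    """Resolution index of n_panels random panels, each missing n_missing sites."""
    rng = random.Random(seed)
    return [resolution_index(drop_sites(haplotypes, n_missing, rng))
            for _ in range(n_panels)]


# Hypothetical redundant panel: any two of the three sites resolve all strains
panel = {"A": "ATT", "B": "AGG", "C": "GTG", "D": "GGT"}
full = resolution_index(panel)
median_one_missing = median(simulate(panel, 1))
median_two_missing = median(simulate(panel, 2))
```

Because the toy panel carries one redundant site, losing any single site leaves the index unchanged, while losing two sites reduces it.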
Fig. 4
VaST identifies more targets than traditional MLST and provides greater strain resolution. The neighbor-joining tree was built using 5,000 SNPs from 159 strains of Staphylococcus aureus. The colors in the heatmap represent different strain groups, numbered 1 to 138. The MLST loci resolved only 41 groups, as indicated by the smaller range of colors, whereas VaST resolved 138 groups.
Fig. 5
The Y. pestis samples were correctly identified using the target sites identified by VaST. The figure shows the placement and resolution of the sample strains on a neighbor-joining tree produced using the full SNP matrix (11,249 SNPs). The group of strains indicated for each sample represents the strains that were most similar to the sample strain at each of the targets analyzed in the truncated panel. The branch lengths indicate the number of SNP differences.
