BMC Genomics. 2019 May 6;20(1):338.
doi: 10.1186/s12864-019-5723-0.

BacTag - a pipeline for fast and accurate gene and allele typing in bacterial sequencing data based on database preprocessing

Lusine Khachatryan et al. BMC Genomics. 2019.

Abstract

Background: Bacteria carry a wide array of genes, some of which have multiple alleles. These different alleles are often responsible for distinct types of virulence and can determine classification at the subspecies level (e.g., housekeeping genes for Multi Locus Sequence Typing, MLST). Therefore, it is important to rapidly detect not only the gene of interest, but also the relevant allele. Current sequencing-based methods are limited to mapping reads to each of the known allele reference sequences, which is a time-consuming procedure.

Results: To address this limitation, we developed BacTag, a pipeline that rapidly and accurately detects which genes are present in a sequencing dataset and reports the allele of each identified gene. We exploit the fact that different alleles of the same gene are highly similar. Instead of mapping the reads to each of the allele reference sequences, we preprocess the database prior to the analysis, which makes the subsequent gene and allele identification efficient. During the preprocessing, we determine a representative reference sequence for each gene and store the differences between all alleles and this chosen reference. During the analysis, we estimate whether the gene is present in the sequencing data by mapping the reads to this reference sequence; if the gene is found, we compare the observed variants to those in the preprocessed database. This allows us to detect which specific allele is present in the sequencing data. Our pipeline was successfully tested on artificial WGS E. coli, S. pseudintermedius, P. gingivalis, M. bovis, Borrelia spp. and Streptomyces spp. data and on real WGS E. coli and K. pneumoniae data, in order to report alleles of MLST housekeeping genes.
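
The preprocessing idea described above can be illustrated with a small toy sketch. This is not the BacTag implementation: it assumes all alleles of a gene have equal length and differ only by substitutions, it picks the representative reference as the allele with the smallest total Hamming distance to the others (the abstract does not say how the reference is chosen), and the names preprocess, type_allele and the adk_* sequences are hypothetical. Read mapping and variant calling are left out; the observed variants are simply given as input.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def preprocess(alleles):
    """Pick a reference allele and index every allele by its differences.

    alleles: dict mapping allele name -> sequence (equal lengths assumed).
    Returns (reference_name, table), where table maps a frozenset of
    (position, alternative_base) substitutions to an allele name.
    """
    # Assumption: the representative is the allele closest to all others;
    # ties are broken by name so the choice is deterministic.
    ref_name = min(
        alleles,
        key=lambda n: (sum(hamming(alleles[n], s) for s in alleles.values()), n),
    )
    ref = alleles[ref_name]
    table = {}
    for name, seq in alleles.items():
        variants = frozenset(
            (i, b) for i, (a, b) in enumerate(zip(ref, seq)) if a != b
        )
        table[variants] = name  # identical sequences would collide here
    return ref_name, table


def type_allele(table, observed_variants):
    """Look up the allele whose stored differences match the observed ones."""
    return table.get(frozenset(observed_variants))


# Hypothetical toy database for one gene.
toy_alleles = {
    "adk_1": "ACGTACGT",
    "adk_2": "ACGTACGA",
    "adk_3": "ACCTACGA",
}
ref_name, table = preprocess(toy_alleles)
exact_ref = type_allele(table, [])              # no variants: the reference allele
one_snp = type_allele(table, [(2, "C")])        # a single known substitution
unknown = type_allele(table, [(0, "T")])        # variant set not in the database
```

Because the differences are computed once per database, typing a sample in this sketch reduces to one mapping against a single reference per gene followed by a dictionary lookup, rather than one mapping per allele.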

Conclusions: We developed a new pipeline for fast and accurate gene and allele recognition based on database preprocessing and parallel computing; it performed better than or comparably to the current popular tools. We believe that our approach can be useful for a wide range of projects, including bacterial subspecies classification, clinical diagnostics of bacterial infections, and epidemiological studies.

Keywords: Allele typing; Database preprocessing; Multi-locus sequence typing; Next-generation sequencing.

Conflict of interest statement

Ethics approval and consent to participate

Since no human material or clinical records of patients or volunteers were used in this research, it is out of scope for a medical ethical committee. This was verified by the Leiden University Medical Center Medical Ethical Committee.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Schematic representation of the database preprocessing. All processes are illustrated for one gene. Calculations for several genes are done independently in parallel
Fig. 2
Schematic representation of the analysis part of the BacTag pipeline. All processes are illustrated for one gene. Calculations for multiple genes are done independently in parallel. The analysis of the low-similarity group of sequences is highlighted by the dashed box and can be manually turned off by the user to save time
Fig. 3
Dependence of the database preprocessing time on the number of sequences in the database
Fig. 4
Time required for the analysis of 30 samples belonging to ST131 by two modes of BacTag (a and b), MLST 1.8 (c) and Enterobase (d)
Fig. 5
Comparison of the processing time required for the Achtman seven-gene MLST analysis of 30 WGS E. coli samples
