Genes (Basel). 2021 Apr 26;12(5):644.
doi: 10.3390/genes12050644.

Species-Specific Quality Control, Assembly and Contamination Detection in Microbial Isolate Sequences with AQUAMIS

Carlus Deneke et al. Genes (Basel). 2021.

Abstract

Sequencing of whole microbial genomes has become a standard procedure for cluster detection, source tracking, outbreak investigation and surveillance of many microorganisms. An increasing number of laboratories are currently transitioning from classical methods to next generation sequencing, generating unprecedented amounts of data. Since the precision of downstream analyses depends significantly on the quality of the raw data generated on the sequencing instrument, a comprehensive, meaningful primary quality control is indispensable. Here, we present AQUAMIS, a Snakemake workflow for extensive quality control and assembly of raw Illumina sequencing data, allowing laboratories to automate the initial analysis of their microbial whole-genome sequencing data. AQUAMIS performs all steps of primary sequence analysis: read trimming, read quality control (QC), taxonomic classification, de novo assembly, reference identification, assembly QC and contamination detection, the latter at both the read and the assembly level. The results are visualized in an interactive HTML report that includes species-specific QC thresholds, allowing non-bioinformaticians to assess the quality of sequencing experiments at a glance. All results are also available as a standard-compliant JSON file, facilitating downstream analyses and data exchange. We have applied AQUAMIS to ~13,000 microbial isolates as well as ~1000 in-silico contaminated datasets, demonstrating the workflow's ability to perform in high-throughput routine sequencing environments and to reliably predict contamination. We found that intergenus and intragenus contamination can be detected most accurately using a combination of the different QC metrics available within AQUAMIS.

Keywords: assembly; contamination; interoperability; isolate sequencing; next generation sequencing; pipeline; quality control; reproducibility; whole genome sequencing.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Workflow of AQUAMIS. Raw reads of all datasets are trimmed and quality assessed using fastp. Based on the trimmed reads, contigs are assembled and contamination is searched for, both via taxonomic classification with abundance estimation and via a gene-based approach (purple fields). Based on the assembled contigs, the closest reference to each sample is identified, the assembly quality is assessed, multi-locus sequence typing is performed and contamination is again detected via taxonomic classification (green fields). The results are presented both in an interactive, configurable R Markdown report and in a structured, computer-readable JSON file. An example report is available at https://bfr_bioinformatics.gitlab.io/AQUAMIS/report_test_data/assembly_report.html, accessed on 26 April 2021.
Figure 2
Boxplots of quality control metrics for manually curated in-house data for Campylobacter, Escherichia, Listeria and Salmonella. The boxes display the median as well as the 25% and 75% quantiles. The whiskers extend to the 5% and 95% quantiles. Values outside this range are considered outliers and represent potential contamination.
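The quantile convention described in this caption can be illustrated with a minimal Python sketch. This is not code from AQUAMIS; the function names, the choice of the "inclusive" interpolation method and the example values are assumptions made purely for illustration.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Return the five quantiles named in the caption (5%, 25%, 50%, 75%, 95%).

    Assumption: the 'inclusive' interpolation method is used; the
    source does not say which quantile definition was applied.
    """
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = quantiles(values, n=20, method="inclusive")
    return {
        "q05": cuts[0],
        "q25": cuts[4],
        "median": cuts[9],
        "q75": cuts[14],
        "q95": cuts[18],
    }


def flag_outliers(values):
    """Return values below the 5% or above the 95% quantile."""
    summary = boxplot_summary(values)
    return [v for v in values if v < summary["q05"] or v > summary["q95"]]


# Hypothetical metric values (e.g. contig counts); not data from the paper.
example_values = list(range(1, 101)) + [1000]
example_summary = boxplot_summary(example_values)
example_outliers = flag_outliers(example_values)
```

With this convention, roughly 10% of a curated dataset falls outside the whiskers by construction, so a flagged value is a prompt for closer inspection rather than proof of contamination.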
Figure 3
Selected QC results for the contamination datasets (all species). Values are shown according to their mixing ratio. Points are colored by species, and the shape indicates the contamination type (intra, inter, self). The bars show the applied threshold values (black if they apply to all species, colored if species-specific).
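As a rough illustration of how thresholds that apply to all species could be combined with species-specific ones, consider the following Python sketch. It does not reproduce the AQUAMIS implementation or its JSON schema: the metric names, the threshold values and the function `evaluate_sample` are all hypothetical.

```python
# Hypothetical thresholds as (minimum, maximum); None means "no bound".
# These numbers are illustrative and are NOT taken from the paper.
GLOBAL_THRESHOLDS = {
    "q30_rate": (0.80, None),
}
SPECIES_THRESHOLDS = {
    "Salmonella": {"assembly_length": (4.3e6, 5.3e6)},
    "Listeria": {"assembly_length": (2.7e6, 3.2e6)},
}


def evaluate_sample(metrics, genus):
    """Label each thresholded metric as 'pass', 'fail' or 'missing'.

    Species-specific thresholds override global ones when both
    define the same metric.
    """
    thresholds = {**GLOBAL_THRESHOLDS, **SPECIES_THRESHOLDS.get(genus, {})}
    results = {}
    for metric, (lower, upper) in thresholds.items():
        value = metrics.get(metric)
        if value is None:
            results[metric] = "missing"
            continue
        within = (lower is None or value >= lower) and (
            upper is None or value <= upper
        )
        results[metric] = "pass" if within else "fail"
    return results


# Hypothetical sample: good read quality, but an unexpectedly long assembly.
example_result = evaluate_sample(
    {"q30_rate": 0.90, "assembly_length": 6.0e6}, "Salmonella"
)
```

An assembly much longer than expected for the genus is the kind of signal that might accompany a mixed sample, which is why a length bound appears in this sketch.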
Figure 4
Comparison of taxonomic classification based on reads and on contigs. The normalized coverage depth is plotted against the contig length for all contigs, from a set of 2721 Salmonella enterica isolates, whose taxonomic classification differs between reads and contigs. Circles denote contigs predicted to be of chromosomal origin and triangles denote contigs predicted to originate from plasmids. The coloring indicates the genus assigned to the contig by taxonomic classification. Short contigs originating from plasmids frequently occur at high copy number and are associated with genera other than Salmonella.
