Validation of a Bioinformatics Workflow for Routine Analysis of Whole-Genome Sequencing Data and Related Challenges for Pathogen Typing in a European National Reference Center: Neisseria meningitidis as a Proof-of-Concept

PMID: 30894839
PMCID: PMC6414443
DOI: 10.3389/fmicb.2019.00362

Bert Bogaerts et al. Front Microbiol. 2019.

Front Microbiol. 2019 Mar 6;10:362.

doi: 10.3389/fmicb.2019.00362. eCollection 2019.

Affiliations

¹ Transversal Activities in Applied Genomics, Sciensano, Brussels, Belgium.
² Bacterial Diseases, Sciensano, Brussels, Belgium.

Abstract

Although whole-genome sequencing (WGS) is a well-established research method, its use for routine molecular typing and pathogen characterization remains a substantial challenge due to the required bioinformatics resources and/or expertise. Moreover, many national reference laboratories and centers, as well as other laboratories working under a quality system, require extensive validation to demonstrate that employed methods are "fit-for-purpose" and provide high-quality results. A harmonized framework with guidelines for the validation of WGS workflows does not yet exist, however, despite several recent case studies highlighting the urgent need for one. We present a validation strategy focusing specifically on the exhaustive characterization of the bioinformatics analysis of a WGS workflow designed to replace conventionally employed molecular typing methods for microbial isolates in a representative small-scale laboratory, using the pathogen Neisseria meningitidis as a proof-of-concept. We adapted several classically employed performance metrics specifically toward three different bioinformatics assays: resistance gene characterization (based on the ARG-ANNOT, ResFinder, CARD, and NDARO databases), several commonly employed typing schemes (including, among others, core genome multilocus sequence typing), and serogroup determination. We analyzed a core validation dataset of 67 well-characterized samples that were typed by classical genotypic and/or phenotypic methods and sequenced in-house, allowing us to evaluate the repeatability, reproducibility, accuracy, precision, sensitivity, and specificity of the different bioinformatics assays. We also analyzed an extended validation dataset composed of publicly available WGS data for 64 samples by comparing results of the different bioinformatics assays against results obtained from commonly used bioinformatics tools. We demonstrate high performance, with values for all performance metrics >87%, >97%, and >90% for the resistance gene characterization, sequence typing, and serogroup determination assays, respectively, for both validation datasets. Our WGS workflow has been made publicly available as a "push-button" pipeline for Illumina data at https://galaxy.sciensano.be to showcase its implementation for non-profit and/or academic usage. Our validation strategy can be adapted to other WGS workflows for other pathogens of interest, and it demonstrates the added value and feasibility of integrating WGS into routine use in an applied public health setting.
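The abstract lists accuracy, precision, sensitivity, and specificity among the evaluated metrics but does not give their formulas, and it states that the metrics were adapted to each assay. As a minimal sketch, assuming the standard textbook definitions based on counts of true/false positives and negatives (the class and function names below are hypothetical and not taken from the workflow), these four metrics could be computed as follows:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of assay outcomes against a reference standard (hypothetical container)."""
    tp: int  # true positives
    fn: int  # false negatives
    tn: int  # true negatives
    fp: int  # false positives


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a metric is undefined (empty denominator).
    return numerator / denominator if denominator else float("nan")


def performance_metrics(c: ConfusionCounts) -> dict:
    """Textbook definitions; the paper's assay-specific adaptations may differ."""
    total = c.tp + c.tn + c.fp + c.fn
    return {
        "accuracy": _ratio(c.tp + c.tn, total),
        "precision": _ratio(c.tp, c.tp + c.fp),
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
    }


if __name__ == "__main__":
    # Illustrative counts only, not results from the study.
    for name, value in performance_metrics(ConfusionCounts(tp=90, fn=10, tn=95, fp=5)).items():
        print(f"{name}: {value:.3f}")
```

How a true positive or false negative is defined for each of the three assays is described in the full text and is not reproduced here.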

Keywords: Neisseria meningitidis; national reference center; public health; validation; whole-genome sequencing.

Figures

**FIGURE 1**
Overview of the bioinformatics workflow. Each box represents a component corresponding to a series of tasks that provide a certain well-defined functionality (indicated in bold). Major bioinformatics utilities employed in each module are also mentioned (indicated in italics). Abbreviations: paired-end (PE).

**FIGURE 2**
Reproducibility of the sequence typing assay for the core validation dataset. The abscissa depicts the sequencing runs that are being compared, while the ordinate represents the percentage of cgMLST loci that were concordant between the same samples of different sequencing runs. Note that the ordinate starts at 94% instead of 0% to enable illustrating the variation between run comparisons more clearly. Each comparison is presented as a boxplot based on 67 samples where the boundary of the box closest to the abscissa indicates the 25th percentile, the thick line inside the box indicates the median, and the boundary of the box farthest from the abscissa indicates the 75th percentile. See also Supplementary Table S9 for detailed values for all samples and sequencing runs.
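The quantity plotted in Figure 2, the percentage of cgMLST loci concordant between the same sample in two sequencing runs, can be illustrated with a short sketch. It assumes that allele calls are stored as a mapping from locus name to allele identifier, with `None` for a locus without a call, and that two identical calls (including two missing calls) count as concordant. The data layout, the function name, and the handling of missing calls are all assumptions for illustration, not details from the paper.

```python
from typing import Mapping, Optional, Sequence


def percent_concordant(
    run_a: Mapping[str, Optional[str]],
    run_b: Mapping[str, Optional[str]],
    loci: Sequence[str],
) -> float:
    """Percentage of scheme loci with identical allele calls in both runs.

    Assumption: a locus absent from a mapping is treated as having no call (None).
    """
    if not loci:
        raise ValueError("loci must not be empty")
    same = sum(1 for locus in loci if run_a.get(locus) == run_b.get(locus))
    return 100.0 * same / len(loci)


if __name__ == "__main__":
    # Toy allele calls for one sample in two runs (illustrative values only).
    loci = ["locus_1", "locus_2", "locus_3", "locus_4"]
    run_a = {"locus_1": "1", "locus_2": "7", "locus_3": None, "locus_4": "3"}
    run_b = {"locus_1": "1", "locus_2": "8", "locus_3": None, "locus_4": "3"}
    print(percent_concordant(run_a, run_b, loci))
```

Computing this value for every sample and every pair of runs would give the per-comparison distributions that a boxplot such as Figure 2 summarizes.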

**FIGURE 3**
Database standard results of the sequence typing assay for the core validation dataset. The abscissa depicts the sequencing run, while the ordinate represents the percentages of cgMLST loci as indicated by the title above each graph. Each sequencing run is presented as a boxplot based on 67 samples (see the legend of Figure 2 for a brief explanation). The upper left graph depicts the percentage of concordant cgMLST loci, i.e., where our workflow identified the same allele as the database standard, which were classified as TPs. Note that the ordinate starts at 93% instead of 0% to enable illustrating the results more clearly. All other cases were classified as FNs, and encompass three categories. First, the upper right graph depicts the percentage of cgMLST loci for which our workflow detected a different allele than present in the database standard. Second, the lower left graph depicts the percentage of cgMLST loci for which our workflow did not detect any allele but an allele was nevertheless present in the database standard. Third, the lower right graph depicts the percentage of cgMLST loci for which our workflow detected an allele but for which no allele was present in the database standard. Most FNs are explained by no information being present in the database standard, followed by an actual mismatch, and only a few cases are due to our workflow failing to detect an allele. See also Supplementary Table S10 for detailed values for all samples and runs.
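The four panels of Figure 3 correspond to one TP category and three FN categories per locus. A minimal sketch of that classification is given below, assuming allele calls are stored as a mapping from locus name to allele identifier with `None` meaning no allele. The category labels and function names are hypothetical, and the `both_missing` case is not described in the legend; it is included here only so that the function is total.

```python
from collections import Counter
from typing import Mapping, Optional, Sequence


def classify_locus(workflow: Optional[str], standard: Optional[str]) -> str:
    """Assign one locus to a category following the Figure 3 legend."""
    if workflow is None and standard is None:
        return "both_missing"  # assumption: not covered by the legend
    if standard is None:
        return "fn_standard_missing"  # workflow called an allele, database standard has none
    if workflow is None:
        return "fn_workflow_missing"  # database standard has an allele, workflow found none
    if workflow == standard:
        return "tp_concordant"
    return "fn_mismatch"  # both called, but different alleles


def summarize(
    workflow_calls: Mapping[str, Optional[str]],
    standard_calls: Mapping[str, Optional[str]],
    loci: Sequence[str],
) -> Counter:
    """Count loci per category for one sample."""
    return Counter(
        classify_locus(workflow_calls.get(locus), standard_calls.get(locus))
        for locus in loci
    )


if __name__ == "__main__":
    # Toy calls for one sample (illustrative values only).
    loci = ["locus_1", "locus_2", "locus_3", "locus_4"]
    workflow = {"locus_1": "1", "locus_2": "7", "locus_3": None, "locus_4": "3"}
    standard = {"locus_1": "1", "locus_2": "8", "locus_3": "2", "locus_4": None}
    print(summarize(workflow, standard, loci))
```

Dividing each count by the number of loci in the scheme would give per-sample percentages of the kind shown in each panel.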

**FIGURE 4**
Tool standard results of the sequence typing assay for the core validation dataset. The abscissa depicts the sequencing run, while the ordinate represents the percentages of cgMLST loci as indicated by the title above each graph. Each sequencing run is presented as a boxplot based on 67 samples (see the legend of Figure 2 for a brief explanation). The upper left graph depicts the percentage of concordant cgMLST loci, i.e., where our workflow identified the same allele as the tool standard, which were classified as TPs. Note that the ordinate starts at 98% instead of 0% to enable illustrating the results more clearly. All other cases were classified as FNs, and encompass two categories. First, the upper right graph depicts the percentage of cgMLST loci for which our workflow identified multiple perfect hits, of which at least one corresponded to the tool standard but was reported differently. Second, the lower left graph depicts the percentage of cgMLST loci for which our workflow detected a different allele compared to the tool standard. Most FNs are therefore explained by a different manner of handling multiple perfect hits, and only a small minority are due to an actual mismatch between our workflow and the tool standard. Furthermore, upon closer inspection, these mismatches were due to an artifact of the reference tool used for the tool standard that has been resolved in the meantime (see Supplementary Figure S2). See also Supplementary Table S11 for detailed values for all samples and runs.
