Nat Commun. 2025 Apr 28;16(1):3961.

doi: 10.1038/s41467-025-59246-8.

Multi-country and intersectoral assessment of cluster congruence between pipelines for genomics surveillance of foodborne pathogens

Verónica Mixão¹, Miguel Pinto¹, Holger Brendebach², Daniel Sobral¹, João Dourado Santos¹, Nicolas Radomski³, Anne Sophie Majgaard Uldall⁴, Arkadiusz Bomba⁵, Michael Pietsch⁶, Andrea Bucciacchio³, Andrea de Ruvo³,⁷, Pierluigi Castelli³, Ewelina Iwan⁵, Sandra Simon⁶, Claudia E Coipan⁸, Jörg Linde⁹, Liljana Petrovska¹⁰, Rolf Sommer Kaas¹¹, Katrine Grimstrup Joensen⁴, Sofie Holtsmark Nielsen⁴, Kristoffer Kiil⁴, Karin Lagesen¹², Adriano Di Pasquale³, João Paulo Gomes¹,¹³, Carlus Deneke², Simon H Tausch², Vítor Borges¹⁴

Affiliations

¹ Genomics and Bioinformatics Unit, Department of Infectious Diseases, National Institute of Health Doutor Ricardo Jorge (INSA), Lisbon, Portugal.
² National Study Center for Sequencing, Department of Biological Safety, German Federal Institute for Risk Assessment (BfR), Berlin, Germany.
³ National Reference Centre (NRC) for Whole Genome Sequencing of microbial pathogens: database and bioinformatics analysis (GENPAT), Istituto Zooprofilattico Sperimentale dell'Abruzzo e del Molise (IZSAM), Teramo, Italy.
⁴ Department of Bacteria, Parasites & Fungi, Statens Serum Institut (SSI), Copenhagen, Denmark.
⁵ Department of Omics Analyses, National Veterinary Research Institute (PIWet), Puławy, Poland.
⁶ Unit of Enteropathogenic Bacteria and Legionella, Robert Koch Institute (RKI), Wernigerode, Germany.
⁷ Computer Science, Gran Sasso Science Institute, L'Aquila, Italy.
⁸ Department for Infectious Diseases, Epidemiology and Surveillance, National Institute for Public Health and the Environment (RIVM), Bilthoven, The Netherlands.
⁹ Institute of Bacterial Infections and Zoonoses, Friedrich-Loeffler-Institute (FLI), Jena, Germany.
¹⁰ Animal and Plant Health Agency (APHA), Addlestone, Surrey, UK.
¹¹ National Food Institute, Technical University of Denmark (DTU), Lyngby, Denmark.
¹² Section for Epidemiology, Norwegian Veterinary Institute (NVI), Ås, Norway.
¹³ Veterinary and Animal Research Center (CECAV), Faculty of Veterinary Medicine, Lusófona University, Lisbon, Portugal.
¹⁴ Genomics and Bioinformatics Unit, Department of Infectious Diseases, National Institute of Health Doutor Ricardo Jorge (INSA), Lisbon, Portugal. vitor.borges@insa.min-saude.pt.

PMID: 40295532
PMCID: PMC12038046
DOI: 10.1038/s41467-025-59246-8

Verónica Mixão et al. Nat Commun. 2025.

Abstract

Different laboratories employ different Whole-Genome Sequencing (WGS) pipelines for Food and Waterborne disease (FWD) surveillance, casting doubt on the comparability of their results and hindering optimal communication at intersectoral and international levels. Through a collaborative effort involving eleven European institutes spanning the food, animal, and human health sectors, we aimed to assess the inter-pipeline clustering congruence across all resolution levels and perform an in-depth comparative analysis of cluster composition at outbreak level for four important foodborne pathogens: Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni. We found a general concordance between allele-based pipelines for all species, except for C. jejuni, where the different resolution power of allele-based schemas led to marked discrepancies. Still, we identified non-negligible differences in outbreak detection and demonstrated how a threshold flexibilization favors the detection of similar outbreak signals by different laboratories. These results, together with the observation that different traditional typing groups (e.g., serotypes) exhibit a remarkably different genetic diversity, represent valuable information for future outbreak case-definitions and WGS-based nomenclature design. This study reinforces the need, while demonstrating the feasibility, of conducting continuous pipeline comparability assessments, and opens good perspectives for a smoother international and intersectoral cooperation towards an efficient One Health FWD surveillance.
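The effect of threshold flexibilization described in the abstract can be illustrated with a minimal sketch. It is not the study's code: the sample names and distance values are invented, single-linkage grouping is assumed for simplicity, and the helper names `single_linkage_clusters` and `detected_within` are hypothetical. The thresholds of 7 and 9 allelic differences (ADs) mirror those given for *L. monocytogenes* in the Fig. 3 caption.

```python
from itertools import combinations


def single_linkage_clusters(samples, dist, threshold):
    """Return clusters (>= 2 samples) linking every pair at distance <= threshold."""
    parent = {s: s for s in samples}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(samples, 2):
        if dist[frozenset((a, b))] <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in samples:
        groups.setdefault(find(s), set()).add(s)
    return {frozenset(g) for g in groups.values() if len(g) > 1}


def detected_within(cluster, samples, dist, max_threshold):
    """Lowest threshold <= max_threshold giving exactly `cluster`, else None."""
    for t in range(max_threshold + 1):
        if cluster in single_linkage_clusters(samples, dist, t):
            return t
    return None


def as_matrix(pairs):
    # Store each unordered pair once, keyed by a frozenset.
    return {frozenset(k): v for k, v in pairs.items()}


# Invented allelic distances for the same four samples in two hypothetical pipelines.
samples = ["s1", "s2", "s3", "s4"]
pipeline_a = as_matrix({("s1", "s2"): 3, ("s1", "s3"): 7, ("s2", "s3"): 6,
                        ("s1", "s4"): 40, ("s2", "s4"): 41, ("s3", "s4"): 42})
pipeline_b = as_matrix({("s1", "s2"): 4, ("s1", "s3"): 9, ("s2", "s3"): 8,
                        ("s1", "s4"): 43, ("s2", "s4"): 44, ("s3", "s4"): 45})

clusters_a = single_linkage_clusters(samples, pipeline_a, 7)
target = next(iter(clusters_a))                             # cluster seen by pipeline A at 7 ADs
strict = detected_within(target, samples, pipeline_b, 7)    # same threshold in pipeline B
relaxed = detected_within(target, samples, pipeline_b, 9)   # flexibilized threshold
print(sorted(target), strict, relaxed)
```

In this toy case, pipeline B recovers the cluster found by pipeline A only once its threshold is relaxed beyond 7 ADs, which is the kind of discrepancy a flexible threshold is meant to absorb.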

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

**Fig. 1. Summary of the different countries and sectors involved in the assessment of pipeline cluster congruence.**
The diversity of pipelines used for FWD surveillance is indicated per country, sector, and species.

**Fig. 2. Assessment of allele-based clustering at all possible threshold levels for *L. monocytogenes* and comparison with traditional MLST.**
a Composition of the *L. monocytogenes* dataset used in this study in terms of ST in comparison with datasets of previous studies (Maury et al. 2016 and Moura et al. 2016), the LiSeq project and the BIGSdb database, as of November 2021. A GrapeTree visualization of the MST obtained with the INNUENDO-like pipeline is shown. Nodes (i.e., samples) are collapsed at the threshold with the highest congruence with CC (508 ADs for this pipeline) and colored according to the ST classification. b Number of partitions obtained by each pipeline at each possible distance threshold. c Clustering stability regions determined for each pipeline. To better distinguish each region (represented by separated rectangle blocks), the different blocks are vertically phased, starting in a different line. Distance thresholds (x axis) are presented in log2 scale. d Barplot (top) with the number of samples of the top represented STs (≥50 samples) in *L. monocytogenes* dataset, with a swarmplot (bottom) indicating the AD threshold at which each pipeline clusters together all samples of each ST. e Distribution of the AD thresholds at which each pipeline clusters together all samples of a given ST (n = 219). Boxplots show the interquartile range (25% to 75%) and median, and whiskers extend 1.5 times the range, with outliers (diamond symbol) plotted separately. The outlier STs are indicated above the respective symbol. Source data are provided as a Source Data file.

**Fig. 3. Cluster congruence at all threshold levels and overlap in detecting outbreak signals for *L. monocytogenes*.**
a Heatmap with the CS of two pipelines (details on each pairwise comparison are in Supplementary Data 2, with chewieSnake vs. Bionumerics using the HC algorithm being presented here as an example). The inverted dendrogram (i.e., from the highest to the lowest resolution) and dashed red lines illustrate how the congruence is related to the dataset’s phylogenetic structure (dendrogram obtained with Bionumerics and visualized in auspice.us). b Zoom-in on the high-resolution level highlighted in the orange square of (a). c Bi-directional corresponding points (gray lines) connecting thresholds providing similar clustering in the two pipelines exemplified in (a). d Illustrative linear trend lines expected for the corresponding points with a slope deviation of 10% and 20% to be used as scale reference for the boxplots. Boxplots present the slope distribution for allele vs. allele (orange, n = 58) and SNP vs. SNP (blue, n = 22) pipeline comparisons for the linear trend lines with an r² ≥ 0.99, illustrated in Supplementary Data 2 and detailed in Supplementary Data 6 (“n” refers to the number of comparisons with r² ≥ 0.99 over the total number of comparisons). The boxplot of the allele vs. SNP scenario is not presented due to the low number of comparisons with r² ≥ 0.99 (Supplementary Data 6). e Density of the distance thresholds required for the identification of clusters detected by at least one allele-based pipeline at 7 ADs. Only clusters having the same composition in all allele-based pipelines were included (n = 316). f Distribution of the difference between the minimum and maximum AD threshold needed to detect the same clusters across allele-based pipelines, using the clusters of (e) (n = 316). g Overlap between the genetic clusters detected at 7 ADs. h Overlap between the genetic clusters detected by one pipeline at 7 ADs and those detected by the others at ≤ 9 ADs.
Boxplots in (d) and (f) show the interquartile range and median, and whiskers extend 1.5 times the range, with outliers plotted separately. Source data are provided as a Source Data file.

**Fig. 4. Assessment of allele-based clustering at all possible threshold levels for *S. enterica* and comparison with traditional MLST and serotype.**
a Composition of the *S. enterica* dataset used in this study in terms of serotype and in comparison with datasets of previous studies (INNUENDO and BioProject PRJEB20997), and the Enterobase database, as of November 2021. A GrapeTree visualization of the MST obtained with the INNUENDO-like-INNUENDO99 pipeline is shown. Nodes (i.e., samples) are collapsed at the threshold with highest congruence with serotype (1514 ADs for this pipeline) and colored according to the ST classification. b Number of partitions obtained by each pipeline at each possible distance threshold. c Clustering stability regions determined for each pipeline. To better distinguish each region (represented by separated rectangle blocks), the different blocks are vertically phased, starting in a different line. Distance thresholds (x axis) are presented in log2 scale. d Barplot (top) with the number of samples of the top represented serotypes (≥50 samples) in *S. enterica* dataset, with a swarmplot (bottom) indicating the AD threshold at which each pipeline clusters together all samples of each serotype. Source data are provided as a Source Data file.

**Fig. 5. Cluster congruence at all threshold levels and overlap in detecting outbreak signals for *S. enterica*.**
a Heatmap with the CS of two pipelines (details on each pairwise comparison are in Supplementary Data 10, with Bionumerics vs. chewieSnake using the HC algorithm being presented here as an example). The inverted dendrogram (i.e., from the highest to the lowest resolution) and dashed red lines illustrate how the congruence is related to the dataset’s phylogenetic structure (dendrogram obtained with chewieSnake and visualized in auspice.us). b Zoom-in on the high-resolution level highlighted in orange in (a). c Bi-directional corresponding points (gray lines) connecting thresholds providing similar clustering in the two pipelines exemplified in (a). d Illustrative linear trend lines expected for the corresponding points with a slope deviation of 10% and 20% to be used as scale reference for the boxplots. The boxplot presents the slope distribution for allele vs. allele (orange, n = 90) pipeline comparisons for the linear trend lines with r² ≥ 0.99, illustrated in Supplementary Data 10 and detailed in Supplementary Data 14 (“n” refers to the number of comparisons with r² ≥ 0.99 over the total number of comparisons). The boxplots of the SNP vs. SNP and allele vs. SNP scenarios are not presented due to the low number of comparisons with r² ≥ 0.99 (Supplementary Data 14). e Density of the distance thresholds required for the identification of clusters detected by at least one allele-based pipeline at 14 ADs. Only clusters having the same composition in all allele-based pipelines were included (n = 255). f Distribution of the difference between the minimum and maximum AD threshold needed to detect the same clusters across allele-based pipelines, using the clusters of (e) (n = 255). g Overlap between the genetic clusters detected at 14 ADs. h Overlap between the genetic clusters detected by one pipeline at 14 ADs and those detected by the others at ≤ 16 ADs.
Boxplots in (d) and (f) show the interquartile range and median, and whiskers extend 1.5 times the range, with outliers plotted separately. Source data are provided as a Source Data file.

**Fig. 6. Assessment of allele-based clustering at all possible threshold levels for *E. coli* and comparison with traditional MLST and serotype.**
a Composition of the *E. coli* dataset used in this study in terms of serotype in comparison with the composition of the datasets of previous studies (INNUENDO and BioProject PRJNA230969), and the Enterobase database, as of November 2021. A GrapeTree visualization of the MST obtained with the INNUENDO-like-INNUENDO99 pipeline is shown. Nodes (i.e., samples) are collapsed at the threshold with the highest congruence with serotype (620 ADs for this pipeline) and colored according to the ST classification. b Number of partitions obtained by each pipeline at each possible distance threshold. c Clustering stability regions determined for each pipeline. To better distinguish each region (represented by separate rectangular blocks), the different blocks are vertically phased, starting in a different line. Distance thresholds (x axis) are presented in log2 scale. d Barplot (top) with the number of samples of the most represented serotype (O157:H7) and ST (ST11) in the *E. coli* dataset, with a swarmplot (bottom) indicating the AD threshold at which each pipeline clusters together all samples of each of them. Source data are provided as a Source Data file.

**Fig. 7. Cluster congruence at all threshold levels and overlap in detecting outbreak signals for *E. coli*.**
a Heatmap with the CS of two pipelines (details on each pairwise comparison are in Supplementary Data 19, with INNUENDO-like-Enterobase vs. INNUENDO-like-INNUENDO99 using the HC algorithm being presented here as an example). The inverted dendrogram (i.e., from the highest to the lowest resolution) and dashed red lines illustrate how the congruence is related to the dataset’s phylogenetic structure (dendrogram obtained with INNUENDO-like-INNUENDO99 and visualized in auspice.us). b Zoom-in on the high-resolution level highlighted in orange in (a). c Bi-directional corresponding points (gray lines) connecting thresholds providing similar clustering in the two pipelines exemplified in (a). d Illustrative linear trend lines expected for the corresponding points with a slope deviation of 10% and 20% to be used as scale reference for the boxplots. The boxplot presents the slope distribution for allele vs. allele (orange, n = 68) pipeline comparisons for the linear trend lines with r² ≥ 0.99, illustrated in Supplementary Data 19 and detailed in Supplementary Data 23 (“n” refers to the number of comparisons with r² ≥ 0.99 over the total number of comparisons). The boxplot of the allele vs. SNP scenario is not presented due to the low number of comparisons with r² ≥ 0.99 (Supplementary Data 23). e Density of the distance thresholds required for the identification of clusters detected by at least one allele-based pipeline at 9 ADs. Only clusters having the same composition in all allele-based pipelines were included (n = 185). f Distribution of the difference between the minimum and maximum AD threshold needed to detect the same clusters across allele-based pipelines, using the clusters of (e) (n = 185). g Overlap between the genetic clusters detected at 9 ADs. h Overlap between the genetic clusters detected by one pipeline at 9 ADs and those detected by the others at ≤ 12 ADs.
Boxplots in (d) and (f) show the interquartile range and median, and whiskers extend 1.5 times the range, with outliers plotted separately. Source data are provided as a Source Data file.

**Fig. 8. Assessment of allele-based clustering at all possible threshold levels for *C. jejuni* and comparison with traditional MLST.**
a Composition of the *C. jejuni* dataset used in this study in terms of CC and in comparison with the composition of the INNUENDO dataset and the PubMLST database, as of November 2021. A GrapeTree visualization of the MST obtained with the INNUENDO-like-PubMLST pipeline is shown. Nodes (i.e., samples) are collapsed at the threshold with highest congruence with CC (839 ADs for this pipeline) and colored according to the ST classification. b Number of partitions obtained by each pipeline at each possible distance threshold. c Clustering stability regions determined for each pipeline. To better distinguish each region (represented by separated rectangle blocks), different blocks are vertically phased, starting in a different line. Distance thresholds (x axis) are presented in log2 scale. d Barplot (top) with the number of samples of the top represented CCs (≥50 samples) in *C. jejuni* dataset, with a swarmplot (bottom) indicating the AD threshold at which each pipeline clusters together all samples of each CC. Source data are provided as a Source Data file.

**Fig. 9. Cluster congruence at all threshold levels and overlap in detecting outbreak signals for *C. jejuni*.**
a Heatmap with the CS of two pipelines (details on each pairwise comparison are in Supplementary Data 28, with Bionumerics vs. INNUENDO-like-INNUENDO99 using the HC algorithm being presented here as an example). The inverted dendrogram (i.e., from the highest to the lowest resolution) and dashed red lines illustrate how the congruence is related to the dataset’s phylogenetic structure (dendrogram obtained with INNUENDO-like-INNUENDO99 and visualized in auspice.us). b Zoom-in on the high-resolution level highlighted in orange in (a). c Bi-directional corresponding points (gray lines) connecting thresholds providing similar clustering results in the two pipelines exemplified in (a). d Illustrative linear trend lines expected for the corresponding points with a slope deviation of 10% and 20% to be used as scale reference for the boxplots. The boxplot presents the slope distribution for allele vs. allele (orange, n = 104) pipeline comparisons for the linear trend lines with r² ≥ 0.99, illustrated in Supplementary Data 28 and detailed in Supplementary Data 32 (“n” refers to the number of comparisons with r² ≥ 0.99 over the total number of comparisons). The boxplots of the SNP vs. SNP and allele vs. SNP scenarios are not presented due to the low number of comparisons with r² ≥ 0.99 (Supplementary Data 32). e Density of the distance thresholds required for the identification of clusters detected by at least one allele-based pipeline at 4 ADs. Only clusters having the same composition in all allele-based pipelines were included (n = 430). f Distribution of the difference between the minimum and maximum AD threshold needed to detect the same clusters across allele-based pipelines, using the clusters of (e) (n = 430). g Overlap between the genetic clusters detected at 4 ADs. h Overlap between the genetic clusters detected by one pipeline at 4 ADs and those detected by the others at ≤4 ADs.
Boxplots in (d) and (f) show the interquartile range and median, and whiskers extend 1.5 times the range, with outliers plotted separately. Source data are provided as a Source Data file.

Grants and funding

WT_/Wellcome Trust/United Kingdom
