PLoS Genet. 2021 Oct 18;17(10):e1009436.
doi: 10.1371/journal.pgen.1009436. eCollection 2021 Oct.

Machine learning to predict the source of campylobacteriosis using whole genome data

Nicolas Arning et al. PLoS Genet.

Abstract

Campylobacteriosis is among the world's most common foodborne illnesses, caused predominantly by the bacterium Campylobacter jejuni. Effective interventions require determining the infection source, which is challenging because transmission occurs via multiple sources such as contaminated meat, poultry, and drinking water. Strain variation has allowed source tracking based upon allelic variation in multi-locus sequence typing (MLST) genes, so that isolates from infected individuals can be attributed to specific animal or environmental reservoirs. However, the accuracy of probabilistic attribution models has been limited by the ability to differentiate isolates based upon just 7 MLST genes. Here, we broaden the input data spectrum to include core genome MLST (cgMLST) and whole genome sequences (WGS), and implement multiple machine learning algorithms, allowing more accurate source attribution. Using the classifier we named aiSource, we increase attribution accuracy from 64% with the standard iSource population genetic approach to 71% for MLST, 85% for cgMLST and 78% for kmerized WGS data. To gain insight beyond the source model prediction, we use Bayesian inference to analyse the relative affinity of C. jejuni strains to infect humans and identify potential differences in source-human transmission ability among clonally related isolates in the most common disease-causing lineage (ST-21 clonal complex). By providing generalizable, computationally efficient methods based upon machine learning and population genetics, we offer a scalable approach to global disease surveillance that can continuously incorporate novel samples for source attribution and identify fine-scale variation in transmission potential.

Conflict of interest statement

I have read the journal’s policy and the authors of this manuscript have the following competing interests: DAC declares grants from GlaxoSmithKline and personal fees from Oxford University Innovation, Biobeats, and Sensyne Health, in areas unrelated to this work.

Figures

Fig 1
A heatmap showing classifier performance on the class-balanced (A) and imbalanced (B) test sets. The individual cells are coloured according to the average accuracy over 200 rounds of resampling with replacement, with one standard error noted next to the average accuracy. The average accuracy per classifier is shown in the rightmost column, whereas the bottom row shows the averages per data type.
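The resampling scheme described in the Fig 1 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name `bootstrap_accuracy`, the toy source labels and the fixed seed are assumptions, and the spread is reported as the standard deviation of the resampled accuracies, which is the usual bootstrap estimate of a standard error.

```python
import random
import statistics

def bootstrap_accuracy(y_true, y_pred, rounds=200, seed=0):
    """Mean accuracy and bootstrap standard error over resampled test sets."""
    rng = random.Random(seed)
    n = len(y_true)
    accuracies = []
    for _ in range(rounds):
        # Draw n test-set indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        correct = sum(y_true[i] == y_pred[i] for i in idx)
        accuracies.append(correct / n)
    # The standard deviation of the replicates estimates the standard error.
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Hypothetical labels, for illustration only.
y_true = ["chicken", "cattle", "sheep", "chicken", "wild bird", "cattle"]
y_pred = ["chicken", "cattle", "cattle", "chicken", "wild bird", "sheep"]
mean_acc, std_err = bootstrap_accuracy(y_true, y_pred)
print(f"accuracy = {mean_acc:.2f} +/- {std_err:.2f}")
```

Fixing the seed makes the 200 rounds reproducible; a real evaluation would resample the held-out predictions of each classifier and data type separately.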
Fig 2. aiSource (based on XGBoost) performance on cgMLST.
A) Misclassification matrix per source. The diagonal represents correct classification and off-diagonal fields are misclassifications. The percentages are calculated per row. B) Misclassification matrix as depicted in a flow diagram. C) Classifier performance on the unbalanced test set according to four different metrics per source population. D) Radar plot showing the classifier performance on the unbalanced test set by seven metrics averaged over 200 rounds of resampling with replacement. The variation is depicted as a shaded surface underneath the black line representing the average.
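A row-normalised misclassification matrix of the kind described in panel A might be computed along these lines. This is a sketch under stated assumptions: the helper `row_percentages` and the source labels are hypothetical, and rows are taken to be the true source and columns the predicted source.

```python
from collections import Counter

def row_percentages(y_true, y_pred):
    """Misclassification matrix with each row (true source) summing to 100%."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    matrix = {}
    for true_label in labels:
        row_total = sum(counts[(true_label, p)] for p in labels)
        matrix[true_label] = {
            # Guard against sources that never occur as a true label.
            p: (100.0 * counts[(true_label, p)] / row_total if row_total else 0.0)
            for p in labels
        }
    return matrix

# Hypothetical labels, for illustration only.
y_true = ["chicken", "chicken", "cattle", "cattle", "cattle", "cattle"]
y_pred = ["chicken", "cattle", "cattle", "cattle", "cattle", "chicken"]
matrix = row_percentages(y_true, y_pred)
for true_label, row in matrix.items():
    print(true_label, row)
```

Normalising per row means each diagonal cell reads as the share of isolates from that source that were attributed correctly, independent of how many isolates the source contributed.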
Fig 3. Source attribution per source, continent, year, generalist index and Campylobacter species.
A) Sample sizes across different factors in the imbalanced training set. B) Prediction accuracy on the full test dataset divided by different factors. C) Source attribution of the human samples, as predicted by the XGBoost model trained on the full source associated cgMLST dataset stratified into varying factors.
Fig 4. Comparison of source attribution using aiSource to previously published studies.
Fig 5
Phylogeny of clonal complex 21 of host animal associated samples (A) and bar charts showing the known source distribution and human samples (B) alongside the source distribution predicted by aiSource. The phylogeny is based on neighbour joining using the Hamming distance of the k-mers drawn from WGS. The connecting lines show the increase in frequency of the clades in human samples, and the size of the grey circles shows the posterior probability of a change in phenotypic distribution along the branches of the tree.
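The distance underlying the Fig 5 phylogeny could be sketched as below. The source does not specify how the k-mers were encoded, so this example assumes a presence/absence profile per isolate, in which case the Hamming distance between two isolates equals the number of k-mers found in exactly one of them. The function names, the toy sequences and `k=4` are hypothetical, and the neighbour-joining step itself is not shown.

```python
def kmer_set(sequence, k):
    """All distinct substrings of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def hamming_distance_matrix(sequences, k=4):
    """Pairwise Hamming distances between k-mer presence/absence profiles."""
    names = sorted(sequences)
    profiles = {name: kmer_set(sequences[name], k) for name in names}
    # For binary presence/absence vectors, the Hamming distance is the
    # size of the symmetric difference between the two k-mer sets.
    return {
        a: {b: len(profiles[a] ^ profiles[b]) for b in names}
        for a in names
    }

# Toy sequences, for illustration only.
isolates = {"iso1": "ACGTACGT", "iso2": "ACGTTCGT", "iso3": "ACGTACGT"}
distances = hamming_distance_matrix(isolates, k=4)
for name, row in distances.items():
    print(name, row)
```

A matrix of this form is the kind of input a neighbour-joining implementation expects; real WGS data would use much longer k-mers and a more compact representation than Python sets.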
