Review
Genome Biol. 2025 Sep 27;26(1):313.
doi: 10.1186/s13059-025-03775-4.

Machine learning and statistical inference in microbial population genomics

Samuel K Sheppard et al. Genome Biol.

Abstract

The availability of large genome datasets has changed the microbiology research landscape. Analyzing such data is computationally demanding, and new approaches have come from different data analysis philosophies. Machine learning and statistical inference have overlapping knowledge discovery aims and approaches. However, machine learning focuses on optimizing prediction, whereas statistical inference focuses on understanding the processes relating variables. In this review, we outline the different aspirations, precepts, and resulting methodologies, with examples from microbial genomics. Emphasizing complementarity, we argue that the combination and synthesis of machine learning and statistics has potential for pathogen research in the big data era.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Modelling and algorithmic approaches. Big data drawn from an example population that describes the objects from which data are randomly sampled (hand). This contains features, otherwise known as independent variables, predictors or regressors, and outcomes, otherwise known as dependent variables, labels, classes, or targets, whereby changes in the features lead to changes in the outcomes. Relating the two is the data generating process, or “nature”. Statistics (or, more precisely, data modelling in Breiman’s dichotomy [12]) aims to understand the underlying processes while ML (or, more precisely, algorithmic modelling in Breiman’s dichotomy) aims to faithfully reproduce the observed patterns to achieve optimal prediction, for instance
Fig. 2
Machine learning workflow in classification tasks. The data is split into training and testing, after which a suitable general-purpose algorithm is chosen, its hyper-parameters tuned and fitted to the training data. The performance of the fitted classifier is subsequently measured using a metric of choice
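
The workflow in the Fig. 2 caption can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the data are synthetic, k-nearest neighbours stands in for the "general-purpose algorithm", the number of neighbours k is the only tuned hyper-parameter, and accuracy is the "metric of choice". None of these choices, nor the helper names `make_data`, `knn_predict`, and `accuracy`, come from the review.

```python
import random


def make_data(n, rng):
    """Synthetic two-class data (an assumption): two features per sample,
    drawn around 0 for class 0 and around 2 for class 1."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 * label
        features = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((features, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training samples nearest to x."""
    def squared_distance(row):
        f = row[0]
        return (f[0] - x[0]) ** 2 + (f[1] - x[1]) ** 2

    nearest = sorted(train, key=squared_distance)[:k]
    votes_for_one = sum(label for _, label in nearest)
    return 1 if 2 * votes_for_one > k else 0


def accuracy(train, evaluation, k):
    """Fraction of evaluation samples whose label is predicted correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in evaluation)
    return correct / len(evaluation)


rng = random.Random(0)
data = make_data(300, rng)
rng.shuffle(data)

# Step 1: split the data into training and testing sets.
train, test = data[:240], data[240:]

# Step 2: tune the hyper-parameter k using a validation subset carved
# out of the training data, so the test set stays untouched.
fit_part, val_part = train[:180], train[180:]
candidate_k = [1, 3, 5, 7, 9]  # odd values avoid tied votes
best_k = max(candidate_k, key=lambda k: accuracy(fit_part, val_part, k))

# Step 3: "fit" on the full training set (k-NN simply stores it).
# Step 4: measure performance once on the held-out test set.
test_accuracy = accuracy(train, test, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```

Keeping the test set out of the tuning step is the design point here: the reported metric then estimates performance on unseen data rather than on data that influenced the choice of k.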

References

    1. Blackwell GA, Hunt M, Malone KM, Lima L, Horesh G, Alako BTF, et al. Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLoS Biol. 2021;19:e3001421 (Hanage WP, editor.).
    2. Wong ZSY, Zhou J, Zhang Q. Artificial intelligence for infectious disease big data analytics. Infect Dis Health. 2019;24:44–8.
    3. Ow GS, Tang Z, Kuznetsov VA. Big data and computational biology strategy for personalized prognosis. Oncotarget. 2016;7:40200–20.
    4. Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, et al. On the Opportunities and Risks of Foundation Models. arXiv; 2021. Available from: https://arxiv.org/abs/2108.07258. [cited 2025 Sept 2].
    5. Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630:493–500.
