Review
Trends Microbiol. 2021 Jul;29(7):621-633.
doi: 10.1016/j.tim.2020.12.002. Epub 2021 Jan 14.

Forest and Trees: Exploring Bacterial Virulence with Genome-wide Association Studies and Machine Learning

Jonathan P Allen et al. Trends Microbiol. 2021 Jul.

Abstract

The advent of inexpensive and rapid sequencing technologies has allowed bacterial whole-genome sequences to be generated at an unprecedented pace. This wealth of information has revealed an unanticipated degree of strain-to-strain genetic diversity within many bacterial species. Awareness of this genetic heterogeneity has corresponded with a greater appreciation of intraspecies variation in virulence. A number of comparative genomic strategies have been developed to link these genotypic and pathogenic differences with the aim of discovering novel virulence factors. Here, we review recent advances in comparative genomic approaches to identify bacterial virulence determinants, with a focus on genome-wide association studies and machine learning.

Keywords: bacteria; genomics; virulence.

Figures

Figure 1. Key Figure.
General approach for the identification of virulence genes using comparative genomic strategies. (I) A large collection of isolates of a particular pathogen must be assembled. (II) The virulence of each isolate within the collection is determined from observations of natural infections or by using infection models in the laboratory. (III) In parallel, whole-genome sequences are obtained for each isolate and (IV) genetic differences (SNPs, indels, or genes) are mapped for the entire collection (depicted by red and green points on the DNA helix). (V) The association of sequence variants with virulence can be determined bioinformatically while accounting for confounding effects to limit spurious associations (depicted by the enlarged DNA helix with a red point). (VI) Virulence-associated genetic elements can be validated with an independent cohort of bacterial isolates or by genetic testing to confirm loss or gain of virulence.
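As a rough illustration of step (V), the sketch below tests whether one variant is over-represented among virulent isolates using a one-sided Fisher's exact test. The counts are invented for illustration, the function name `fisher_one_sided` is hypothetical, and the test deliberately ignores confounding by population structure, which the caption notes must be accounted for in a real analysis.

```python
from math import comb

def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test on a 2x2 table.

    a = variant present, virulent      b = variant present, avirulent
    c = variant absent,  virulent      d = variant absent,  avirulent

    Returns the probability of seeing at least `a` virulent isolates among
    the variant carriers if variant and virulence were independent.
    """
    carriers, non_carriers, virulent = a + b, c + d, a + c
    total_tables = comb(carriers + non_carriers, virulent)
    upper = min(carriers, virulent)
    favourable = sum(
        comb(carriers, x) * comb(non_carriers, virulent - x)
        for x in range(a, upper + 1)
    )
    return favourable / total_tables

# Invented counts: the variant is found in 8 of 10 virulent isolates
# and in 1 of 10 avirulent isolates.
p_value = fisher_one_sided(8, 1, 2, 9)
print(f"one-sided p = {p_value:.4f}")
```

In a genome-wide setting the same test would be repeated for every variant, so the resulting p-values would also need correction for multiple testing.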
Figure 2.
Use of k-mers to define genetic differences. (A) The genomes of two separate isolates (represented as colored lines) are depicted aligned to a reference genome (black line). Mapping sequencing reads to a reference genome can only identify polymorphisms from aligned regions and excludes those regions unique to test strains. These limitations can be overcome with alignment-free methods using k-mers. (B) k-mers represent the complete and overlapping set of sequences of k nucleotides in length found in a specified genome. In this example, a sequencing read is broken down into overlapping k-mers of length k = 4. (C) Two identical genomes contain identical sets of k-mers, but a single SNP would generate slightly different sets of k-mers for each genome. Here, a genetic locus is shown in which Isolates 1 and 2 differ by a single SNP. Representative isolate-specific k-mers caused by this SNP are shown for the condition in which k = 4. (D) In comparative genomic analyses, complete sets of k-mers from each isolate are generated, and (I) k-mers associated with a test group (red) or control group (blue) are identified while k-mers found in both test and control groups (grey) can be excluded. (II) Overlapping k-mers can be assembled into larger contigs to facilitate annotation. (III) Polymorphisms associated with virulence, including the presence of entire genes and specific SNPs within a gene, can then be identified and further characterized.
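Panels (B) and (C) can be sketched in a few lines: a sequence is broken into overlapping k-mers, and two loci that differ by one SNP are compared by their k-mer sets. The sequences below are invented toy loci, not data from the article.

```python
def kmers(sequence: str, k: int) -> set[str]:
    """Return the set of all overlapping substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

# Invented toy loci that differ by a single SNP (G vs T at the sixth base).
isolate_1 = "ACGTGGCTA"
isolate_2 = "ACGTGTCTA"

k = 4
kmers_1 = kmers(isolate_1, k)
kmers_2 = kmers(isolate_2, k)

shared = kmers_1 & kmers_2   # k-mers found in both isolates
only_1 = kmers_1 - kmers_2   # k-mers specific to isolate 1
only_2 = kmers_2 - kmers_1   # k-mers specific to isolate 2

print("shared:", sorted(shared))
print("isolate 1 only:", sorted(only_1))
print("isolate 2 only:", sorted(only_2))
```

Because every k-mer that overlaps the SNP position differs between the two isolates, a single SNP away from the sequence ends yields k isolate-specific k-mers per isolate (four each here).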
Figure 3.
Confounding due to population stratification can result in spurious genotype-phenotype associations. In this example, alleles in three individual genes (squares in a row) are differentially colored. Without recombination, fixed, non-causal genetic variants (green) will be passed on to bacterial descendants and be in linkage disequilibrium with other causal mutations that occur in that lineage (red). (i) Homoplasy counting methods, such as phyC, search for causal associations occurring in separate lineages (red) and exclude lineage-specific non-causal events (green).
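The intuition behind homoplasy counting can be shown with a deliberately crude sketch: count the number of distinct lineages in which each variant appears. This is not the phyC algorithm, which works on a phylogeny; the lineage labels, variant names, and the function `lineages_per_variant` are all assumptions made for illustration.

```python
from collections import defaultdict

# Invented isolates: (lineage label, variants carried).
# "green" is fixed in lineage A only; "red" has arisen in three lineages.
isolates = [
    ("A", {"green", "red"}),
    ("A", {"green", "red"}),
    ("A", {"green"}),
    ("B", {"red"}),
    ("B", set()),
    ("C", {"red"}),
    ("C", set()),
]

def lineages_per_variant(records):
    """Count the distinct lineages in which each variant is observed."""
    seen = defaultdict(set)
    for lineage, variants in records:
        for variant in variants:
            seen[variant].add(lineage)
    return {variant: len(lineages) for variant, lineages in seen.items()}

counts = lineages_per_variant(isolates)
print(counts)
```

A variant seen in several separate lineages (red) is a better candidate for a causal association than one confined to a single lineage (green), whose link to the phenotype may simply reflect shared ancestry.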
Figure 4.
Previously acquired data (I) can be split into separate datasets used for (II) ‘training’ a predictive model and (V) ‘testing’ its performance. Both features (e.g., SNPs) and labels (e.g., virulence) from the training set are considered by a learning algorithm (III) to generate a predictive model (IV) that best fits the virulence labels. Performance of the model (VI) is then assessed by using SNPs from the ‘test’ dataset (V) to predict strain virulence and comparing the predictions to the true labels for each strain. The model can then be used to predict the virulence of other sequenced strains based on their genomes (VII). As more data become available, this process can be repeated and models refined with the goal of improving performance. SNPs, single-nucleotide polymorphisms.
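The workflow in this caption can be sketched with a toy learner. The example below uses a single-SNP "decision stump" as a stand-in for a real learning algorithm, and synthetic data in which virulence is fully determined by one SNP; the data, the split size, and the functions `predict` and `train_stump` are assumptions for illustration, not the methods reviewed in the article.

```python
import itertools
import random

# (I) Toy data: every combination of 5 binary SNPs, with virulence set
# entirely by the SNP at index 2. Real data would be far noisier.
features = [list(bits) for bits in itertools.product((0, 1), repeat=5)]
labels = [row[2] for row in features]

# (II), (V) Shuffle the isolates and split them into training and test sets.
order = list(range(len(features)))
random.Random(0).shuffle(order)
train_idx, test_idx = order[:24], order[24:]

def predict(model, row):
    """Predict virulence (1) or not (0) from one SNP and its polarity."""
    snp, polarity = model
    return row[snp] if polarity == 1 else 1 - row[snp]

def train_stump(rows, row_labels):
    """(III)-(IV) Choose the SNP and polarity that best fit the labels."""
    best_model, best_correct = None, -1
    for snp in range(len(rows[0])):
        for polarity in (0, 1):
            correct = sum(
                predict((snp, polarity), row) == label
                for row, label in zip(rows, row_labels)
            )
            if correct > best_correct:
                best_model, best_correct = (snp, polarity), correct
    return best_model

model = train_stump(
    [features[i] for i in train_idx],
    [labels[i] for i in train_idx],
)

# (VI) Assess performance on isolates the model has never seen.
hits = sum(predict(model, features[i]) == labels[i] for i in test_idx)
accuracy = hits / len(test_idx)
print("model (SNP index, polarity):", model, "test accuracy:", accuracy)
```

On this noise-free toy data the stump recovers the SNP at index 2 and scores perfectly on the test set; with real isolates, test accuracy below training accuracy would signal overfitting.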
