Bioinformatics. 2022 Jun 24;38(Suppl 1):i36-i44.
doi: 10.1093/bioinformatics/btac238.

CALDERA: finding all significant de Bruijn subgraphs for bacterial GWAS

Hector Roux de Bézieux et al. Bioinformatics.

Abstract

Motivation: Genome-wide association studies (GWAS), which aim to find genetic variants associated with a trait, have been widely used on bacteria to identify genetic determinants of drug resistance or hypervirulence. Recent bacterial GWAS methods usually rely on k-mers, whose presence in a genome can denote variants ranging from single-nucleotide polymorphisms to mobile genetic elements. This approach does not require a reference genome, making it easier to account for accessory genes. However, the same gene can exist in slightly different versions across different strains, leading to diluted effects.

Results: Here, we overcome this issue by testing covariates built from closed connected subgraphs (CCSs) of the de Bruijn graph defined over genomic k-mers. These covariates capture polymorphic genes as a single entity, improving k-mer-based GWAS in terms of both power and interpretability. However, a method naively testing all possible subgraphs would be powerless due to multiple testing corrections, and the mere exploration of these subgraphs would quickly become computationally intractable. The concept of testable hypotheses has successfully been used to address both problems in similar contexts. We leverage this concept to test all CCSs by proposing a novel enumeration scheme for these objects that fully exploits the pruning opportunity offered by testability, resulting in drastic improvements in computational efficiency. Our method integrates with existing visual tools to facilitate interpretation.
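To make the notion of a testable hypothesis concrete, the following is a minimal sketch of Tarone-style testability under stated assumptions. The abstract does not say which test statistic the method uses; a one-sided Fisher's exact test (enrichment among cases) is assumed here purely for illustration, and the function names `min_attainable_pvalue` and `tarone_threshold` are hypothetical, not part of the authors' software. The idea is that a binary covariate present in x of n samples has a smallest p-value it could ever reach; covariates whose smallest attainable p-value exceeds the corrected threshold can never be significant, so they need not count toward the multiple testing correction.

```python
from math import comb


def min_attainable_pvalue(x: int, n1: int, n: int) -> float:
    """Smallest one-sided Fisher p-value (assumed test) for a covariate
    present in x of n samples, of which n1 are cases."""
    # Most extreme table: as many carriers as possible are cases.
    a = min(x, n1)
    return comb(n1, a) * comb(n - n1, x - a) / comb(n, x)


def tarone_threshold(min_pvalues, alpha=0.05):
    """Return (threshold, k): the smallest k such that at most k hypotheses
    are testable at level alpha / k."""
    k = 1
    while sum(p <= alpha / k for p in min_pvalues) > k:
        k += 1
    return alpha / k, k


if __name__ == "__main__":
    # Hypothetical toy setting: 20 samples, 10 cases, and one covariate
    # for every possible frequency x = 1..20.
    n, n1 = 20, 10
    pvals = [min_attainable_pvalue(x, n1, n) for x in range(1, n + 1)]
    threshold, k = tarone_threshold(pvals)
    print(f"Bonferroni divides alpha by {len(pvals)}, Tarone by {k}")
```

In this toy setting the correction factor drops below the total number of covariates because very rare and very common covariates can never reach significance. The same bound is what allows an enumeration to prune: once a subgraph cannot be testable, it can be discarded without being tested.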

Availability and implementation: We provide an implementation of our method, as well as code to reproduce all results at https://github.com/HectorRDB/Caldera_ISMB.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Example of de Bruijn graphs. (a) A general example with two genes, each with some variability, resulting in a mostly linear sequence only at the coarse level. More details in Section 4. (b) A simpler setting with two samples and four nodes, leading to three CCSs: {v1}, {v2} and {v0,v1,v2,v3}.
Fig. 2.
A short illustration of CALDERA’s exploration on a simple graph with four nodes and four samples. Nodes v1, v2 and v3 are linked to node v4. To construct the CCS {v1,v2,v3,v4} from {v3,v4}, we can either first add v1 (and thus construct {v1,v3,v4}) and then add v2, or directly add v2, in which case we get {v1,v2,v3,v4} by closure. To avoid enumerating {v1,v2,v3,v4} twice, we therefore need local itemtables.
Fig. 3.
Results of CALDERA. (a) Run times for CALDERA and COIN+LAMP on graphs with various values of p. In this setting, n = 100. (b) Proportion of all unitigs associated with the resistant phenotype that are found to be significant by CALDERA, the LAMP2 procedure on all unitigs and DBGWAS, as the value of α changes.
Fig. 4.
Akkermansia dataset: Tubulin/FtsZ_GTPase gene. Screenshot from the output of CALDERA. We select the first component, which contains the most significant CCS. (a) Unitigs belonging to the most significant CCS are colored in green (darker gray in the black and white version of the document). Other unitigs linked to the CCS in the DBG are colored in gray. Size denotes overall frequency, while a black contour denotes that the sequence of the unitig has a match in the database, here RefSeq (Tatusova et al., 2016). (b) All significant hits on that database are listed in a panel, usually on top of the subgraph. (c) Users can click on a node to display information about it, such as its frequency, the pattern of the associated CCS, or any match to the database, or right-click to select all nodes from the same component. (A color version of this figure appears in the online version of this article.)
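The three CCSs listed in the caption of Fig. 1(b) can be reproduced with a brute-force sketch, under assumptions that this page does not state and that are therefore hypothetical: the four nodes form a bubble (v0 linked to v1 and v2, both linked to v3), one sample contains v0, v1, v3 and the other contains v0, v2, v3, the pattern of a subgraph is the set of samples containing at least one of its nodes, and a connected subgraph is closed when no neighboring node can be added without changing that pattern. The exhaustive search below is only practical for tiny graphs and is not the enumeration scheme proposed in the paper.

```python
from itertools import combinations

# Assumed bubble topology and sample content (not given on this page).
EDGES = [("v0", "v1"), ("v0", "v2"), ("v1", "v3"), ("v2", "v3")]
SAMPLES = {"v0": {"s1", "s2"}, "v1": {"s1"}, "v2": {"s2"}, "v3": {"s1", "s2"}}


def neighbors(node):
    """Nodes sharing an edge with `node`."""
    return {b if a == node else a for a, b in EDGES if node in (a, b)}


def pattern(nodes):
    """Samples that contain at least one node of the subgraph."""
    return frozenset().union(*(SAMPLES[v] for v in nodes))


def is_connected(nodes):
    """Depth-first search restricted to `nodes`."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for w in neighbors(stack.pop()) & nodes:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == nodes


def is_closed(nodes):
    """True if adding any neighboring node would change the pattern."""
    nodes = set(nodes)
    border = set().union(*(neighbors(v) for v in nodes)) - nodes
    return all(pattern(nodes | {w}) != pattern(nodes) for w in border)


def closed_connected_subgraphs():
    """Exhaustively list all CCSs, smallest first."""
    all_nodes = sorted(SAMPLES)
    return [
        set(c)
        for r in range(1, len(all_nodes) + 1)
        for c in combinations(all_nodes, r)
        if is_connected(c) and is_closed(c)
    ]


if __name__ == "__main__":
    for ccs in closed_connected_subgraphs():
        print(sorted(ccs), sorted(pattern(ccs)))
```

Under these assumptions the search returns {v1}, {v2} and {v0,v1,v2,v3}, matching the caption: v0 and v3 alone are not closed because adding a neighbor leaves their pattern unchanged.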

References

    1. Avis D., Fukuda K. (1996) Reverse search for enumeration. Discrete Appl. Math., 65, 21–46.
    2. Bonferroni C. (1936) Teoria Statistica Delle Classi e Calcolo Delle Probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, Vol. 8. Seeber, pp. 3–62.
    3. de Bruijn N. (1946) A combinatorial problem. Proc. Sect. Sci. K. Ned. Akad. Wet. Amst., 49, 758–764.
    4. Drouin A. et al. (2016) Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons. BMC Genomics, 17, 1–15.
    5. Earle S.G. et al. (2016) Identifying lineage effects when controlling for population structure improves power in bacterial association studies. Nat. Microbiol., 1, 16041.