CALDERA: finding all significant de Bruijn subgraphs for bacterial GWAS

Hector Roux de Bézieux¹, Leandro Lima², Fanny Perraudeau¹, Arnaud Mary³, Sandrine Dudoit⁴, Laurent Jacob³

Affiliations

¹ Pendulum Therapeutics, Inc., San Francisco, CA 94107, USA.
² European Bioinformatics Institute, Cambridge CB10 1SD, UK.
³ Univ. Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR 5558, Villeurbanne 69100, France.
⁴ Division of Biostatistics, Department of Statistics, University of California, Berkeley, CA 94704, USA.

PMID: 35758804
PMCID: PMC9235473
DOI: 10.1093/bioinformatics/btac238

Hector Roux de Bézieux et al. Bioinformatics. 2022 Jun 24;38(Suppl 1):i36-i44. doi: 10.1093/bioinformatics/btac238.

Abstract

Motivation: Genome-wide association studies (GWAS), which aim to find genetic variants associated with a trait, have been widely used on bacteria to identify genetic determinants of drug resistance or hypervirulence. Recent bacterial GWAS methods usually rely on k-mers, whose presence in a genome can denote variants ranging from single-nucleotide polymorphisms to mobile genetic elements. This approach does not require a reference genome, making it easier to account for accessory genes. However, the same gene can exist in slightly different versions across strains, so its association signal is diluted across several k-mers.
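
To make the k-mer representation concrete, here is a minimal Python sketch of a k-mer presence/absence matrix. The strain names, sequences, and the helper names `kmers` and `presence_matrix` are hypothetical and chosen only for illustration. The sketch also ignores reverse complements, which real tools would typically handle.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_matrix(genomes, k):
    """Map each k-mer to a 0/1 list with one entry per genome (in dict order)."""
    per_sample = {name: kmers(seq, k) for name, seq in genomes.items()}
    all_kmers = sorted(set().union(*per_sample.values()))
    return {km: [int(km in per_sample[name]) for name in genomes]
            for km in all_kmers}


# Hypothetical toy genomes differing by a single nucleotide.
genomes = {"strain_A": "ACGTAC", "strain_B": "ACGAAC"}
matrix = presence_matrix(genomes, 3)
```

In this toy input, the single substitution produces three k-mers specific to each strain, which gives a small-scale picture of how one variant spreads over several k-mer covariates.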

Results: Here, we overcome this issue by testing covariates built from closed connected subgraphs (CCSs) of the de Bruijn graph defined over genomic k-mers. These covariates capture polymorphic genes as a single entity, improving k-mer-based GWAS in terms of both power and interpretability. However, a method naively testing all possible subgraphs would be powerless due to multiple testing corrections, and the mere exploration of these subgraphs would quickly become computationally intractable. The concept of testable hypotheses has been used successfully to address both problems in similar contexts. We leverage this concept to test all CCSs by proposing a novel enumeration scheme for these objects that fully exploits the pruning opportunity offered by testability, resulting in drastic improvements in computational efficiency. Our method integrates with existing visual tools to facilitate interpretation.
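
The idea of testability can be illustrated with a short Python sketch. It assumes a one-sided Fisher's exact test on a binary covariate, which is a common setting for this concept but not necessarily the test used in the paper. The function names `min_pvalue` and `is_testable` are hypothetical. A covariate present in `x` of `n` samples, `n1` of which are cases, has a smallest attainable p-value that depends only on `x`, `n` and `n1`. If that lower bound exceeds the significance threshold, the covariate can never be significant and can be skipped without being counted in the multiple testing correction.

```python
from math import comb


def min_pvalue(x, n, n1):
    """Smallest one-sided Fisher p-value attainable by a covariate
    present in x of n samples, when n1 samples are cases."""
    n0 = n - n1
    a_hi = min(x, n1)      # as many carriers as possible among the cases
    a_lo = max(0, x - n0)  # as few carriers as possible among the cases
    total = comb(n, x)
    # At either extreme, the one-sided tail is a single hypergeometric term.
    p_hi = comb(n1, a_hi) * comb(n0, x - a_hi) / total
    p_lo = comb(n1, a_lo) * comb(n0, x - a_lo) / total
    return min(p_hi, p_lo)


def is_testable(x, n, n1, threshold):
    """True if the covariate could reach significance at this threshold."""
    return min_pvalue(x, n, n1) <= threshold
```

For example, with 100 samples and 30 cases, a covariate carried by a single sample has a minimal p-value of 0.3 and is untestable at any usual threshold, whereas one carried by 20 samples is testable at 0.001.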

Availability and implementation: We provide an implementation of our method, as well as code to reproduce all results at https://github.com/HectorRDB/Caldera_ISMB.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Example of de Bruijn graphs. (a) A general example with two genes, each with some variability, resulting in a mostly linear sequence only at the coarse level. More details in Section 4. (b) A simpler setting with two samples and four nodes, leading to three CCSs: $\{v_{1}\}$, $\{v_{2}\}$ and $\{v_{0}, v_{1}, v_{2}, v_{3}\}$
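
The setting of panel (b) can be reproduced with a brute-force Python sketch. Several things here are assumptions rather than statements from this record: the graph is taken to be a bubble (v₀ linked to v₁ and v₂, both linked to v₃), v₁ is taken to occur only in the first sample and v₂ only in the second, the pattern of a subgraph is taken to be the sample-wise OR of its nodes, and a connected subgraph is called closed when adding any neighbouring node changes its pattern. The exhaustive enumeration is only for illustration and is not the enumeration scheme proposed in the paper.

```python
from itertools import combinations

# Assumed bubble topology: v0 - {v1, v2} - v3 (undirected adjacency).
edges = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
# Assumed presence of each node in the two samples.
presence = {0: (1, 1), 1: (1, 0), 2: (0, 1), 3: (1, 1)}


def pattern(nodes):
    """Sample-wise OR of the presence vectors of the given nodes."""
    return tuple(max(col) for col in zip(*(presence[v] for v in nodes)))


def is_connected(nodes):
    """Depth-first search restricted to the given node set."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in edges[u] & nodes:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == nodes


def is_closed(nodes):
    """True if adding any neighbouring node changes the pattern."""
    nodes = set(nodes)
    neighbours = set().union(*(edges[v] for v in nodes)) - nodes
    p = pattern(nodes)
    return all(pattern(nodes | {w}) != p for w in neighbours)


def closed_connected_subgraphs():
    """Enumerate every node subset and keep the closed connected ones."""
    found = []
    for size in range(1, len(edges) + 1):
        for subset in combinations(sorted(edges), size):
            if is_connected(subset) and is_closed(subset):
                found.append(set(subset))
    return found
```

Under these assumptions the function returns the three subgraphs named in the caption: the two single-node variants and the whole graph.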

**Fig. 2.**
A short illustration of CALDERA’s exploration. A simple graph with four nodes and four samples. Nodes v₁, v₂ and v₃ are linked to node v₄. To construct the CCS $\{v_{1}, v_{2}, v_{3}, v_{4}\}$ from $\{v_{3}, v_{4}\}$, we can either first add v₁ (and thus construct $\{v_{1}, v_{3}, v_{4}\}$) and then add v₂, or directly add v₂, in which case we get $\{v_{1}, v_{2}, v_{3}, v_{4}\}$ by closure. To avoid enumerating $\{v_{1}, v_{2}, v_{3}, v_{4}\}$ twice, we therefore need local itemtables
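
The duplicate described in this caption can be reproduced with a small Python sketch. The star topology follows the caption, but the presence patterns over the four samples are hypothetical values chosen so that the two routes behave as described, and the pattern of a subgraph is assumed to be the sample-wise OR of its nodes. The plain `seen` set stands in for any deduplication mechanism and does not reproduce the itemtables mentioned in the caption.

```python
# Star graph from the caption: v1, v2 and v3 are each linked to v4.
edges = {1: {4}, 2: {4}, 3: {4}, 4: {1, 2, 3}}
# Hypothetical presence patterns over four samples.
presence = {1: (0, 0, 1, 0), 2: (0, 0, 1, 1),
            3: (0, 1, 0, 0), 4: (1, 0, 0, 0)}


def pattern(nodes):
    """Sample-wise OR of the presence vectors of the given nodes."""
    return tuple(max(col) for col in zip(*(presence[v] for v in nodes)))


def closure(nodes):
    """Add neighbours that leave the pattern unchanged until none remain."""
    nodes = set(nodes)
    changed = True
    while changed:
        changed = False
        neighbours = set().union(*(edges[v] for v in nodes)) - nodes
        for w in neighbours:
            if pattern(nodes | {w}) == pattern(nodes):
                nodes.add(w)
                changed = True
    return frozenset(nodes)


start = closure({3, 4})                         # {v3, v4} is already closed
step = closure(start | {1})                     # route 1: add v1 first
via_v1 = closure(step | {2})                    # ... then add v2
direct = closure(start | {2})                   # route 2: add v2, v1 joins by closure

seen = set()
duplicates = 0
for ccs in (via_v1, direct):
    if ccs in seen:
        duplicates += 1                         # same CCS reached a second time
    seen.add(ccs)
```

Both routes end at the full node set, so a naive exploration would report it twice unless already-visited subgraphs are tracked in some way.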

**Fig. 3.**
Results of CALDERA. (a) Run times for CALDERA and COIN+LAMP on graphs with various values of p. In this setting, n = 100. (b) Proportion of all unitigs associated with the resistant phenotype that are found to be significant by CALDERA, the LAMP2 procedure on all unitigs and DBGWAS, as the value of α changes

**Fig. 4.**
Akkermansia dataset: Tubulin/FtsZ_GTPase gene. Screenshot from the output of CALDERA. We select the first component, which is the one that contains the most significant CCS. (a) Unitigs belonging to the most significant CCS are colored in green (darker gray in the black and white version of the document). Other unitigs linked to the CCS in the DBG are colored in gray. Size denotes overall frequency, while a black contour denotes that the sequence of the unitig has a match in the database, here *RefSeq* (Tatusova *et al.*, 2016). (b) All significant hits on that database are listed in a panel, usually on top of the subgraph. (c) Users can click on nodes to display information, or right-click to select all nodes from the same component. The displayed information covers the node itself, such as its frequency, the pattern of the associated CCS, or any match to the database (A color version of this figure appears in the online version of this article)
