. 2020 Sep 14;21(1):402.

doi: 10.1186/s12859-020-03740-x.

Accurate determination of node and arc multiplicities in de bruijn graphs using conditional random fields

Aranka Steyaert¹, Pieter Audenaert¹, Jan Fostier²

Affiliations

¹ Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium.
² Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium. jan.fostier@ugent.be.

PMID: 32928110
PMCID: PMC7491180
DOI: 10.1186/s12859-020-03740-x

Accurate determination of node and arc multiplicities in de bruijn graphs using conditional random fields

Aranka Steyaert et al. BMC Bioinformatics. 2020.

. 2020 Sep 14;21(1):402.

doi: 10.1186/s12859-020-03740-x.

Authors

Aranka Steyaert¹, Pieter Audenaert¹, Jan Fostier²

Affiliations

¹ Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium.
² Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium. jan.fostier@ugent.be.

PMID: 32928110
PMCID: PMC7491180
DOI: 10.1186/s12859-020-03740-x

Abstract

Background: De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence, also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. k+1-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage, however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data.

Results: To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner.

Conclusions: We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F₁ score than existing methods. A C++11 implementation is available at https://github.com/biointec/detox under the GNU AGPL v3.0 license.

Keywords: De Bruijn graphs; Next-generation sequencing; Probabilistic graphical models.

PubMed Disclaimer

Conflict of interest statement

A patent application has been submitted by imec vzw based on this methodology.

Figures

**Fig. 1**
Illustration of the issues when assigning multiplicities based only on a k-mer histogram based cutoff. a A k-mer histogram that consists of a mixture of negative binomial distributions. Each component models the k-mer coverage variability for a particular multiplicity. The two-sided arrows delineate the coverage intervals corresponding to a multiplicity estimate. Note how large areas under the curve of a particular multiplicity fall in an interval of a different estimated multiplicity. b Example genome sequence with corresponding reads and de Bruijn graph. Nodes and arcs are labelled with their read coverage. Nodes are also labelled with their corresponding fragment in the genome sequence. Sequencing errors cause spurious nodes and arcs in the de Bruijn graph, such as node e^′ and j. Nodes are encircled according to their true multiplicity (cf. the patterns in Fig. 1a), all correct arcs have true multiplicity 1. c Most likely multiplicity assignment to nodes and arcs based on the k-mer histogram in Fig. 1a. This assignment leads to inconsistencies: nodes where there is a conservation of flow of multiplicity are marked with ✓, nodes where this is violated are marked with ✗. d Multiplicity assignment such that conservation of flow of multiplicity holds in each node. These assignments are correct for all nodes and arcs and reveal sequencing errors (nodes/arcs with multiplicity zero) and the repeat structure of the genome sequence

**Fig. 2**
De Bruijn graph of Fig. 1b. In the de Bruijn graph a neighbourhood of size 1 around node r₄ was selected and the CRF corresponding to this neighbourhood is shown. Nodes Y_n in the CRF correspond to nodes n in the de Bruijn graph, while the Y_a-nodes correspond to the arcs in the de Bruijn graph. Here, each node Y is connected to a node X by an edge arising from a singleton factor φ which contains the information about the coverage of that node or arc in the de Bruijn graph. Note, however, that a node that contains fewer than 2kk-mers will not have such corresponding X and factor φ. All connections between Y-nodes arise from conservation of flow factors. The two cliques arising from the flow through r₄ are shown explicitly

**Fig. 3**
Subgraph of the de Bruijn graph built from the real P. aeruginosa dataset (30×). Nodes and arcs are labelled with their (average) coverage. The fitted node and arc models are shown on the left. The nodes of the graph are coloured according to their true multiplicity. The neighbourhoods of size 1, 2 and 3 surrounding node n₁ are shown as coloured ellipses. The inset table shows the categorical distribution $P (Y_{n_{1}})$ of the multiplicities {0,1,2} for node n₁ for different neighbourhood sizes. For small neighbourhoods (sizes 0 and 1), node n₁ (incorrectly) appears to represent a sequencing error due to its relatively low coverage. At neighbourhood sizes 2 and 3, the CRF model has enough information to (correctly) infer multiplicity 1 for node n₁ with high probability

**Fig. 4**
Part of the q-mer histogram of the real *P. aeruginosa* (25×) dataset. The red line shows the optimal coverage cutoff value. The purple shaded area represents nodes with a coverage below this cutoff value that have true multiplicity 1 whereas the green shaded area represents nodes with a coverage above the cutoff that have true multiplicity 0. These nodes represent false positives and false negatives respectively when using a cutoff-based method. In contrast, the CRF model is able to correctly classify the majority of nodes (lighter shaded purple and green areas)

See this image and copyright information in PMC

Cited by

ELLIPSIS: robust quantification of splicing in scRNA-seq.
Van Hecke M, Beerenwinkel N, Lootens T, Fostier J, Raedt R, Marchal K. Van Hecke M, et al. Bioinformatics. 2025 Feb 4;41(2):btaf028. doi: 10.1093/bioinformatics/btaf028. Bioinformatics. 2025. PMID: 39936571 Free PMC article.

References

1. Pevzner PA, Tang HX, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci. 2001;98(17):9748–53. doi: 10.1073/pnas.171285098. - DOI - PMC - PubMed
1. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KHJ, Remington KA, Anson EL, Bolanos RA, Chou HH, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai ZW, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng XQ, Rubin GM, Adams MD, Venter JC. A whole-genome assembly of Drosophila. Science. 2000;287(5461):2196–204. doi: 10.1126/science.287.5461.2196. - DOI - PubMed
1. Simpson JT, Pop M. The Theory and Practice of Genome Sequence Assembly In: Chakravarti A, Green E, editors. Annual Review of Genomics and Human Genetics, vol 16: 2015. p. 153–72. 10.1146/annurev-genom-090314-050032. - PubMed
1. Salmela L, Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics. 2014;30(24):3506–14. doi: 10.1093/bioinformatics/btu538. - DOI - PMC - PubMed
1. Miclotte G, Heydari M, Demeester P, Rombauts S, de Peer Y, Audenaert P, Fostier J. Jabba: hybrid error correction for long sequencing reads. Algoritm Mol Biol. 2016; 11. 10.1186/s13015-016-0075-7. - PMC - PubMed

MeSH terms

Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Accurate determination of node and arc multiplicities in de bruijn graphs using conditional random fields

Affiliations

Accurate determination of node and arc multiplicities in de bruijn graphs using conditional random fields

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

Similar articles

Cited by

References

MeSH terms

LinkOut - more resources

Full Text Sources

Miscellaneous