Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2020 Sep 14;21(1):402.
doi: 10.1186/s12859-020-03740-x.

Accurate determination of node and arc multiplicities in de bruijn graphs using conditional random fields

Affiliations

Accurate determination of node and arc multiplicities in de bruijn graphs using conditional random fields

Aranka Steyaert et al. BMC Bioinformatics. .

Abstract

Background: De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence, also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. k+1-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage, however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data.

Results: To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner.

Conclusions: We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than existing methods. A C++11 implementation is available at https://github.com/biointec/detox under the GNU AGPL v3.0 license.

Keywords: De Bruijn graphs; Next-generation sequencing; Probabilistic graphical models.

PubMed Disclaimer

Conflict of interest statement

A patent application has been submitted by imec vzw based on this methodology.

Figures

Fig. 1
Fig. 1
Illustration of the issues when assigning multiplicities based only on a k-mer histogram based cutoff. a A k-mer histogram that consists of a mixture of negative binomial distributions. Each component models the k-mer coverage variability for a particular multiplicity. The two-sided arrows delineate the coverage intervals corresponding to a multiplicity estimate. Note how large areas under the curve of a particular multiplicity fall in an interval of a different estimated multiplicity. b Example genome sequence with corresponding reads and de Bruijn graph. Nodes and arcs are labelled with their read coverage. Nodes are also labelled with their corresponding fragment in the genome sequence. Sequencing errors cause spurious nodes and arcs in the de Bruijn graph, such as node e and j. Nodes are encircled according to their true multiplicity (cf. the patterns in Fig. 1a), all correct arcs have true multiplicity 1. c Most likely multiplicity assignment to nodes and arcs based on the k-mer histogram in Fig. 1a. This assignment leads to inconsistencies: nodes where there is a conservation of flow of multiplicity are marked with ✓, nodes where this is violated are marked with ✗. d Multiplicity assignment such that conservation of flow of multiplicity holds in each node. These assignments are correct for all nodes and arcs and reveal sequencing errors (nodes/arcs with multiplicity zero) and the repeat structure of the genome sequence
Fig. 2
Fig. 2
De Bruijn graph of Fig. 1b. In the de Bruijn graph a neighbourhood of size 1 around node r4 was selected and the CRF corresponding to this neighbourhood is shown. Nodes Yn in the CRF correspond to nodes n in the de Bruijn graph, while the Ya-nodes correspond to the arcs in the de Bruijn graph. Here, each node Y is connected to a node X by an edge arising from a singleton factor φ which contains the information about the coverage of that node or arc in the de Bruijn graph. Note, however, that a node that contains fewer than 2kk-mers will not have such corresponding X and factor φ. All connections between Y-nodes arise from conservation of flow factors. The two cliques arising from the flow through r4 are shown explicitly
Fig. 3
Fig. 3
Subgraph of the de Bruijn graph built from the real P. aeruginosa dataset (30×). Nodes and arcs are labelled with their (average) coverage. The fitted node and arc models are shown on the left. The nodes of the graph are coloured according to their true multiplicity. The neighbourhoods of size 1, 2 and 3 surrounding node n1 are shown as coloured ellipses. The inset table shows the categorical distribution P(Yn1) of the multiplicities {0,1,2} for node n1 for different neighbourhood sizes. For small neighbourhoods (sizes 0 and 1), node n1 (incorrectly) appears to represent a sequencing error due to its relatively low coverage. At neighbourhood sizes 2 and 3, the CRF model has enough information to (correctly) infer multiplicity 1 for node n1 with high probability
Fig. 4
Fig. 4
Part of the q-mer histogram of the real P. aeruginosa (25×) dataset. The red line shows the optimal coverage cutoff value. The purple shaded area represents nodes with a coverage below this cutoff value that have true multiplicity 1 whereas the green shaded area represents nodes with a coverage above the cutoff that have true multiplicity 0. These nodes represent false positives and false negatives respectively when using a cutoff-based method. In contrast, the CRF model is able to correctly classify the majority of nodes (lighter shaded purple and green areas)

Similar articles

Cited by

References

    1. Pevzner PA, Tang HX, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci. 2001;98(17):9748–53. doi: 10.1073/pnas.171285098. - DOI - PMC - PubMed
    1. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KHJ, Remington KA, Anson EL, Bolanos RA, Chou HH, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai ZW, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng XQ, Rubin GM, Adams MD, Venter JC. A whole-genome assembly of Drosophila. Science. 2000;287(5461):2196–204. doi: 10.1126/science.287.5461.2196. - DOI - PubMed
    1. Simpson JT, Pop M. The Theory and Practice of Genome Sequence Assembly In: Chakravarti A, Green E, editors. Annual Review of Genomics and Human Genetics, vol 16: 2015. p. 153–72. 10.1146/annurev-genom-090314-050032. - PubMed
    1. Salmela L, Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics. 2014;30(24):3506–14. doi: 10.1093/bioinformatics/btu538. - DOI - PMC - PubMed
    1. Miclotte G, Heydari M, Demeester P, Rombauts S, de Peer Y, Audenaert P, Fostier J. Jabba: hybrid error correction for long sequencing reads. Algoritm Mol Biol. 2016; 11. 10.1186/s13015-016-0075-7. - PMC - PubMed

LinkOut - more resources