Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2018 Sep 4;19(1):311.
doi: 10.1186/s12859-018-2319-7.

BrownieAligner: accurate alignment of Illumina sequencing data to de Bruijn graphs

Affiliations

BrownieAligner: accurate alignment of Illumina sequencing data to de Bruijn graphs

Mahdi Heydari et al. BMC Bioinformatics. .

Abstract

Background: Aligning short reads to a reference genome is an important task in many genome analysis pipelines. This task is computationally more complex when the reference genome is provided in the form of a de Bruijn graph instead of a linear sequence string.

Results: We present a branch and bound alignment algorithm that uses the seed-and-extend paradigm to accurately align short Illumina reads to a graph. Given a seed, the algorithm greedily explores all branches of the tree until the optimal alignment path is found. To reduce the search space we compute upper bounds to the alignment score for each branch and discard the branch if it cannot improve the best solution found so far. Additionally, by using a two-pass alignment strategy and a higher-order Markov model, paths in the de Bruijn graph that do not represent a subsequence in the original reference genome are discarded from the search procedure.

Conclusions: BrownieAligner is applied to both synthetic and real datasets. It generally outperforms other state-of-the-art tools in terms of accuracy, while having similar runtime and memory requirements. Our results show that using the higher-order Markov model in BrownieAligner improves the accuracy, while the branch and bound algorithm reduces runtime. BrownieAligner is written in standard C++11 and released under GPL license. BrownieAligner relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at: https://github.com/biointec/browniealigner.

Keywords: Graph alignment; Illumina; Markov Model; Next-generation sequencing; de Bruijn Graph.

PubMed Disclaimer

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Fig. 1
This figure shows the association between the de Bruijn graph and MM tables. On the left side, part of a de Bruijn graph is shown. True paths are depicted by blue lines. The numbers inside each node indicate the multiplicity of that node, i.e., the number of times the node’s sequence is present in the reference genome. A table at each node guides the aligner based on previously observed nodes. The 2-MM and 3-MM tables of node A are shown on the right side. Based on the 2-MM table, reads that align to CA are guided to E as the continuation to node D is not allowed. However, the information in this table is insufficient to guide reads that align to BA since continuations to E and D are both valid. In contrast, the 3-MM table guides the reads that align to FBA to D, and GBA to E. The information in the final row in 3-MM table is redundant because it is also contained in the lower-order 2-MM table
Fig. 2
Fig. 2
Peak memory usage. Peak memory usage of the aligner tools for simulated datasets
Fig. 3
Fig. 3
Runtime. Average runtime of tools to align 1M reads for the simulated datasets
Fig. 4
Fig. 4
Runtime. The effect of branch and bound strategy on the running time of BrownieAligner
Fig. 5
Fig. 5
Peak memory usage. Peak memory usage of the aligner tools for real datasets
Fig. 6
Fig. 6
Runtime. Average runtime of tools to align 1M reads for the real datasets

Similar articles

Cited by

References

    1. Minoche AE, Dohm JC, Himmelbauer H. Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems. Genome Biol. 2011;12(11):R112. doi: 10.1186/gb-2011-12-11-r112. - DOI - PMC - PubMed
    1. Compeau PEC, Pevzner PA, Tesler G. How to apply de bruijn graphs to genome assembly. Nat Biotechnol. 2011;29(11):987–91. doi: 10.1038/nbt.2023. - DOI - PMC - PubMed
    1. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29(7):644–52. doi: 10.1038/nbt.1883. - DOI - PMC - PubMed
    1. Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT. Scaling metagenome sequence assembly with probabilistic de bruijn graphs. Proc Natl Acad Sci. 2012;109(33):13272–7. doi: 10.1073/pnas.1121464109. - DOI - PMC - PubMed
    1. Tattini L, D’Aurizio R, Magi A. Detection of genomic structural variants from next-generation sequencing data. Front Bioeng Biotechnol. 2015;3(June):1–8. - PMC - PubMed

LinkOut - more resources