Efficient construction of an assembly string graph using the FM-index
- PMID: 20529929
- PMCID: PMC2881401
- DOI: 10.1093/bioinformatics/btq217
Efficient construction of an assembly string graph using the FM-index
Abstract
Motivation: Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computational cost and complexity of these with very large numbers of short reads. Here, we show how to use suffix array-based methods that have formed the basis of recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms.
Results: Standard overlap assembly methods have time complexity O(N(2)), where N is the sum of the lengths of the reads. We use the Ferragina-Manzini index (FM-index) derived from the Burrows-Wheeler transform to find overlaps of length at least tau among a set of reads. As well as an approach that finds all overlaps then implements transitive reduction to produce a string graph, we show how to output directly only the irreducible overlaps, significantly shrinking memory requirements and reducing compute time to O(N), independent of depth. Overlap-based assembly methods naturally handle mixed length read sets, including capillary reads or long reads promised by the third generation sequencing technologies. The algorithms we present here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly.
Figures


Similar articles
-
Readjoiner: a fast and memory efficient string graph-based sequence assembler.BMC Bioinformatics. 2012 May 6;13:82. doi: 10.1186/1471-2105-13-82. BMC Bioinformatics. 2012. PMID: 22559072 Free PMC article.
-
FSG: Fast String Graph Construction for De Novo Assembly.J Comput Biol. 2017 Oct;24(10):953-968. doi: 10.1089/cmb.2017.0089. Epub 2017 Jul 17. J Comput Biol. 2017. PMID: 28715269
-
String graph construction using incremental hashing.Bioinformatics. 2014 Dec 15;30(24):3515-23. doi: 10.1093/bioinformatics/btu578. Epub 2014 Sep 2. Bioinformatics. 2014. PMID: 25183486
-
The present and future of de novo whole-genome assembly.Brief Bioinform. 2018 Jan 1;19(1):23-40. doi: 10.1093/bib/bbw096. Brief Bioinform. 2018. PMID: 27742661 Review.
-
Genome sequence assembly algorithms and misassembly identification methods.Mol Biol Rep. 2022 Nov;49(11):11133-11148. doi: 10.1007/s11033-022-07919-8. Epub 2022 Sep 23. Mol Biol Rep. 2022. PMID: 36151399 Review.
Cited by
-
The Growing Importance of CNVs: New Insights for Detection and Clinical Interpretation.Front Genet. 2013 May 30;4:92. doi: 10.3389/fgene.2013.00092. eCollection 2013. Front Genet. 2013. PMID: 23750167 Free PMC article.
-
Complete Genome Sequence of Mycobacterium tuberculosis Clinical Isolate Spoligotype SIT745/EAI1-MYS.Genome Announc. 2016 May 19;4(3):e00323-16. doi: 10.1128/genomeA.00323-16. Genome Announc. 2016. PMID: 27198011 Free PMC article.
-
Readjoiner: a fast and memory efficient string graph-based sequence assembler.BMC Bioinformatics. 2012 May 6;13:82. doi: 10.1186/1471-2105-13-82. BMC Bioinformatics. 2012. PMID: 22559072 Free PMC article.
-
Using long and linked reads to improve an Atlantic herring (Clupea harengus) genome assembly.Sci Rep. 2019 Nov 27;9(1):17716. doi: 10.1038/s41598-019-54151-9. Sci Rep. 2019. PMID: 31776409 Free PMC article.
-
De novo characterization of the gene-rich transcriptomes of two color-polymorphic spiders, Theridion grallator and T. californicum (Araneae: Theridiidae), with special reference to pigment genes.BMC Genomics. 2013 Dec 8;14:862. doi: 10.1186/1471-2164-14-862. BMC Genomics. 2013. PMID: 24314324 Free PMC article.
References
-
- Bentley JL, Sedgewick R. SODA '97: Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia, PA: Society for Industrial and Applied Mathematics; 1997. Fast algorithms for sorting and searching strings; pp. 360–369.
-
- Burrows M, Wheeler DJ. Technical report 124. Palo Alto, CA: Digital Equipment Corporation; 1994. A block-sorting lossless data compression algorithm.
-
- Dementiev R, et al. Better external memory suffix array construction. J. Exp. Algorithmics. 2008;12:1–24.
-
- Ferragina P, Manzini G. Proceedings of the 41st Symposium on Foundations of Computer Science (FOCS 2000) Los Alamitos, CA, USA: IEEE Computer Society; 2000. Opportunistic data structures with applications; pp. 390–398.
Publication types
MeSH terms
Grants and funding
LinkOut - more resources
Full Text Sources
Other Literature Sources
Research Materials