Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2012 Mar;22(3):549-56.
doi: 10.1101/gr.126953.111. Epub 2011 Dec 7.

Efficient de novo assembly of large genomes using compressed data structures

Affiliations

Efficient de novo assembly of large genomes using compressed data structures

Jared T Simpson et al. Genome Res. 2012 Mar.

Abstract

De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.

PubMed Disclaimer

Figures

Figure 1.
Figure 1.
High-level diagram of the SGA assembly pipeline. The assembly has three main stages: error correction, contig assembly, and scaffolding. The error correction stage starts by building an FM-index for the reads (sga index) then performing error correction (sga correct). The assembly stage takes the corrected reads as input, re-indexes them, removes duplicate and low-quality reads, then constructs contigs. The scaffolding stage realigns the original reads to the contigs using BWA, constructs a scaffold graph using the alignments, and outputs a final set of scaffolds in FASTA format. For clarity, minor steps of the pipeline have been omitted from the diagram.
Figure 2.
Figure 2.
Reference string coverage analysis for the C. elegans N2 assembly. For string lengths from 50 bp up to 5000 bp, 10,000 strings were sampled from the consensus-corrected C. elegans reference genome. The proportion of the strings found in the SGA, Velvet, ABySS, and SOAPdenovo assemblies is plotted.
Figure 3.
Figure 3.
The amount of the human reference genome covered by a contig as a function of the minimum contig alignment length. For each length L on the x-axis, contig alignments less than L bp in length were filtered out and the amount of the reference genome covered by the remaining alignments was calculated.

References

    1. Bauer MJ, Cox AJ, Rosone G 2011. Lightweight BWT construction for very large string collections. In Proceedings of the twenty-second annual symposium, Combinatorial Pattern Matching, pp. 219–231. Springer-Verlag, Berlin, Heidelberg
    1. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, et al. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53–59 - PMC - PubMed
    1. Boisvert S, Laviolette F, Corbeil J 2010. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol 17: 1519–1533 - PMC - PubMed
    1. Burrows M, Wheeler DJ 1994. A block-sorting lossless data compression algorithm. Digital SRC Research Report. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.6774
    1. C. elegans Sequencing Consortium 1998. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282: 2012–2018 - PubMed

Publication types

LinkOut - more resources