Bioinformatics. 2015 Jun 15;31(12):i9-16.
doi: 10.1093/bioinformatics/btv226.

De novo meta-assembly of ultra-deep sequencing data

Hamid Mirebrahim et al. Bioinformatics. 2015.

Abstract

We introduce a new divide-and-conquer approach to the problem of de novo genome assembly in the presence of ultra-deep sequencing data (i.e. coverage of 1000x or higher). Our proposed meta-assembler, Slicembler, partitions the input data into optimal-sized 'slices' and uses a standard assembly tool (e.g. Velvet, SPAdes, IDBA_UD or Ray) to assemble each slice individually. Slicembler then uses majority voting among the individual assemblies to identify long contigs that can be merged into the consensus assembly. To improve its efficiency, Slicembler uses a generalized suffix tree to identify these frequent contigs (or fractions thereof). Extensive experimental results on real ultra-deep sequencing data (8000x coverage) and simulated data show that Slicembler significantly improves the quality of the assembly compared with the base assembler run on its own. In fact, most of the time, Slicembler generates error-free assemblies. We also show that Slicembler is much more robust to high sequencing error rates than the base assembler.

Availability and implementation: Slicembler can be accessed at http://slicembler.cs.ucr.edu/.
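
The slicing step described in the abstract can be illustrated with a little coverage arithmetic. The following is a minimal sketch, not Slicembler's actual code: the function names, the fixed `slice_coverage` parameter and the simple consecutive split are assumptions made for illustration (the abstract does not say how the 'optimal' slice size is chosen).

```python
import math


def coverage(reads, genome_len):
    """Average depth of coverage: total sequenced bases / target length."""
    return sum(len(r) for r in reads) / genome_len


def partition_into_slices(reads, genome_len, slice_coverage):
    """Split reads into consecutive slices of roughly `slice_coverage` each.

    Hypothetical helper: the slice coverage is supplied by the caller here.
    """
    total_cov = coverage(reads, genome_len)
    num_slices = max(1, round(total_cov / slice_coverage))
    size = math.ceil(len(reads) / num_slices)
    return [reads[i:i + size] for i in range(0, len(reads), size)]


# Toy input: 8000 reads of 100 bp over a 100 bp target gives 8000x coverage.
reads = ["A" * 100] * 8000
slices = partition_into_slices(reads, genome_len=100, slice_coverage=800)
```

With these toy numbers, an 8000x dataset split at 800x per slice yields ten slices, each of which would then be handed to the base assembler.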

Figures

Fig. 1.
SLICEMBLER’s pipeline: First, the input reads are partitioned into smaller slices (1). Each slice is assembled individually (2), and the resulting assemblies are merged by a ‘majority voting’ process (3, 4). Before repeating these steps, any read in the input that maps to the consensus assembly is removed (6). When no further merging is possible, the final consensus assembly is produced (7)
Fig. 2.
Examples of frequently occurring substrings (FOS) from five assemblies (FOS can overlap)
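
To make the idea of frequently occurring substrings concrete, here is a minimal brute-force sketch of majority voting across assemblies. It is an assumption-laden illustration: Slicembler itself uses a generalized suffix tree, whereas this version uses plain substring search, ignores reverse-complement strands, and the names `count_votes`, `frequent_substrings`, `min_len` and `min_votes` are hypothetical.

```python
def count_votes(sub, assemblies):
    """Number of assemblies in which `sub` occurs in at least one contig."""
    return sum(any(sub in contig for contig in contigs) for contigs in assemblies)


def frequent_substrings(assemblies, min_len, min_votes=None):
    """Maximal substrings (length >= min_len) found in a majority of assemblies.

    `assemblies` is a list of assemblies, each a list of contig strings.
    Brute force for clarity; a suffix tree would be far more efficient.
    """
    if min_votes is None:
        min_votes = len(assemblies) // 2 + 1  # strict majority
    candidates = set()
    for contigs in assemblies:
        for c in contigs:
            for i in range(len(c) - min_len + 1):
                j = i + min_len
                if count_votes(c[i:j], assemblies) < min_votes:
                    continue
                # Extend to the right while the majority still agrees.
                while j < len(c) and count_votes(c[i:j + 1], assemblies) >= min_votes:
                    j += 1
                candidates.add(c[i:j])
    # Keep only substrings not contained in a longer candidate.
    maximal = [s for s in candidates
               if not any(s != t and s in t for t in candidates)]
    return sorted(maximal, key=lambda s: (-len(s), s))


# Five toy assemblies; three of them share the region ACGTACGT.
assemblies = [
    ["AAACGTACGTTT"],
    ["GGACGTACGTCC"],
    ["TTACGTACGTAA"],
    ["CCCCCCCC"],
    ["GGGGGGGG"],
]
fos = frequent_substrings(assemblies, min_len=4)
```

In this toy input only the shared region is supported by three of the five assemblies, so it is the single maximal frequent substring returned.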
Fig. 3.
Summary of assembly statistics on five barley BACs sequenced at 8000x. We compared Slicembler (using Velvet) with three alternative methods: Velvet on the entire dataset, Racer + Velvet on the entire dataset and the average performance of Velvet on the slices of 800x each (see legend). Ground truth was based on Sanger-based assemblies. Statistics were collected with QUAST for contigs longer than 500 bp
Fig. 4.
An illustration of SLICEMBLER’s progressive construction of the consensus assembly for BACs 1, 2 and 3 (‘snapshots’ are taken every five iterations). Each box represents a perfect alignment between that contig and the reference. Light green boxes indicate a new FOS compared with the previous snapshot. Circles point to gaps closed or contig extended via the merging process (picture created with CLC sequence viewer)
Fig. 5.
The percentage of reads (y axis) at each iteration of Slicembler (x axis) that map exactly (i.e. zero mismatches/indels) to the reference on the five ultra-deep sequenced BACs
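
The quantity plotted in Fig. 5 can be sketched as follows. This is a simplified illustration under stated assumptions: exact mapping is approximated by plain substring search on both strands rather than by a read mapper, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def percent_exact(reads, reference):
    """Percentage of reads that occur in the reference with zero
    mismatches/indels, on either strand."""
    if not reads:
        return 0.0
    hits = sum(
        1 for r in reads
        if r in reference or reverse_complement(r) in reference
    )
    return 100.0 * hits / len(reads)


reference = "AAACCCGG"
reads = ["AACC", "GGGTT", "ACGT", "CCGG"]  # third read has no exact match
pct = percent_exact(reads, reference)
```

Here three of the four toy reads match exactly (one of them only as a reverse complement), so `pct` is 75.0.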
Fig. 6.
The effect of increasing sequencing error rates on the quality of assemblies created by Velvet and Slicembler + Velvet. Input paired-end reads were generated using wgsim with a coverage of 3000x using BAC 3 as a reference. For Slicembler, simulated read sets were divided into six slices. Statistics were collected with QUAST for contigs longer than 500 bp

References

    1. Aird D., et al. (2011) Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol., 12, R18.
    2. Bankevich A., et al. (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol., 19, 455–477.
    3. Beerenwinkel N., Zagordi O. (2011) Ultra-deep sequencing for the analysis of viral populations. Curr. Opin. Virol., 1, 413–418.
    4. Boisvert S., et al. (2010) Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J. Comput. Biol., 17, 1519–1533.
    5. Brown C.T., et al. (2012) A reference-free algorithm for computational normalization of shotgun sequencing data. arXiv:1203.4802.