De novo meta-assembly of ultra-deep sequencing data

Hamid Mirebrahim¹, Timothy J Close¹, Stefano Lonardi¹

PMID: 26072514
PMCID: PMC4765875
DOI: 10.1093/bioinformatics/btv226

Hamid Mirebrahim et al. Bioinformatics. 2015 Jun 15;31(12):i9-16.

Affiliation

¹ Department of Computer Science and Engineering and Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA.

Abstract

We introduce a new divide-and-conquer approach to de novo genome assembly in the presence of ultra-deep sequencing data (i.e. coverage of 1000x or higher). Our proposed meta-assembler Slicembler partitions the input data into optimal-sized 'slices' and uses a standard assembly tool (e.g. Velvet, SPAdes, IDBA_UD and Ray) to assemble each slice individually. Slicembler uses majority voting among the individual assemblies to identify long contigs that can be merged into the consensus assembly. To improve its efficiency, Slicembler uses a generalized suffix tree to identify these frequent contigs (or fractions thereof). Extensive experimental results on real ultra-deep sequencing data (8000x coverage) and simulated data show that Slicembler significantly improves the quality of the assembly compared with the performance of the base assembler. In fact, most of the time, Slicembler generates error-free assemblies. We also show that Slicembler is much more robust to high sequencing error rates than the base assembler.
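The slicing step can be illustrated with a minimal sketch. It assumes that the target coverage per slice is supplied by the user (the abstract does not say how the 'optimal' slice size is chosen) and that reads are assigned to slices at random. The names `estimate_coverage` and `partition_into_slices` are hypothetical and are not part of Slicembler.

```python
import math
import random
from typing import List, Sequence


def estimate_coverage(read_lengths: Sequence[int], genome_size: int) -> float:
    """Depth of coverage = total sequenced bases / length of the target."""
    return sum(read_lengths) / genome_size


def partition_into_slices(reads: Sequence[str], genome_size: int,
                          slice_coverage: float, seed: int = 0) -> List[List[str]]:
    """Split reads into slices of roughly `slice_coverage` depth each.

    Assumption: `slice_coverage` is chosen by the caller. Each element of
    `reads` is treated as one unit, so paired-end data would need each pair
    stored as a single element to keep mates in the same slice.
    """
    total = estimate_coverage([len(r) for r in reads], genome_size)
    n_slices = max(1, math.floor(total / slice_coverage))
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)  # avoid any ordering bias in the input
    # Round-robin assignment gives slices whose sizes differ by at most one read.
    return [shuffled[i::n_slices] for i in range(n_slices)]
```

For example, an 8000x dataset with a target of 800x per slice yields ten slices. Each slice would then be passed to the base assembler independently.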

Availability and implementation: Slicembler can be accessed at http://slicembler.cs.ucr.edu/.

Figures

**Fig. 1.**
Slicembler’s pipeline: First, the input reads are partitioned into smaller *slices* (1). Each slice is assembled individually (2), and the resulting assemblies are merged by a ‘majority voting’ process (3, 4). Before repeating these steps, any read in the input that maps to the consensus assembly is removed (6). When no further merging is possible, the final *consensus* assembly is produced (7)

**Fig. 2.**
Examples of *frequently occurring substrings* (FOS) from five assemblies (FOS can overlap)
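To make the idea of majority voting over assemblies concrete, the sketch below finds maximal substrings that occur in more than half of a set of assemblies. It is a brute-force stand-in with a hypothetical function name, suitable only for toy inputs. Slicembler itself is described as using a generalized suffix tree for this step, which this example does not implement.

```python
from typing import List, Set


def frequent_substrings(assemblies: List[str], min_len: int) -> Set[str]:
    """Return maximal substrings (length >= min_len) found in a strict
    majority of the assemblies.

    Brute force: enumerates every substring, so it is only meant for
    short toy sequences, not for real contigs.
    """
    majority = len(assemblies) // 2 + 1

    # Collect every candidate substring of sufficient length.
    candidates = set()
    for seq in assemblies:
        for i in range(len(seq)):
            for j in range(i + min_len, len(seq) + 1):
                candidates.add(seq[i:j])

    # Keep candidates supported by a majority of the assemblies.
    frequent = {s for s in candidates
                if sum(s in seq for seq in assemblies) >= majority}

    # Drop any frequent substring contained in a longer frequent one.
    return {s for s in frequent
            if not any(s != t and s in t for t in frequent)}
```

With five toy assemblies, three of which share the stretch `ACGTACG`, the function returns that stretch as the only maximal frequent substring. Distinct maximal substrings may still overlap one another in the underlying sequence.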

**Fig. 3.**
Summary of assembly statistics on five barley BACs sequenced at 8000x. We compared Slicembler (using Velvet) with three alternative methods: Velvet on the entire dataset, Racer + Velvet on the entire dataset, and the average performance of Velvet on the slices of 800x each (see legend). Ground truth was based on Sanger-based assemblies. Statistics were collected with QUAST for contigs longer than 500 bp

**Fig. 4.**
An illustration of Slicembler’s progressive construction of the consensus assembly for BACs 1, 2 and 3 (‘snapshots’ are taken every five iterations). Each box represents a perfect alignment between that contig and the reference. Light green boxes indicate a new FOS compared with the previous snapshot. Circles point to gaps closed or contigs extended via the merging process (picture created with CLC Sequence Viewer)

**Fig. 5.**
The percentage of reads (y axis) at each iteration of Slicembler (x axis) that map exactly (i.e. zero mismatches/indels) to the reference on the five ultra-deep sequenced BACs

**Fig. 6.**
The effect of increasing sequencing error rates on the quality of assemblies created by Velvet and Slicembler + Velvet. Input paired-end reads were generated using wgsim with a coverage of 3000x using BAC 3 as a reference. For Slicembler, simulated read sets were divided into six slices. Statistics were collected with QUAST for contigs longer than 500 bp
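As a rough guide to how such a simulated dataset could be sized, the sketch below computes the number of read pairs needed for a given coverage and builds a wgsim command line. The reference length, read length, file names and error rate are illustrative assumptions, since the caption does not report them. Only the wgsim options `-e` (base error rate), `-N` (number of read pairs), `-1` and `-2` (read lengths) are used.

```python
import math
from typing import List


def read_pairs_for_coverage(coverage: float, ref_len: int, read_len: int) -> int:
    """Pairs needed so that 2 * read_len * pairs / ref_len >= coverage."""
    return math.ceil(coverage * ref_len / (2 * read_len))


def wgsim_command(ref_fasta: str, out1: str, out2: str, coverage: float,
                  ref_len: int, read_len: int, error_rate: float) -> List[str]:
    """Build (but do not run) a wgsim invocation as an argument list.

    All argument values are supplied by the caller; nothing here is taken
    from the paper beyond the use of wgsim and the 3000x target coverage.
    """
    n_pairs = read_pairs_for_coverage(coverage, ref_len, read_len)
    return ["wgsim",
            "-e", str(error_rate),   # per-base sequencing error rate
            "-N", str(n_pairs),      # number of read pairs to simulate
            "-1", str(read_len),     # length of the first read
            "-2", str(read_len),     # length of the second read
            ref_fasta, out1, out2]
```

For a hypothetical 100 kb reference and 100 bp reads, 3000x coverage corresponds to 1,500,000 read pairs. Dividing that read set into six slices, as in the caption, gives about 500x per slice.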
