PLoS One. 2009 Dec 22;4(12):e8407.
doi: 10.1371/journal.pone.0008407.

Pebble and Rock Band: heuristic resolution of repeats and scaffolding in the Velvet short-read de novo assembler


Daniel R Zerbino et al. PLoS One.

Abstract

Background: Despite the short length of their reads, micro-read sequencing technologies have shown their usefulness for de novo sequencing. However, especially in eukaryotic genomes, complex repeat patterns are an obstacle to large assemblies.

Principal findings: We present a novel heuristic algorithm, Pebble, which uses paired-end read information to resolve repeats and scaffold contigs to produce large-scale assemblies. In simulations, we can achieve weighted median scaffold lengths (N50) of above 1 Mbp in Bacteria and above 100 kbp in more complex organisms. Using real datasets we obtained a 96 kbp N50 in Pseudomonas syringae and a unique 147 kbp scaffold of a ferret BAC clone. We also present an efficient algorithm called Rock Band for the resolution of repeats in the case of mixed length assemblies, where different sequencing platforms are combined to obtain a cost-effective assembly.

Conclusions: These algorithms extend the utility of short-read-only assemblies to large, complex genomes. They have been implemented and made available within the open-source Velvet short-read de novo assembler.

Conflict of interest statement

Competing Interests: EB has a small number of shares (<$5,000) in Illumina from previous consultancy with Solexa.

Figures

Figure 1. Construction of the secondary scaffold.
From a unique node A, Pebble starts by incorporating all the distances relative to this node found in the primary scaffold. For every unique node B which is connected to A, Pebble then follows the primary connections associated with B, thus flagging secondary neighbours of A. Assuming that all the nodes are laid out on a line, the distance from A to such a secondary neighbour C can be estimated as the distance from A to B, minus that from C to B.
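
The distance estimate described in this caption can be sketched as follows. This is a minimal illustration, not Velvet's implementation: the `primary` dictionary layout, the function name `secondary_distances`, and the choice to average several estimates for the same node are all assumptions made for the example. Distances are treated as signed offsets along a line, so the distance from C to B is the negative of the distance from B to C.

```python
from collections import defaultdict


def secondary_distances(primary, a, unique):
    """Estimate signed distances from node `a` to its secondary neighbours.

    primary: hypothetical layout, {node: {neighbour: signed distance}}
    unique:  set of nodes considered unique
    Returns {secondary neighbour: estimated distance from `a`}.
    """
    estimates = defaultdict(list)
    for b, dist_ab in primary.get(a, {}).items():
        if b not in unique:
            continue  # only unique nodes B are followed
        for c, dist_bc in primary.get(b, {}).items():
            if c == a:
                continue  # skip the connection leading back to A
            # dist(B -> C) == -dist(C -> B), hence
            # dist(A -> C) = dist(A -> B) - dist(C -> B)
            estimates[c].append(dist_ab + dist_bc)
    # Assumption: several estimates for one node are simply averaged.
    return {c: sum(v) / len(v) for c, v in estimates.items()}


# B lies 500 bp downstream of A; C lies 200 bp upstream of B.
primary = {"A": {"B": 500}, "B": {"A": -500, "C": -200}}
result = secondary_distances(primary, "A", unique={"A", "B"})
print(result)
```

In this toy layout the distance from C to B is 200 bp, so the estimate for A to C is 500 - 200 = 300 bp.
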
Figure 2. Resolution of repeats through the Rock Band assembly.
In this simplified diagram, contigs are represented as boxes, and long reads as thick curved lines. Instead of performing a pair-wise comparison of reads, the algorithm only examines the reads going out of unique node A. Two of the reads (1 and 2) go to node B. Read 3 is disregarded because it is not confirmed by another read. The algorithm then examines the reads going into node B. They all come from node A, except read 4, which is disregarded because it is unconfirmed, and read 5, which does not contradict the assembly of contigs A and B. Finally, read 6, despite its overlap with the other reads, is disregarded throughout the analysis, as it goes through neither node A nor node B.
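
The confirmation logic in this caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the Rock Band code itself: each long read is modelled as the ordered list of nodes it traverses, the names `confirmed_neighbours` and `can_join` are hypothetical, and the threshold `MIN_SUPPORT = 2` is an assumed reading of "confirmed by another read".

```python
from collections import Counter

MIN_SUPPORT = 2  # assumption: a link needs at least two reads to count


def confirmed_neighbours(reads, node, forward=True):
    """Nodes that follow (or precede) `node` in at least MIN_SUPPORT reads."""
    counts = Counter()
    for path in reads:
        for i, current in enumerate(path):
            if current != node:
                continue
            j = i + 1 if forward else i - 1
            if 0 <= j < len(path):  # reads ending or starting here add nothing
                counts[path[j]] += 1
    return {n for n, c in counts.items() if c >= MIN_SUPPORT}


def can_join(reads, a, b):
    """True if the only confirmed exit of `a` is `b` and the only
    confirmed entry into `b` is `a`."""
    return (confirmed_neighbours(reads, a, forward=True) == {b}
            and confirmed_neighbours(reads, b, forward=False) == {a})


# Reads 1 to 6 of the figure; X, Y and Z are placeholder node names.
reads = [
    ["A", "B"],  # read 1
    ["A", "B"],  # read 2
    ["A", "X"],  # read 3: unconfirmed exit from A
    ["Y", "B"],  # read 4: unconfirmed entry into B
    ["B", "Z"],  # read 5: starts in B, no contradiction
    ["X", "Y"],  # read 6: touches neither A nor B
]
print(can_join(reads, "A", "B"))
```

Only the reads touching A or B are ever inspected, which is the point of avoiding a pair-wise comparison of all reads. A second read from A to X would confirm a competing exit, and the join would no longer be accepted.
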
Figure 3. Results of simulations using various insert lengths.
Final scaffold N50 as a function of insert length, in four different species and three different simulation scenarios. The horizontal red lines represent the initial N50 after error removal and before repeat resolution. The dashed blue lines represent the highest possible N50, namely the length of the sequence being sampled.
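
For reference, N50 (the weighted median length mentioned in the abstract) can be computed as in the sketch below. It uses the common convention that N50 is the largest length L such that sequences of length at least L cover half of the total assembled length; the function name `n50` is chosen for the example and is not taken from Velvet.

```python
def n50(lengths):
    """Weighted median of contig or scaffold lengths.

    Walks the lengths from longest to shortest and returns the length at
    which the running total first reaches half of the overall total.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


example = [2, 2, 2, 3, 3, 4, 8, 8]  # total length 32
print(n50(example))
```

Here the two longest sequences (8 + 8 = 16) already cover half of the total of 32, so the N50 is 8. A single sequence covering the whole sample gives an N50 equal to its own length, which is why the sampled sequence length is the upper bound shown in the plots.
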
Figure 4. Results of simulations using various long/short read mixtures.
Final contig N50 as a function of long read concentration, in four different species and three different simulation scenarios. The length of the long reads is represented by the colour of the curves: 100 bp (black), 200 bp (red), 400 bp (green), 500 bp (blue) and 1000 bp (light blue).
Figure 5. Comparison of the Rock Band and Pebble methods.
The red and black curves represent the final scaffold N50 after the execution of the Rock Band or Pebble algorithms respectively, as a function of long read length or insert length, in four different species and three different simulation scenarios, as described in Figures 3 and 4.
