Bioinformatics. 2014 Jun 15;30(12):i293-301.
doi: 10.1093/bioinformatics/btu266.

ExSPAnder: a universal repeat resolver for DNA fragment assembly

Andrey D Prjibelski et al. Bioinformatics.

Abstract

Next-generation sequencing (NGS) technologies have raised a challenging de novo genome assembly problem that is further amplified in recently emerged single-cell sequencing projects. While various NGS assemblers can use information from several libraries of read-pairs, most of them were originally developed for a single library and do not fully benefit from multiple libraries. Moreover, most assemblers assume uniform read coverage, a condition that does not hold for single-cell projects, where using read-pairs is even more challenging. We have developed exSPAnder, an algorithm that accurately resolves repeats with either a single library or multiple libraries of read-pairs, in both standard and single-cell assembly projects.

Availability and implementation: http://bioinf.spbau.ru/en/spades

Figures

Fig. 1.
Plots of the insert size distributions for the B. faecium isolate (a) paired-end and (b) jumping library, and for the S. aureus single-cell dataset with (c) paired-end and (d) jumping library. The distributions were computed by mapping reads to the B. faecium str. DSM4810 (Lapidus et al., 2009) and S. aureus str. USA300 substr. FPR3757 (Diep et al., 2006) reference genomes, respectively. All plots use a logarithmic scale.
Fig. 2.
(a) Reads r and r′ form a read-pair mapping to consecutive edges e and e′ in the assembly graph at positions x0 and y0, respectively. (b) Representation of a read-pair (r, r′) as a point in a rectangle (e, e′). (c) ‘Ideal read-pairs’ with the exact insert size I connecting edges e and e′ form a 45° line within a rectangle. (d) Read-pairs from real sequencing data, with variations in the insert size, represented as points within a rectangle. Most points are located within the confidence strip, providing evidence that edges e and e′ are supported by the read-pairs and are genome-consecutive. (e) A rectangle formed by a pair of edges that has few points falling into the confidence strip, revealing that e and e′ are not genome-consecutive edges.
Fig. 3.
(a) An example of an assembly graph with the genomic paths (p1, p2, p3, e1) and (p1, p2, p3, e2). (b, e) The composite rectangles for the correct genomic extension of each path: in these cases the points are evenly distributed within the confidence strip and the resulting score is equal to 1. (c, d) The composite rectangles that correspond to incorrect extension edges of these two paths. In each of these cases, at least one simple rectangle contains few points within the confidence strip.
Fig. 4.
Scoring a path that contains repetitive edges. (a) An example of the assembly graph with a repetitive edge pr. (b) A composite rectangle for the correct extension e of the path (p, pr). (c) A composite rectangle for the incorrect extension e′ of the path (p, pr).
Fig. 5.
An example of the assembly graph with repetitive edges p2 and p3.
Fig. 6.
A step-by-step example of the exSPAnder algorithm. (a–c) Forming a set of active edges {e1, e2, e3} (marked red) for the path P = (p1, p2, p3) using the corresponding composite rectangles. (d, e) Classifying edge p3 as repetitive and removing it from further consideration (marked gray). Edges that are not classified as repetitive are colored blue. (f–h) Recalculating scores of the extension edges and updating the set of active edges. (i, j) Removing repetitive edge p2. (k–m) Recalculating scores for the remaining active edges {e1, e2} and removing e2 as non-active. (n) Selecting the only active edge e1 as an extension for the path P.
Fig. 7.
An example of a composite rectangle formed by paths (p1, p2, p3) and (p1,p2,p3)
Fig. 8.
Plots of the false-positive (green) and false-negative (blue) rates for (a) B. faecium and (b) S. aureus paired-end libraries.
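
The confidence-strip idea described in the Fig. 2 caption can be illustrated with a short Python sketch. It is not the exSPAnder implementation and the scoring used in the paper is not given in this record. The sketch assumes that x0 and y0 are measured from the start of edges e and e′, that e′ immediately follows e, and that the insert size is the distance between those two positions. Under these assumptions an ideal read-pair satisfies (len(e) - x0) + y0 = I, which is the 45° line from panel (c). The function names and the `tolerance` parameter (half-width of the strip) are hypothetical.

```python
from typing import Iterable, Tuple


def implied_insert_size(x0: int, y0: int, len_e: int) -> int:
    """Distance spanned by a read-pair mapped at x0 on e and y0 on e'.

    Assumption: e' directly follows e, so the pair spans the remainder
    of e after x0 plus the first y0 positions of e'.
    """
    return (len_e - x0) + y0


def in_confidence_strip(x0: int, y0: int, len_e: int,
                        insert_size: int, tolerance: int) -> bool:
    """True if the point (x0, y0) lies within `tolerance` of the ideal 45-degree line."""
    return abs(implied_insert_size(x0, y0, len_e) - insert_size) <= tolerance


def strip_fraction(points: Iterable[Tuple[int, int]], len_e: int,
                   insert_size: int, tolerance: int) -> float:
    """Fraction of read-pair points that fall inside the confidence strip.

    This is a simple illustrative statistic, not the score from the paper.
    """
    pts = list(points)
    if not pts:
        return 0.0
    inside = sum(
        in_confidence_strip(x, y, len_e, insert_size, tolerance)
        for x, y in pts
    )
    return inside / len(pts)


if __name__ == "__main__":
    # Hypothetical read-pair positions on a 1000 bp edge e and its candidate successor e'.
    example_points = [(900, 200), (800, 110), (950, 400), (100, 50)]
    print(strip_fraction(example_points, len_e=1000, insert_size=300, tolerance=30))
```

A high fraction would correspond to panel (d), where most points sit in the strip and the edges look genome-consecutive. A low fraction would correspond to panel (e).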

References

    Bankevich A, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 2012;19:455–477.
    Boisvert S, et al. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J. Comput. Biol. 2010;17:1519–1533.
    Bresler M, et al. Telescoper: de novo assembly of highly repetitive regions. Bioinformatics. 2012;28:311–317.
    Butler J, et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 2008;18:810–820.
    Chitsaz H, et al. Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nat. Biotechnol. 2011;29:915–921.