Bioinformatics. 2012 Sep 15;28(18):i311-i317.
doi: 10.1093/bioinformatics/bts399.

Telescoper: de novo assembly of highly repetitive regions

Ma'ayan Bresler et al. Bioinformatics.

Abstract

Motivation: With advances in sequencing technology, it has become faster and cheaper to obtain short-read data from which to assemble genomes. Although there has been considerable progress in the field of genome assembly, producing high-quality de novo assemblies from short reads remains challenging, primarily because of the complex repeat structures found in the genomes of most higher organisms. The telomeric regions of many genomes are particularly difficult to assemble, though much could be gained from the study of these regions, as their evolution has not been fully characterized and they have been linked to aging.

Results: In this article, we tackle the problem of assembling highly repetitive regions by developing a novel algorithm that iteratively extends long paths through a series of read-overlap graphs and evaluates them based on a statistical framework. Our algorithm, Telescoper, uses short- and long-insert libraries in an integrated way throughout the assembly process. Results on real and simulated data demonstrate that our approach can effectively resolve many of the complex repeat structures found in the telomeres of yeast genomes, especially when longer long-insert libraries are used.

Availability: Telescoper is publicly available for download at sourceforge.net/p/telescoper.

Contact: yss@eecs.berkeley.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
High-level description of the algorithm. Beginning with the seed string S0, the algorithm iteratively performs the steps described to construct an e-graph data structure, from which a contig or contigs can be read. For simplicity, only a few example arcs are shown; in reality, red arcs are present between each consecutive pair of e-nodes, and orange arcs can be present between a given e-node and any of its preceding e-nodes
Fig. 2.
Illustration of Step 1 of Figure 1, finding an e-node S's possible extensions. (a) A read ‘cloud’ consists of those right-reads with left-mates that map to S. (b) The reads in the cloud are then error corrected and organized into a read-graph, which is in turn converted into a unitig graph. (c) Paths through the unitig graph correspond to possible extensions
Fig. 3.
Computing the expected number of left-reads mapping back from a unitig U2 to the previous e-node S. (a) M_U2 denotes the set of reads mapping from unitig U2 to the previous e-node S. (b) For a right-read R_r located at position t in unitig U2, the probability of its left-mate R_l mapping to S at a distance x behind U2 is h(x + t), where h(·) is the expected insert distribution. (c) The expected number of reads at position x behind unitig U2 is given by f_U(x), defined in Equation (1)
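
The quantity in panels (b) and (c) of Fig. 3 can be illustrated with a small sketch. Equation (1) is not reproduced on this page, so the form used here, summing h(x + t) over the right-read positions t in the unitig, is an assumption inferred from the caption rather than the authors' definition. The Gaussian insert distribution, the function names and all parameter values are likewise hypothetical.

```python
import math


def gaussian_insert_pdf(mean, sd):
    """Return h(d), an ASSUMED Gaussian density over insert distances d."""
    def h(d):
        z = (d - mean) / sd
        return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return h


def expected_left_reads(x, right_read_positions, h):
    """Assumed form of f_U(x): sum of h(x + t) over right-read positions t.

    x is the distance behind the unitig; each t is the position of a
    right-read inside the unitig, as in panel (b) of the caption.
    """
    return sum(h(x + t) for t in right_read_positions)


# Hypothetical library: mean insert distance 300, standard deviation 30.
h = gaussian_insert_pdf(mean=300.0, sd=30.0)
positions = [0, 10, 20, 30]  # hypothetical right-read positions in the unitig

# Expected left-read counts at a few distances behind the unitig.
profile = [(x, expected_left_reads(x, positions, h)) for x in range(0, 600, 50)]
```

Under these assumptions the profile peaks where x + t is close to the mean insert distance, which is the behaviour the caption describes: left-mates are expected to pile up roughly one insert length behind their right-reads.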
Fig. 4.
Illustration of Step 2 of Figure 1, scoring an e-node's possible extensions using short-insert read-pairs. (a) The penalty for unitig U2 is 0 because no gaps of size ≥ ℓ/2 exist (where ℓ is the read length). (b) The penalty for unitig U3 is > 0 because a gap, denoted g, of size ≥ ℓ/2 exists. (c) The size of contig gap gc is the distance between the reads that define the end and start of two adjacent unitigs
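
The ℓ/2 threshold in Fig. 4 can be illustrated with a minimal sketch. The caption only states when the penalty is zero and when it is positive; the penalty function itself is not given on this page. The code below therefore assumes that a gap is a stretch of the unitig covered by no mapped read, and uses the total length of qualifying gaps as a stand-in penalty. Both choices, and all names, are hypothetical.

```python
def uncovered_gaps(read_starts, read_len, unitig_len):
    """Return (start, end) intervals of the unitig not covered by any read."""
    gaps = []
    covered_to = 0  # everything left of this coordinate is covered or recorded
    for start in sorted(read_starts):
        if start > covered_to:
            gaps.append((covered_to, start))
        covered_to = max(covered_to, start + read_len)
    if covered_to < unitig_len:
        gaps.append((covered_to, unitig_len))
    return gaps


def gap_penalty(read_starts, read_len, unitig_len):
    """Stand-in penalty (an assumption): total length of gaps >= read_len / 2."""
    threshold = read_len / 2
    return sum(
        end - start
        for start, end in uncovered_gaps(read_starts, read_len, unitig_len)
        if end - start >= threshold
    )


# Panel (a)-like case: reads tile the unitig, so the penalty is 0.
well_covered = gap_penalty([0, 40, 80], read_len=50, unitig_len=130)

# Panel (b)-like case: a 50-base hole exceeds read_len / 2, so the penalty is > 0.
with_gap = gap_penalty([0, 100], read_len=50, unitig_len=150)
```

Only the zero versus non-zero distinction is taken from the caption; how strongly a gap should be penalised is left open here.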
Fig. 5.
The cumulative proportion of all aligned contigs exceeding the contig size indicated on the x-axis. These plots illustrate the continuity and completeness of different assemblies. For any given minimum contig length, Telescoper produced more aligned bases. NG50 can be read from this graph as the x-coordinates at which each curve hits the 50% mark of bases output relative to the reference. (a) Results on simulated data D1. (b) Results on simulated data D2. (c) Results on real data D3
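
Reading NG50 off the curve in Fig. 5 corresponds to the usual NG50 computation, sketched below under simple assumptions: the contig lengths are taken as given (here, lengths of aligned contigs), and the function names and example numbers are hypothetical rather than values from the paper.

```python
def cumulative_curve(contig_lengths, reference_length):
    """Return (contig_length, fraction) points for a Fig. 5-style plot.

    Each fraction is the share of the reference covered by contigs at
    least as long as the paired length.
    """
    points = []
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        points.append((length, total / reference_length))
    return points


def ng50(contig_lengths, reference_length):
    """Largest length L such that contigs >= L sum to >= 50% of the reference.

    Returns None when the contigs together cover less than half the reference.
    """
    for length, fraction in cumulative_curve(contig_lengths, reference_length):
        if fraction >= 0.5:
            return length
    return None


# Hypothetical assembly: four contigs against a 120-base reference.
lengths = [50, 30, 20, 10]
value = ng50(lengths, reference_length=120)
```

In this toy case the two longest contigs (50 and 30) are the first to cover at least half of the reference, so `value` is 30. Unlike N50, the denominator is the reference length rather than the assembly length, which is why a curve that never reaches the 50% mark has no NG50.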
Fig. 6.
Contig continuity results for real data D3. The left and right telomeric regions (separated by the dotted line) for two different chromosomes are shown, with the aligned contigs displayed for each assembly algorithm. Different colours represent different contigs in the produced assembly, so more colours per method indicates a larger number of contigs. For each telomeric region shown, Telescoper produced a single contig for almost the entire region, while other algorithms often produced many small contigs

