Bioinformatics. 2012 Sep 15;28(18):i311-i317.

doi: 10.1093/bioinformatics/bts399.

Telescoper: de novo assembly of highly repetitive regions

Ma'ayan Bresler¹, Sara Sheehan, Andrew H Chan, Yun S Song

PMID: 22962446
PMCID: PMC3436826
DOI: 10.1093/bioinformatics/bts399

Ma'ayan Bresler et al. Bioinformatics. 2012.

Affiliation

¹ Department of EECS, University of California, Berkeley, CA 94720, USA.

Abstract

Motivation: With advances in sequencing technology, it has become faster and cheaper to obtain short-read data from which to assemble genomes. Although there has been considerable progress in the field of genome assembly, producing high-quality de novo assemblies from short reads remains challenging, primarily because of the complex repeat structures found in the genomes of most higher organisms. The telomeric regions of many genomes are particularly difficult to assemble, though much could be gained from the study of these regions, as their evolution has not been fully characterized and they have been linked to aging.

Results: In this article, we tackle the problem of assembling highly repetitive regions by developing a novel algorithm that iteratively extends long paths through a series of read-overlap graphs and evaluates them based on a statistical framework. Our algorithm, Telescoper, uses short- and long-insert libraries in an integrated way throughout the assembly process. Results on real and simulated data demonstrate that our approach can effectively resolve much of the complex repeat structure found in the telomeres of yeast genomes, especially when longer long-insert libraries are used.

Availability: Telescoper is publicly available for download at sourceforge.net/p/telescoper.

Contact: yss@eecs.berkeley.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
High-level description of the algorithm. Beginning with the seed string S₀, the algorithm iteratively performs the steps described to construct an e-graph data structure, from which a contig or contigs can be read. For simplicity, only a few example arcs are shown; in reality, red arcs are present between each consecutive pair of e-nodes, and orange arcs can be present between a given e-node and any of its preceding e-nodes

**Fig. 2.**
Illustration of Step 1 of Figure 1, finding an e-node S's possible extensions. (a) A read ‘cloud’ consists of those right-reads with left-mates that map to S. (b) The reads in the cloud are then error corrected and organized into a read-graph, which is in turn converted into a unitig graph. (c) Paths through the unitig graph correspond to possible extensions

**Fig. 3.**
Computing the expected number of left-reads mapping back from a unitig U₂ to the previous e-node S. (a) M_U₂ denotes the set of reads mapping from unitig U₂ to the previous e-node S. (b) For a right-read R_r located at position t in unitig U₂, the probability of its left-mate R_l mapping to S at a distance x behind U₂ is h(x + t), where h(·) is the expected insert distribution. (c) The expected number of reads at position x behind unitig U₂ is given by f_U(x), defined in Equation (1)
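
Equation (1) is not reproduced in this record, so the following is only a minimal sketch of the idea in panels (b) and (c). It assumes that the expected count at distance x is the sum of h(x + t) over the positions t of the right-reads in the unitig (linearity of expectation). The function names and the uniform insert distribution are hypothetical and are used purely for illustration; the paper's actual definition of f_U(x) may differ.

```python
from typing import Callable, Sequence


def expected_left_mates(
    x: int,
    right_read_positions: Sequence[int],
    h: Callable[[int], float],
) -> float:
    """Expected number of left-mates landing at distance x behind the unitig.

    Assumed form (not the paper's Equation (1)): each right-read at
    position t contributes probability h(x + t), and contributions add.
    """
    return sum(h(x + t) for t in right_read_positions)


def uniform_insert(lo: int, hi: int) -> Callable[[int], float]:
    """Hypothetical insert-size distribution, uniform on [lo, hi]."""
    width = hi - lo + 1
    return lambda d: 1.0 / width if lo <= d <= hi else 0.0


# Usage: three right-reads at positions 0, 50 and 150 in the unitig.
h = uniform_insert(100, 199)
value = expected_left_mates(100, [0, 50, 150], h)
```

In this toy case only the reads at t = 0 and t = 50 give insert sizes inside the supported range, so two of the three reads contribute.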

**Fig. 4.**
Illustration of Step 2 of Figure 1, scoring an e-node's possible extensions using short-insert read-pairs. (a) The penalty for unitig U₂ is 0 because no gaps of size ≥ ℓ/2 exist (where ℓ is the read length). (b) The penalty for unitig U₃ is > 0 because a gap, denoted g, of size ≥ ℓ/2 exists. (c) The size of contig gap g_c is the distance between the reads that define the end and start of two adjacent unitigs
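
The gap test in panels (a) and (b) can be sketched as follows. The penalty function itself is not given in this record, so the sketch only flags gaps of size ≥ ℓ/2. It assumes that a gap is a stretch of the region not covered by any mapped read, and that reads are described by their start positions; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def uncovered_gaps(
    read_starts: Sequence[int], read_len: int, region_len: int
) -> List[Tuple[int, int]]:
    """Return half-open intervals [a, b) of the region not covered by any read."""
    gaps = []
    covered_to = 0  # everything left of this coordinate is already covered
    for start in sorted(read_starts):
        if start > covered_to:
            gaps.append((covered_to, start))
        covered_to = max(covered_to, start + read_len)
    if covered_to < region_len:
        gaps.append((covered_to, region_len))
    return gaps


def large_gaps(
    read_starts: Sequence[int], read_len: int, region_len: int
) -> List[Tuple[int, int]]:
    """Keep only the gaps whose size is at least half the read length."""
    return [
        (a, b)
        for a, b in uncovered_gaps(read_starts, read_len, region_len)
        if (b - a) >= read_len / 2
    ]


# Usage: with read length 10, gaps of 2 and 3 bases are ignored,
# while gaps of 10 bases are flagged.
no_penalty = large_gaps([0, 8, 20, 33], read_len=10, region_len=40)
penalised = large_gaps([0, 20], read_len=10, region_len=40)
```

Under these assumptions, an empty result corresponds to the zero-penalty case in panel (a), and a non-empty result to the positive-penalty case in panel (b).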

**Fig. 5.**
The cumulative proportion of all aligned contigs exceeding the contig size indicated on the x-axis. These plots illustrate the continuity and completeness of different assemblies. For any given minimum contig length, Telescoper produced more aligned bases. NG50 can be read from this graph as the x-coordinates at which each curve hits the 50% mark of bases output relative to the reference. (a) Results on simulated data D1. (b) Results on simulated data D2. (c) Results on real data D3
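
NG50 can also be computed directly rather than read off the curve. The sketch below uses the usual definition: the length of the contig at which the cumulative length, taken from the longest contig downwards, first reaches half of the reference length. Passing aligned contig lengths as input is an assumption based on the caption, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def ng50(contig_lengths: Sequence[int], reference_length: int) -> Optional[int]:
    """Contig length at which cumulative length reaches 50% of the reference.

    Returns None when the contigs together cover less than half of the
    reference, in which case NG50 is undefined.
    """
    half = reference_length / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return None


# Usage: contigs of 50, 30, 20 and 10 kb against a 120 kb reference.
result = ng50([50, 30, 20, 10], 120)
```

Because the threshold depends on the reference length rather than on the total assembly length, an assembly that covers less than half of the reference has no NG50 under this definition.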

**Fig. 6.**
Contig continuity results for real data D3. The left and right telomeric regions (separated by the dotted line) for two different chromosomes are shown, with the aligned contigs displayed for each assembly algorithm. Different colours represent different contigs in the produced assembly, so more colours per method indicates a larger number of contigs. For each telomeric region shown, Telescoper produced a single contig for almost the entire region, while other algorithms often produced many small contigs
