BMC Bioinformatics. 2011 Apr 13;12:95.

doi: 10.1186/1471-2105-12-95.

Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies

Joshua Wetzel¹, Carl Kingsford, Mihai Pop

PMID: 21486487
PMCID: PMC3103447
DOI: 10.1186/1471-2105-12-95

Joshua Wetzel et al. BMC Bioinformatics. 2011.

Affiliation

¹ Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA.

Abstract

Background: Next-generation sequencing technologies allow genomes to be sequenced more quickly and less expensively than ever before. However, as sequencing technology has improved, the difficulty of de novo genome assembly has increased, due in large part to the shorter reads generated by the new technologies. The use of mated sequences (referred to as mate-pairs) is a standard means of disambiguating assemblies to obtain a more complete picture of the genome without resorting to manual finishing. Here, we examine the effectiveness of mate-pair information in resolving repeated sequences in the DNA (a paramount issue to overcome). While it has been empirically accepted that mate-pairs improve assemblies, and a variety of assemblers use mate-pairs in the context of repeat resolution, the effectiveness of mate-pairs in this context has not been systematically evaluated in previous literature.

Results: We show that, in high-coverage prokaryotic assemblies, libraries of short mate-pairs (about 4-6 times the read length) disambiguate repeat regions more effectively than the libraries that are commonly constructed in current genome projects. We also demonstrate that the best assemblies can be obtained by 'tuning' mate-pair libraries to accommodate the specific repeat structure of the genome being assembled - information that can be obtained through an initial assembly using unpaired reads. These results are shown across 360 simulations on 'ideal' prokaryotic data as well as assembly of 8 bacterial genomes using SOAPdenovo. The simulation results provide an upper bound on the potential value of mate-pairs for resolving repeated sequences in real prokaryotic data sets. The assembly results show that our method of tuning mate-pairs exploits fundamental properties of these genomes, leading to better assemblies even when using an off-the-shelf assembler in the presence of base-call errors.
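To make the "about 4-6 times the read length" rule of thumb concrete, the following minimal sketch converts a read length into an approximate insert-size window for a short mate-pair library. The function name `short_insert_range` and its default factors are illustrative assumptions taken from the range quoted above; this is not the paper's tuning procedure, which depends on the repeat structure of each genome.

```python
def short_insert_range(read_length, low_factor=4, high_factor=6):
    """Return an approximate (min, max) insert size in bp.

    Assumption: the factors 4 and 6 come from the "about 4-6 times the
    read length" statement in the abstract; they are a rough guide only.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return low_factor * read_length, high_factor * read_length


# Read lengths examined in the figures: 35, 50, and 100 bp.
for read_length in (35, 50, 100):
    low, high = short_insert_range(read_length)
    print(f"{read_length} bp reads: short library of roughly {low}-{high} bp")
```

By this estimate, 100 bp reads would call for inserts of a few hundred base pairs, far shorter than the 2000 and 8000 bp libraries used as the standard comparison in Figure 4.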

Conclusions: Our results demonstrate that dramatic improvements in prokaryotic genome assembly quality can be achieved by tuning mate-pair sizes to the actual repeat structure of a genome, suggesting the possible need to change the way sequencing projects are designed. We propose that a two-tiered approach - first generate an assembly of the genome with unpaired reads in order to evaluate the repeat structure of the genome; then generate the mate-pair libraries that provide most information towards the resolution of repeats in the genome being assembled - is not only possible, but likely also more cost-effective as it will significantly reduce downstream manual finishing costs. In future work we intend to address the question of whether this result can be extended to larger eukaryotic genomes, where repeat structure can be quite different.

Figures

**Figure 1**
**An assembly 'bubble'**. An assembly 'bubble' that complicates repeat resolution with mate-pairs. Shaded nodes are non-decision nodes (in- and out-degree equal to 1). The nodes R1 and R2 are decision nodes (repeats). There are two possible paths of the same length from one end of the mate-pair to the other (black nodes), leading to ambiguity in the graph traversal.
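The ambiguity described in Figure 1 can be reproduced on a toy graph. The sketch below is an illustration under stated assumptions, not the authors' implementation: the node names, the node lengths in bp, and the helper functions `decision_nodes` and `walks_of_length` are all hypothetical. A node is treated as a decision node when its in-degree or out-degree exceeds 1, and a mate-pair is treated as a constraint on the total length of the nodes lying between its two ends.

```python
from collections import defaultdict


def decision_nodes(edges):
    """Nodes with in-degree or out-degree greater than 1."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    nodes = {n for edge in edges for n in edge}
    return {n for n in nodes if indeg[n] > 1 or outdeg[n] > 1}


def walks_of_length(edges, node_len, start, end, target):
    """All walks from start to end whose interior nodes total `target` bp.

    Node lengths must be positive so that the search terminates.
    """
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    found = []

    def extend(path, used):
        for nxt in succ[path[-1]]:
            if nxt == end and used == target:
                found.append(path + [nxt])
            new_used = used + node_len[nxt]
            if new_used <= target:  # prune walks that are already too long
                extend(path + [nxt], new_used)

    extend([start], 0)
    return found


# Toy bubble: S and E hold the two mate-pair ends; a and b are the two arms.
edges = [("S", "R1"), ("R1", "a"), ("R1", "b"),
         ("a", "R2"), ("b", "R2"), ("R2", "E")]
node_len = {"S": 35, "R1": 20, "a": 10, "b": 10, "R2": 20, "E": 35}

print(sorted(decision_nodes(edges)))
for walk in walks_of_length(edges, node_len, "S", "E", target=50):
    print(" -> ".join(walk))
```

Both arms of the bubble satisfy the same length constraint, so this single mate-pair cannot tell the traversal which arm to take.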

**Figure 2**
**C-statistic across 391 bacterial genomes**. The percentage of short-read (35, 50, and 100-mer) graphs with C-Statistic in a particular range. Here we see that in about 60% of the graphs created from short reads, 60-90% of the finishing complexity is contained in repeats that are difficult to resolve using mate-pair information.

**Figure 3**
**'Ideal' mate-pair lengths across 391 bacterial genomes**. Average 'tuned' insert sizes for 35, 50, and 100-mer graphs, separated by C-Statistic. As can be seen, those graphs with a C-Statistic higher than 50 had a nearly uniform distribution and short insert sizes, while those with a lower C-Statistic had longer inserts and higher variance. Error bars represent standard deviation.

**Figure 4**
**Reduction in finishing complexity for 'tuned' vs. standard mate-pairs**. Graphs on the left (A) depict the mean percent reduction in finishing complexity on graphs constructed from a particular k-mer size given a set of ideal ('tuned') mate-pair libraries (grouped according to the C-Statistic). The ideal libraries are graph-specific, but the smaller of the two libraries averaged between 4.5k and 6k, where k is the original k-mer size (read length) used to construct the graph. Graphs on the right (B) depict the same statistics when using a mixture of two long libraries (2000 and 8000 bp).

**Figure 5**
**A simplified de Bruijn graph**. A small de Bruijn assembly graph after the simplification process is complete. The nodes R1, R2, R3, and R4 are repeats (decision nodes), and the shaded nodes are non-decision nodes. Note that non-decision nodes can only reside sandwiched between repeats.
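As a rough illustration of how a graph like the one in Figure 5 arises, the sketch below builds a de Bruijn graph from reads and merges every unbranched chain of nodes into a single sequence. Path compression is one standard simplification; the full simplification process used in the paper is not described in this excerpt, so this should be read as an assumption. The function names `build_de_bruijn` and `compress_paths` are hypothetical, reverse complements and sequencing errors are ignored, and isolated cycles with no branching node are skipped.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer in the reads is one edge."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    succ = defaultdict(list)
    indeg = defaultdict(int)
    for kmer in sorted(kmers):  # sorted for a deterministic edge order
        succ[kmer[:-1]].append(kmer[1:])
        indeg[kmer[1:]] += 1
    return succ, indeg


def compress_paths(succ, indeg):
    """Spell out every maximal unbranched path as one sequence."""

    def is_non_decision(node):
        # In-degree and out-degree both equal to 1.
        return indeg.get(node, 0) == 1 and len(succ.get(node, [])) == 1

    contigs = []
    for node in list(succ):
        if is_non_decision(node):
            continue  # paths start only at branching nodes or graph tips
        for nxt in succ[node]:
            path = [node, nxt]
            while is_non_decision(path[-1]):
                path.append(succ[path[-1]][0])
            # Each step along the path contributes one new base.
            contigs.append(path[0] + "".join(n[-1] for n in path[1:]))
    return sorted(contigs)


reads = ["ACGTC", "GTCGA"]  # toy reads; the 2-mer CG occurs twice
succ, indeg = build_de_bruijn(reads, k=3)
print(compress_paths(succ, indeg))
```

In this toy input the repeated 2-mer `CG` is the only branching node, and the compressed sequences that start or end at it are the stretches an assembler cannot order without extra information such as mate-pairs.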

Grants and funding

R21 AI085376/AI/NIAID NIH HHS/United States
