BMC Bioinformatics. 2011 Apr 13;12:95.
doi: 10.1186/1471-2105-12-95.

Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies

Joshua Wetzel et al. BMC Bioinformatics. 2011.

Abstract

Background: Next-generation sequencing technologies allow genomes to be sequenced more quickly and less expensively than ever before. However, as sequencing technology has improved, the difficulty of de novo genome assembly has increased, due in large part to the shorter reads generated by the new technologies. The use of mated sequences (referred to as mate-pairs) is a standard means of disambiguating assemblies to obtain a more complete picture of the genome without resorting to manual finishing. Here, we examine the effectiveness of mate-pair information in resolving repeated sequences in the DNA, a central obstacle in assembly. While it has been empirically accepted that mate-pairs improve assemblies, and a variety of assemblers use mate-pairs in the context of repeat resolution, the effectiveness of mate-pairs in this context has not been systematically evaluated in previous literature.

Results: We show that, in high-coverage prokaryotic assemblies, libraries of short mate-pairs (about 4-6 times the read length) disambiguate repeat regions more effectively than the libraries that are commonly constructed in current genome projects. We also demonstrate that the best assemblies can be obtained by 'tuning' mate-pair libraries to accommodate the specific repeat structure of the genome being assembled - information that can be obtained through an initial assembly using unpaired reads. These results are shown across 360 simulations on 'ideal' prokaryotic data as well as assembly of 8 bacterial genomes using SOAPdenovo. The simulation results provide an upper bound on the potential value of mate-pairs for resolving repeated sequences in real prokaryotic data sets. The assembly results show that our method of tuning mate-pairs exploits fundamental properties of these genomes, leading to better assemblies even when using an off-the-shelf assembler in the presence of base-call errors.
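As a rough illustration of the "about 4-6 times the read length" rule of thumb above, the following sketch computes the corresponding range for the read lengths mentioned in the figure captions (35, 50, and 100). The function name `short_insert_range` and the reading of the factor as applying to the library's insert size in base pairs are assumptions made for this example, not part of the paper's method.

```python
def short_insert_range(read_length, low_factor=4, high_factor=6):
    """Return (min, max) insert size in bp for a 'short' mate-pair library.

    Assumption: the 4-6x factor from the abstract is applied directly
    to the read length to get an approximate insert-size range.
    """
    return read_length * low_factor, read_length * high_factor


for read_length in (35, 50, 100):
    low, high = short_insert_range(read_length)
    print(f"{read_length} bp reads: inserts of roughly {low}-{high} bp")
```

These ranges are only the generic starting point; the abstract's stronger claim is that libraries tuned to a genome's own repeat structure do better still.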

Conclusions: Our results demonstrate that dramatic improvements in prokaryotic genome assembly quality can be achieved by tuning mate-pair sizes to the actual repeat structure of a genome, suggesting the possible need to change the way sequencing projects are designed. We propose that a two-tiered approach - first generate an assembly of the genome with unpaired reads in order to evaluate the repeat structure of the genome; then generate the mate-pair libraries that provide the most information towards the resolution of repeats in the genome being assembled - is not only possible, but likely also more cost-effective, as it will significantly reduce downstream manual finishing costs. In future work we intend to address the question of whether this result can be extended to larger eukaryotic genomes, where repeat structure can be quite different.

Figures

Figure 1
An assembly 'bubble'. An assembly 'bubble' that complicates repeat resolution with mate-pairs. Shaded nodes are non-decision nodes (in- and out-degree equal to 1). The nodes R1 and R2 are decision nodes (repeats). There are two possible paths of the same length from one end of the mate-pair to the other (black nodes), leading to ambiguity in the graph traversal.
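The ambiguity described in this caption can be sketched in a few lines: if the expected distance between the two ends of a mate-pair is matched by more than one path through the graph, the mate-pair cannot tell the paths apart. The graph, the node lengths, and the helper `paths_of_length` below are all hypothetical and chosen only to mirror the shape of the bubble; this is not the traversal used by any particular assembler.

```python
def paths_of_length(graph, lengths, start, end, target, tolerance=0):
    """List paths from start to end whose summed node lengths match target.

    graph:   dict mapping node -> list of successor nodes
    lengths: dict mapping node -> sequence length in bp (all positive)
    """
    found = []

    def walk(node, path, total):
        total += lengths[node]
        path = path + [node]
        if total > target + tolerance:
            return  # already too long; also guarantees termination on cycles
        if node == end:
            if abs(total - target) <= tolerance:
                found.append(path)
            return
        for successor in graph.get(node, []):
            walk(successor, path, total)

    walk(start, [], 0)
    return found


# Hypothetical bubble: M1 and M2 hold the mate-pair ends, R1 and R2 are repeats.
graph = {"M1": ["R1"], "R1": ["A", "B"], "A": ["R2"], "B": ["R2"], "R2": ["M2"], "M2": []}
lengths = {"M1": 50, "R1": 30, "A": 40, "B": 40, "R2": 30, "M2": 50}

# Both branches have the same length, so two paths fit the mate-pair.
ambiguous = paths_of_length(graph, lengths, "M1", "M2", target=200)

# If one branch were longer, only a single path would fit.
resolved = paths_of_length(graph, dict(lengths, B=90), "M1", "M2", target=200)
```

With equal branch lengths the search returns two candidate paths, which is the bubble case; with unequal lengths the mate-pair distance singles out one path.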
Figure 2
C-Statistic across 391 bacterial genomes. The percentage of short-read (35, 50, and 100-mer) graphs with a C-Statistic in a particular range. In about 60% of the graphs created from short reads, 60-90% of the finishing complexity is contained in repeats that are difficult to resolve using mate-pair information.
Figure 3
'Ideal' mate-pair lengths across 391 bacterial genomes. Average 'tuned' insert sizes for 35, 50, and 100-mer graphs, separated by C-Statistic. As can be seen, those graphs with a C-Statistic higher than 50 had a nearly uniform distribution and short insert sizes, while those with a lower C-Statistic had longer inserts and higher variance. Error bars represent standard deviation.
Figure 4
Reduction in finishing complexity for 'tuned' vs. standard mate-pairs. Graphs on the left (A) depict the mean percent reduction in finishing complexity on graphs constructed from a particular k-mer size given a set of ideal ('tuned') mate-pair libraries (grouped according to the C-Statistic). The ideal libraries are graph-specific, but the smaller of the two libraries averaged between 4.5k and 6k, where k is the original k-mer size (read length) used to construct the graph. Graphs on the right (B) depict the same statistics when using a mixture of two long libraries (2000 and 8000 bp).
Figure 5
A simplified de Bruijn graph. A small de Bruijn assembly graph after the simplification process is complete. The nodes R1, R2, R3, and R4 are repeats (decision nodes), and the shaded nodes are non-decision nodes. Note that non-decision nodes can only reside sandwiched between repeats.
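The distinction between decision and non-decision nodes can be sketched using the degree rule given in the Figure 1 caption (non-decision nodes have in- and out-degree equal to 1). The edge list and the helpers `classify_nodes` and `sandwiched` below are hypothetical, written only to illustrate the property that, after simplification, non-decision nodes sit between repeats.

```python
from collections import Counter


def classify_nodes(edges):
    """Split nodes into (decision, non_decision) sets by in/out degree."""
    in_degree = Counter(dst for _, dst in edges)
    out_degree = Counter(src for src, _ in edges)
    nodes = set(in_degree) | set(out_degree)
    # Non-decision nodes have exactly one way in and one way out.
    non_decision = {n for n in nodes if in_degree[n] == 1 and out_degree[n] == 1}
    return nodes - non_decision, non_decision


def sandwiched(edges, decision, non_decision):
    """Return True if every neighbour of a non-decision node is a decision node."""
    for src, dst in edges:
        if src in non_decision and dst not in decision:
            return False
        if dst in non_decision and src not in decision:
            return False
    return True


# Hypothetical simplified graph: two repeats joined by four unique segments.
edges = [
    ("R1", "a"), ("a", "R2"),
    ("R1", "b"), ("b", "R2"),
    ("R2", "c"), ("c", "R1"),
    ("R2", "d"), ("d", "R1"),
]

decision, non_decision = classify_nodes(edges)
is_sandwiched = sandwiched(edges, decision, non_decision)
```

In this toy graph R1 and R2 are classified as decision nodes and a, b, c, d as non-decision nodes, each of which has only repeats as neighbours.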
