Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies
- PMID: 21486487
- PMCID: PMC3103447
- DOI: 10.1186/1471-2105-12-95
Abstract
Background: Next-generation sequencing technologies allow genomes to be sequenced more quickly and less expensively than ever before. However, as sequencing technology has improved, the difficulty of de novo genome assembly has increased, due in large part to the shorter reads generated by the new technologies. The use of mated sequences (referred to as mate-pairs) is a standard means of disambiguating assemblies to obtain a more complete picture of the genome without resorting to manual finishing. Here, we examine the effectiveness of mate-pair information in resolving repeated sequences in the DNA, a central obstacle in assembly. While it has been empirically accepted that mate-pairs improve assemblies, and a variety of assemblers use mate-pairs in the context of repeat resolution, the effectiveness of mate-pairs in this context has not been systematically evaluated in the previous literature.
Results: We show that, in high-coverage prokaryotic assemblies, libraries of short mate-pairs (about 4-6 times the read-length) more effectively disambiguate repeat regions than the libraries that are commonly constructed in current genome projects. We also demonstrate that the best assemblies can be obtained by 'tuning' mate-pair libraries to accommodate the specific repeat structure of the genome being assembled - information that can be obtained through an initial assembly using unpaired reads. These results are shown across 360 simulations on 'ideal' prokaryotic data as well as assemblies of 8 bacterial genomes using SOAPdenovo. The simulation results provide an upper bound on the potential value of mate-pairs for resolving repeated sequences in real prokaryotic data sets. The assembly results show that our method of tuning mate-pairs exploits fundamental properties of these genomes, leading to better assemblies even when using an off-the-shelf assembler in the presence of base-call errors.
Conclusions: Our results demonstrate that dramatic improvements in prokaryotic genome assembly quality can be achieved by tuning mate-pair sizes to the actual repeat structure of a genome, suggesting the possible need to change the way sequencing projects are designed. We propose that a two-tiered approach - first generate an assembly of the genome with unpaired reads in order to evaluate the repeat structure of the genome; then generate the mate-pair libraries that provide the most information towards the resolution of repeats in the genome being assembled - is not only possible, but likely also more cost-effective, as it will significantly reduce downstream manual finishing costs. In future work we intend to address the question of whether this result can be extended to larger eukaryotic genomes, where repeat structure can be quite different.
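As a rough illustration of the two-tiered idea, the sketch below takes repeat lengths estimated from a first, unpaired assembly and suggests mate-pair insert sizes for a second round. It is not the authors' tuning method, which the abstract does not specify. The function names, the `read_length + repeat + read_length` spanning rule, and the equal-sized grouping of repeats are all assumptions made for illustration; only the "about 4-6 times the read-length" range comes from the abstract.

```python
from typing import List, Tuple


def short_library_range(read_length: int, low: int = 4, high: int = 6) -> Tuple[int, int]:
    """Insert-size range for 'short' mate-pairs, about 4-6x the read length."""
    return (low * read_length, high * read_length)


def spanning_insert(repeat_length: int, read_length: int) -> int:
    """Assumed rule: both reads must sit fully in unique sequence flanking
    the repeat, so the insert covers the repeat plus one read on each side."""
    return repeat_length + 2 * read_length


def tune_insert_sizes(repeat_lengths: List[int], read_length: int,
                      max_libraries: int = 2) -> List[int]:
    """Hypothetical heuristic: split the repeats (sorted by the insert size
    needed to span them) into up to `max_libraries` roughly equal groups and
    pick, for each group, the smallest insert that spans every repeat in it."""
    if not repeat_lengths or max_libraries < 1:
        return []
    needed = sorted(spanning_insert(r, read_length) for r in repeat_lengths)
    k = min(max_libraries, len(needed))
    sizes = []
    for i in range(1, k + 1):
        # Index of the last (largest) entry in the i-th of k groups.
        idx = (i * len(needed)) // k - 1
        sizes.append(needed[idx])
    return sorted(set(sizes))


if __name__ == "__main__":
    # Illustrative repeat lengths only, not data from the paper.
    repeats = [100, 150, 1200, 5000]
    print(short_library_range(50))
    print(tune_insert_sizes(repeats, read_length=50, max_libraries=2))
```

In practice the repeat lengths would have to be inferred from the unpaired assembly (for example, from contig ends or unresolved graph nodes), and the number of libraries would be limited by cost; both are outside the scope of this sketch.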