SeqEntropy: genome-wide assessment of repeats for short read sequencing
- PMID: 23544073
- PMCID: PMC3609794
- DOI: 10.1371/journal.pone.0059484
Abstract
Background: Recent studies on genome assembly from short-read sequencing data have reported that this technology cannot reconstruct the entire genome even at very high depth of coverage. We investigated this limitation from the perspective of information theory, evaluating the effect of repeats on short-read genome assembly using idealized (error-free) reads of different lengths.
Methodology/principal findings: We define a metric H(k) to be the entropy of sequencing reads at a read length k and use the relative loss of entropy ΔH(k) to measure the impact of repeats on the reconstruction of a whole genome from sequences of length k. In our experiments, we found that entropy loss correlates well with the de novo assembly coverage of a genome, and that a score of ΔH(k)>1% indicates a severe loss in genome reconstruction fidelity. The minimal read length needed to achieve ΔH(k)<1% differs between organisms and is independent of genome size. For example, to meet the threshold of ΔH(k)<1%, a read length of 60 bp is needed for sequencing the human genome (3.2×10^9 bp) and 320 bp for sequencing the fruit fly genome (1.8×10^8 bp). We also calculated the ΔH(k) scores for 2725 prokaryotic chromosomes and plasmids at several read lengths. Our results indicate that the levels of repeats in different genomes are diverse and that the entropy of sequencing reads provides a measure of their repeat structure.
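The abstract does not spell out the formulas, so the following is only a minimal sketch under stated assumptions, not the SeqEntropy implementation. It assumes H(k) is the Shannon entropy (in bits) of the distribution of all length-k substrings of a single-stranded, linear sequence, and that ΔH(k) is the shortfall of H(k) from its maximum log2(N), where N is the number of reads, expressed as a fraction of that maximum. The function names and the toy sequence are hypothetical.

```python
import math
from collections import Counter


def read_entropy(genome: str, k: int) -> float:
    """Shannon entropy (bits) of all error-free reads of length k.

    Assumption: one read starts at every position of a linear,
    single-stranded sequence; reverse complements are ignored.
    """
    n_reads = len(genome) - k + 1
    if k < 1 or n_reads < 1:
        raise ValueError("k must satisfy 1 <= k <= len(genome)")
    counts = Counter(genome[i:i + k] for i in range(n_reads))
    return -sum((c / n_reads) * math.log2(c / n_reads) for c in counts.values())


def relative_entropy_loss(genome: str, k: int) -> float:
    """Assumed form of dH(k): (H_max - H(k)) / H_max with H_max = log2(N)."""
    n_reads = len(genome) - k + 1
    if k < 1 or n_reads < 1:
        raise ValueError("k must satisfy 1 <= k <= len(genome)")
    h_max = math.log2(n_reads)
    if h_max == 0.0:
        return 0.0  # a single read cannot lose entropy
    return (h_max - read_entropy(genome, k)) / h_max


def minimal_read_length(genome: str, threshold: float, candidates) -> "int | None":
    """Smallest candidate k whose assumed dH(k) falls below the threshold."""
    for k in sorted(candidates):
        if relative_entropy_loss(genome, k) < threshold:
            return k
    return None


if __name__ == "__main__":
    toy = "ACGTTACGTA"  # hypothetical sequence containing the repeat "ACGT"
    for k in range(1, 6):
        print(k, round(relative_entropy_loss(toy, k), 4))
    print("minimal k:", minimal_read_length(toy, 0.01, range(1, 8)))
```

On the toy sequence the loss shrinks as k grows and reaches zero once every read spans the repeat, which is the behaviour the 1% threshold is meant to capture. Storing every k-mer in a dictionary is only practical for small genomes.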
Conclusions/significance: The proposed entropy-based measurement, which can be calculated in seconds to minutes in most cases, provides a rapid quantitative evaluation of the limitations of idealized short-read genome sequencing. Moreover, the calculation can be parallelized to scale up to large eukaryotic genomes. This approach may be useful for tuning sequencing parameters to achieve better genome assemblies when a closely related genome is already available.
Similar articles
- Read length and repeat resolution: exploring prokaryote genomes using next-generation sequencing technologies. PLoS One. 2010 Jul 12;5(7):e11518. doi: 10.1371/journal.pone.0011518. PMID: 20634954. Free PMC article.
- Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One. 2009 Dec 22;4(12):e8407. doi: 10.1371/journal.pone.0008407. PMID: 20027311. Free PMC article.
- 454 sequencing put to the test using the complex genome of barley. BMC Genomics. 2006 Oct 26;7:275. doi: 10.1186/1471-2164-7-275. PMID: 17067373. Free PMC article.
- PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics. 2015 Oct;13(5):278-89. doi: 10.1016/j.gpb.2015.08.002. Epub 2015 Nov 2. PMID: 26542840. Free PMC article. Review.
- Oxford Nanopore MinION Sequencing and Genome Assembly. Genomics Proteomics Bioinformatics. 2016 Oct;14(5):265-279. doi: 10.1016/j.gpb.2016.05.004. Epub 2016 Sep 17. PMID: 27646134. Free PMC article. Review.
Cited by
- Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics. 2014 Jan 3;15:2. doi: 10.1186/1471-2105-15-2. PMID: 24386976. Free PMC article.