SeqEntropy: genome-wide assessment of repeats for short read sequencing

Hsueh-Ting Chu¹, William W L Hsiao, Theresa T H Tsao, D Frank Hsu, Chaur-Chin Chen, Sheng-An Lee, Cheng-Yan Kao

PMID: 23544073
PMCID: PMC3609794
DOI: 10.1371/journal.pone.0059484

PLoS One. 2013;8(3):e59484. Epub 2013 Mar 27.

Affiliation

¹ Department of Biomedical informatics, Asia University, Taichung, Taiwan.

Abstract

Background: Recent studies of genome assembly from short-read sequencing data have reported that this technology cannot reconstruct an entire genome even at very high depth of coverage. We investigated this limitation from the perspective of information theory, evaluating the effect of repeats on short-read genome assembly using idealized (error-free) reads of different lengths.

Methodology/principal findings: We define a metric H(k) as the entropy of sequencing reads at read length k and use the relative loss of entropy, ΔH(k), to measure the impact of repeats on reconstructing a whole genome from sequences of length k. In our experiments, entropy loss correlated well with the de novo assembly coverage of a genome, and a score of ΔH(k)>1% indicated a severe loss in genome reconstruction fidelity. The minimal read length needed to achieve ΔH(k)<1% differs between organisms and is independent of genome size. For example, to meet the threshold of ΔH(k)<1%, a read length of 60 bp is needed for sequencing the human genome (3.2×10⁹ bp) and 320 bp for sequencing the fruit fly genome (1.8×10⁸ bp). We also calculated ΔH(k) scores for 2725 prokaryotic chromosomes and plasmids at several read lengths. Our results indicate that the level of repeats varies widely between genomes and that the entropy of sequencing reads provides a measure of their repeat structure.
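
The abstract does not spell out the formula behind H(k), so the following Python sketch rests on assumptions that are ours rather than the authors': H(k) is taken to be the Shannon entropy of the distribution of error-free length-k reads drawn uniformly from every start position on one strand, and ΔH(k) is taken to be the shortfall from the maximum log2(N) for N start positions, expressed as a fraction of that maximum. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def read_entropy(genome: str, k: int) -> float:
    """Shannon entropy (bits) of all length-k reads, one per start position.

    Assumption: reads are error-free and sampled uniformly from one strand.
    """
    n = len(genome) - k + 1  # number of read start positions
    if n < 1:
        raise ValueError("read length k exceeds sequence length")
    counts = Counter(genome[i:i + k] for i in range(n))
    return -sum((c / n) * log2(c / n) for c in counts.values())


def relative_entropy_loss(genome: str, k: int) -> float:
    """Assumed form of the relative loss: (H_max - H(k)) / H_max.

    H_max = log2(n) is reached only when every length-k read is unique,
    i.e. when no repeat of length >= k exists in the sequence.
    """
    n = len(genome) - k + 1
    if n < 2:
        raise ValueError("need at least two read positions")
    h_max = log2(n)
    return (h_max - read_entropy(genome, k)) / h_max


if __name__ == "__main__":
    # Toy sequence: "ACGT" occurs twice, so 4-mers lose some entropy.
    print(relative_entropy_loss("ACGTACGT", 4))
```

Under these assumptions, a sequence whose length-k reads are all distinct has a loss of zero, and each repeated read lowers the entropy in proportion to how often it recurs.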

Conclusions/significance: The proposed entropy-based measurement, which can be calculated in seconds to minutes in most cases, provides a rapid quantitative evaluation of the limitations of idealized short-read genome sequencing. Moreover, the calculation can be parallelized to scale up to large eukaryotic genomes. This approach may be useful for tuning sequencing parameters to achieve better genome assemblies when a closely related genome is already available.
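
As an illustration of how such a measurement might guide the choice of read length, the Python sketch below scans candidate read lengths on a reference sequence and returns the shortest one whose entropy loss falls under a threshold (1% by default, matching the cutoff quoted in the abstract). It assumes the loss is the fractional shortfall of the read entropy from log2 of the number of read positions; that definition and the function names are our assumptions, not the published implementation. A linear scan is used because the loss is not guaranteed to decrease strictly with k.

```python
from collections import Counter
from math import log2
from typing import Iterable, Optional


def entropy_loss(seq: str, k: int) -> float:
    """Assumed relative entropy loss of error-free length-k reads."""
    n = len(seq) - k + 1  # number of read start positions
    if n < 2:
        raise ValueError("need at least two read positions")
    counts = Counter(seq[i:i + k] for i in range(n))
    h = -sum((c / n) * log2(c / n) for c in counts.values())
    return (log2(n) - h) / log2(n)


def minimal_read_length(seq: str,
                        candidates: Iterable[int],
                        threshold: float = 0.01) -> Optional[int]:
    """Return the smallest candidate k with loss below threshold, else None."""
    for k in sorted(candidates):
        if entropy_loss(seq, k) < threshold:
            return k
    return None


if __name__ == "__main__":
    # Toy reference: the 3-mer "ACG" repeats, but every 4-mer is unique.
    print(minimal_read_length("ACGTACGA", range(2, 6)))
```

For a real genome the candidate list would be a handful of realistic read lengths, and the k-mer counting would need a more memory-efficient structure than an in-memory `Counter`.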

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Model of typical short read sequencing.**
(a) The target sequence is randomly broken into fragments, which are filtered by length to form a sequencing library. (b) One or both ends of the DNA fragments are sequenced in parallel to generate a massive set of short reads. We assumed that sequencing is random, so that each position is covered by a roughly equal number of fixed-length reads.

**Figure 2. Entropy losses at different read lengths for different organisms.**
Among the five organisms, the genomes of zebrafish (*D. rerio*) and fruit fly (*D. melanogaster*) lose more entropy at every read length. In particular, the fruit fly loses >2% of its entropy even at a read length of 120 bp; the loss falls below 1% at a read length of 230 bp. On the other hand, the genomes of yeast (*S. cerevisiae*) and nematode (*C. elegans*) show only minor entropy loss even with very short reads. The detailed results of the entropy measurements are listed in **Table 5**.

**Figure 3. Histograms and quartile box plot of relative entropy losses in 2725 prokaryotic replicons.**
The x-axis shows the number of replicons in each bin, while the y-axis shows the % entropy loss (ΔH). The quartile box plot displays the mean (diamond shape), the median (50%), the first (25%) and third (75%) quartiles (the boxes), and the entire range (the whiskers). The vast majority of the replicons lost <1% entropy regardless of the read length.

**Figure 4. Histograms and quartile box plot of entropy losses in 2725 prokaryotic replicons truncated at 1% entropy loss in order to see the finer breakdown.**
The x-axis shows the number of replicons in each bin, while the y-axis shows the % entropy loss (ΔH). The quartile box plot displays the mean (diamond shape), the median (50%), the first (25%) and third (75%) quartiles (the boxes), and the entire range (the whiskers). As read length increases, the entropy loss decreases, so more replicons have ΔH <1.0%.
