Evaluation of methods for de novo genome assembly from high-throughput sequencing reads reveals dependencies that affect the quality of the results

Niina Haiminen¹, David N Kuhn, Laxmi Parida, Isidore Rigoutsos

PMID: 21915294
PMCID: PMC3168497
DOI: 10.1371/journal.pone.0024182

Niina Haiminen et al. PLoS One. 2011;6(9):e24182. doi: 10.1371/journal.pone.0024182. Epub 2011 Sep 7.

Affiliation

¹ IBM Thomas J. Watson Research Center, Yorktown Heights, New York, United States of America. nhaimin@us.ibm.com

Erratum in

PLoS One. 2011;6(10). doi: 10.1371/annotation/bb125f93-80d3-4dd1-adfe-03d9fb740f3b
PLoS One. 2011;6(10). doi: 10.1371/annotation/176d83be-ed67-4205-9265-7208792d3dcf

Abstract

Recent developments in high-throughput sequencing technology have made low-cost sequencing an attractive approach for many genome analysis tasks. Increasing read lengths, improving quality, and the production of ever larger numbers of usable sequences per instrument run continue to make whole-genome assembly an appealing target application. In this paper we evaluate the feasibility of de novo genome assembly from short reads (≤100 nucleotides) through a detailed study involving genomic sequences of various lengths and origins, in conjunction with several of the currently popular assembly programs. Our extensive analysis demonstrates that, in addition to sequencing coverage, attributes such as the architecture of the target genome, the choice of assembly program, the average read length, and the observed sequencing error rates are powerful variables that affect the best achievable assembly of the target sequence in terms of size and correctness.

Conflict of interest statement

Competing Interests: The authors have read the journal's policy and have the following conflicts. Niina Haiminen and Laxmi Parida are employees of IBM. Isidore Rigoutsos and David N. Kuhn were employed by IBM at the time of the study. There are no patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials, as detailed online in the guide for authors.

Figures

**Figure 1. Illustration of assembly errors.**
A single contig is aligned against the reference sequence; observed assembly mistakes are shown in red. The contig has 4 matches against the reference. Match 1 is the longest, and it defines the match window coordinates and orientation. The first 157 positions in the match window have no contig matches, corresponding to an insertion. Match 1 and Match 2 overlap, corresponding to redundant positions in the reference. Match 2 is inverted: its orientation is opposite to that of the match window. The latter half of Match 3 lies outside the match window, corresponding to a relocation. Match 4 and Match 3 are in incorrect order relative to their contig positions, which corresponds to a reordering. Gaps and redundancies are also shown on the reference sequence.

**Figure 2. N50 contig size and genome coverage.**
The best values among all the studied assemblers for a given reference sequence and error rate are reported. a) N50 contig size is shown for all studied sequences, with different error rates for the 50 nt reads at 50× coverage. Sequences are ordered from smallest (HIV1) to largest (*S. cerevisiae*). The BAC data has 30 nt reads with 0.6% error; its results are shown under the 0% error label. An N50 size of zero indicates that the sum of contig lengths was less than 50% of the reference sequence length. b) The percentage of the reference genome covered by the assembly is shown for all studied sequences.

**Figure 3. Correctness and size statistics for HIV1 assemblies.**
Assembly statistics are shown for all assemblers and various read error rates for HIV1 assemblies. a) Correctness scores; b) coverage and N50 divided by genome size.

**Figure 4. Correctness and size statistics for *O. sativa* assemblies with varying pair distances.**
Assembly statistics for a) correctness and b) size are shown for unpaired and paired *O. sativa* reads with distances {400, 1000, 3000} and 50× coverage, assembled by Velvet and ABySS.
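
The two size statistics named in the Figure 2 caption can be illustrated with a minimal Python sketch. It assumes, following the caption, that N50 is measured against half of the reference sequence length (so it is zero when the contigs sum to less than 50% of the reference), and that genome coverage is the fraction of reference positions covered by at least one contig match. The function names and the half-open `(start, end)` match representation are hypothetical choices made for illustration; the exact procedure used in the paper is not described on this page.

```python
def n50_vs_reference(contig_lengths, reference_length):
    """Return the N50 contig size measured against the reference length.

    Contigs are visited from longest to shortest; the result is the length
    of the contig at which the running total first reaches half of the
    reference length. Returns 0 if the contigs never reach that total.
    """
    target = reference_length / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return 0


def percent_covered(matches, reference_length):
    """Return the percentage of reference positions covered by matches.

    `matches` is a list of half-open (start, end) intervals on the
    reference (an assumed representation). Overlapping intervals are
    counted once, so redundant positions do not inflate the coverage.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(matches):
        start = max(start, current_end)  # skip positions already counted
        if end > start:
            covered += end - start
            current_end = end
    return 100.0 * covered / reference_length


if __name__ == "__main__":
    print(n50_vs_reference([100, 80, 60, 40, 20], 300))
    print(n50_vs_reference([10, 10], 100))
    print(percent_covered([(0, 100), (50, 150), (300, 400)], 1000))
```

Measuring N50 against the reference rather than against the total assembly size keeps the statistic comparable across assemblers that produce very different amounts of sequence for the same target.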
