PLoS One. 2011;6(9):e24182.
doi: 10.1371/journal.pone.0024182. Epub 2011 Sep 7.

Evaluation of methods for de novo genome assembly from high-throughput sequencing reads reveals dependencies that affect the quality of the results

Niina Haiminen et al. PLoS One. 2011.

Erratum in

  • PLoS One. 2011;6(10). doi: 10.1371/annotation/bb125f93-80d3-4dd1-adfe-03d9fb740f3b
  • PLoS One. 2011;6(10). doi: 10.1371/annotation/176d83be-ed67-4205-9265-7208792d3dcf

Abstract

Recent developments in high-throughput sequencing technology have made low-cost sequencing an attractive approach for many genome analysis tasks. Increasing read lengths, improving quality and the production of ever larger numbers of usable sequences per instrument run continue to make whole-genome assembly an appealing target application. In this paper we evaluate the feasibility of de novo genome assembly from short reads (≤100 nucleotides) through a detailed study involving genomic sequences of various lengths and origins, in conjunction with several of the currently popular assembly programs. Our extensive analysis demonstrates that, in addition to sequencing coverage, attributes such as the architecture of the target genome, the choice of assembly program, the average read length and the observed sequencing error rates are powerful variables that affect the best achievable assembly of the target sequence in terms of size and correctness.

Conflict of interest statement

Competing Interests: The authors have read the journal's policy and have the following conflicts. Niina Haiminen and Laxmi Parida are employees of IBM. Isidore Rigoutsos and David N. Kuhn were employed by IBM at the time of the study. There are no patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials, as detailed online in the guide for authors.

Figures

Figure 1. Illustration of assembly errors.
A single contig is aligned against the reference sequence; observed assembly mistakes are shown in red. The contig has 4 matches against the reference. Match 1 is the longest, and it defines the match window coordinates and orientation. The first 157 positions in the match window have no contig matches, corresponding to an insertion. Match 1 and Match 2 overlap, corresponding to redundant positions in the reference. Match 2 is inverted: its orientation is opposite to that of the match window. The latter half of Match 3 lies outside the match window, corresponding to a relocation. Match 4 and Match 3 are in incorrect order relative to their contig positions, corresponding to a reordering. Gaps and redundancies are also shown on the reference sequence.
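Three of the error types in this caption (inversion, redundancy and reordering) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' evaluation code: the `Match` record, the `flag_errors` function and all coordinates are hypothetical, and relocations and insertions are left out because the caption does not say how the match window bounds are derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical record for one contig-to-reference match."""
    name: str
    contig_start: int  # start position on the contig
    ref_start: int     # start position on the reference
    length: int        # number of matched positions
    forward: bool      # True if contig and reference share orientation


def flag_errors(matches):
    """Flag inversions, redundant reference positions and reorderings."""
    # The longest match sets the expected orientation, as in the caption.
    longest = max(matches, key=lambda m: m.length)
    inverted = [m.name for m in matches if m.forward != longest.forward]

    # Redundancy: reference positions already covered by an earlier match.
    redundant = 0
    by_ref = sorted(matches, key=lambda m: m.ref_start)
    furthest_end = by_ref[0].ref_start
    for m in by_ref:
        end = m.ref_start + m.length
        redundant += max(0, min(end, furthest_end) - m.ref_start)
        furthest_end = max(furthest_end, end)

    # Reordering: neighbours in contig order that are out of reference order.
    by_contig = sorted(matches, key=lambda m: m.contig_start)
    if not longest.forward:
        by_contig.reverse()
    reordered = [
        (a.name, b.name)
        for a, b in zip(by_contig, by_contig[1:])
        if a.ref_start > b.ref_start
    ]
    return {"inverted": inverted, "redundant": redundant, "reordered": reordered}


# Invented coordinates that loosely mimic the situation in the figure.
example = [
    Match("Match 1", contig_start=0, ref_start=100, length=500, forward=True),
    Match("Match 2", contig_start=500, ref_start=550, length=200, forward=False),
    Match("Match 3", contig_start=700, ref_start=1200, length=150, forward=True),
    Match("Match 4", contig_start=850, ref_start=1000, length=100, forward=True),
]
report = flag_errors(example)
```

With these invented coordinates, Match 2 is flagged as inverted, 50 reference positions are counted as redundant (the overlap of Match 1 and Match 2), and Match 3 and Match 4 are flagged as reordered.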
Figure 2. N50 contig size and genome coverage.
The best values among all the studied assemblers for a given reference sequence and error rate are reported. a) N50 contig size is shown for all studied sequences, with different error rates for the 50 nt reads at 50× coverage. Sequences are ordered from smallest (HIV1) to largest (S. cerevisiae). The BAC data has 30 nt reads with 0.6% error; its results are shown under the 0% error label. An N50 size of zero indicates that the sum of contig lengths was less than 50% of the reference sequence length. b) The percentage of the reference genome covered by the assembly is shown for all studied sequences.
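The two statistics in this caption can be sketched as follows. This is a minimal illustration, assuming that N50 is measured against half the reference length (which is what the zero-N50 convention in the caption suggests) and that matches are given as half-open `(start, end)` reference intervals. The function names and the interval format are assumptions, not taken from the paper.

```python
def n50_vs_reference(contig_lengths, reference_length):
    """N50 contig size measured against the reference length.

    Contigs are taken from longest to shortest; the result is the length of
    the contig at which the running total first reaches half the reference
    length. Returns 0 if the contigs never add up to that much.
    """
    target = reference_length / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return 0


def percent_covered(match_intervals, reference_length):
    """Percentage of reference positions covered by at least one match."""
    covered = 0
    current_end = 0
    for start, end in sorted(match_intervals):
        # Skip the part of this interval that earlier intervals already cover.
        start = max(start, current_end)
        if end > start:
            covered += end - start
            current_end = end
    return 100.0 * covered / reference_length


# Hypothetical contigs for a 100 nt reference.
n50 = n50_vs_reference([40, 30, 20, 10], reference_length=100)
too_small = n50_vs_reference([10, 10], reference_length=100)
coverage = percent_covered([(0, 50), (40, 60), (80, 100)], reference_length=100)
```

In this toy case `n50` is 30, `too_small` is 0 because the contigs sum to less than half the reference, and `coverage` is 80.0 because overlapping matches are counted once.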
Figure 3. Correctness and size statistics for HIV1 assemblies.
Assembly statistics for HIV1 are shown for all assemblers and various read error rates. a) Correctness scores; b) coverage and N50 divided by genome size.
Figure 4. Correctness and size statistics for O. sativa assemblies with varying pair distances.
Assembly statistics for a) correctness and b) size are shown for unpaired and paired O. sativa reads with distances {400, 1000, 3000} and 50× coverage, assembled by Velvet and ABySS.
