Genome Res. 2012 Mar;22(3):557-67.
doi: 10.1101/gr.131383.111. Epub 2012 Jan 6.

GAGE: A critical evaluation of genome assemblies and assembly algorithms

Steven L Salzberg et al. Genome Res. 2012 Mar.

Erratum in

  • Genome Res. 2012 Jun;22(6):1196

Abstract

New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.

Figures

Figure 1.
Comparison of the indel profiles for three assemblies of human Chr14. Every indel in the assembly is defined by the two aligned segments on either side. For each indel, the x-axis displays the distance between the two adjacent segments in the reference, and the y-axis displays the distance in the query. Thus, the point x = 100, y = 0 indicates a 100-bp deletion in the assembly, relative to the reference. Deletions from the assembly lie below the line y = x, and insertions in the assembly lie above. The indels can be roughly categorized by quadrant: (top right) divergent sequence; (bottom right) segmental assembly deletion; (bottom left) tandem repeat collapse/expansion; (top left) segmental assembly insertion. No points lie on the line y = x because only indels >5 bp are displayed. For details, see the Supplemental Methods.
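The coordinate convention in this caption can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure from the Supplemental Methods: the function name `classify_indel` and its parameter names are hypothetical, and only the rules stated in the caption are encoded (deletions lie below y = x, insertions above, and indels of 5 bp or less are not displayed). The quadrant categories are left out because the caption does not give their boundaries.

```python
def classify_indel(ref_gap, query_gap, min_size=5):
    """Classify one indel from the distances between its two flanking segments.

    ref_gap   -- distance between the adjacent segments in the reference (x-axis)
    query_gap -- distance between the same segments in the assembly (y-axis)
    min_size  -- indels of this size or smaller are ignored (5 bp in the caption)

    Returns (kind, size) or None when the indel is too small to display.
    All names here are illustrative assumptions.
    """
    size = query_gap - ref_gap
    if abs(size) <= min_size:
        return None
    # Points above the line y = x are insertions in the assembly,
    # points below it are deletions from the assembly.
    kind = "insertion" if size > 0 else "deletion"
    return kind, abs(size)


# The caption's own example: x = 100, y = 0 is a 100-bp deletion.
print(classify_indel(100, 0))
```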
Figure 2.
A dot-plot comparison of the SOAPdenovo and Velvet scaffolds of R. sphaeroides. The finished reference chromosomes are plotted on the x-axis and the assembly scaffolds on the y-axis. Dotted lines indicate scaffold or chromosome boundaries. The apparent rearrangement at the top right of the SOAPdenovo plot is an artifact of the circular reference plasmid.
Figure 3.
Assemblies of R. sphaeroides using four different combinations of paired-end libraries as input to the assemblers. Each run used either one library (180 bp only) or a different combination of two libraries from 180 to 3000 bp. Note that N50 values are uncorrected; see Table 3 for the true N50 sizes for the 180 bp + 3 kb combination, which are much lower in some instances; e.g., SOAPdenovo has a corrected N50 of 14.3 kb (rather than 131.7 kb) for assembly with the 180-bp and 3-kb libraries.
Figure 4.
K-mer uniqueness ratio for the three genomes assembled in GAGE: the bacteria S. aureus and R. sphaeroides and human chromosome 14. The ratio is defined as the percentage of a genome that is covered by unique (i.e., non-repetitive) DNA sequences of length K. Shown for comparison are the k-mer uniqueness ratios for the full human genome and for the nematode C. elegans.
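The ratio defined in this caption can be approximated with a short Python sketch. It is a simplification under stated assumptions, not the computation used in the study: it reports the percentage of k-mer start positions whose k-mer occurs exactly once, it looks at the forward strand only, and it ignores reverse complements. The function name `kmer_uniqueness_ratio` is hypothetical.

```python
from collections import Counter


def kmer_uniqueness_ratio(seq, k):
    """Percentage of k-mer positions in seq whose k-mer occurs exactly once.

    Simplifying assumptions (not from the source): forward strand only,
    no reverse-complement handling, whole sequence held in memory.
    """
    n_positions = len(seq) - k + 1
    if k <= 0 or n_positions <= 0:
        raise ValueError("k must be between 1 and len(seq)")
    counts = Counter(seq[i:i + k] for i in range(n_positions))
    # A k-mer seen once contributes exactly one unique position.
    unique_positions = sum(1 for c in counts.values() if c == 1)
    return 100.0 * unique_positions / n_positions


# Longer k-mers are more likely to be unique, so the ratio rises with k.
toy = "ACGTACGA"
for k in (3, 5, 8):
    print(k, round(kmer_uniqueness_ratio(toy, k), 1))
```

For genome-scale input a dedicated k-mer counter would be needed; the in-memory `Counter` here is only suitable for small sequences.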
Figure 5.
Comparison of insertion and deletion errors among all eight assemblers for human chromosome 14. (Blue bars) Indel errors >5 bp in length that are unique to each assembler. (Red bars) Indel errors also made by at least one other assembler. (Green bars) Indels shared by all assemblers, which might represent true differences between the target genome and the reference.
Figure 6.
Average contig (A) and scaffold (B) sizes, measured by N50 values, versus error rates, averaged over all three genomes for which the true assembly is known: S. aureus, R. sphaeroides, and human chromosome 14. Errors (vertical axis) are measured as the average distance between errors, in kilobases. N50 values represent the size N at which 50% of the genome is contained in contigs/scaffolds of length N or larger. In both plots, the best assemblers appear in the upper right.
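The N50 definition in this caption can be sketched in Python as follows. This is a minimal illustration, not the evaluation code used in the study; the function name `n50` and the optional `genome_size` parameter are assumptions introduced here. When `genome_size` is omitted, the sketch falls back to the total assembly length, which is a common convention but differs from measuring against the known genome size.

```python
def n50(lengths, genome_size=None):
    """Return the size N such that contigs/scaffolds of length >= N
    together contain at least 50% of the genome.

    lengths     -- iterable of contig or scaffold lengths
    genome_size -- reference genome size; if None (an assumption of this
                   sketch), the sum of the lengths is used instead
    Returns 0 when the sequences cover less than half of genome_size.
    """
    lengths = sorted(lengths, reverse=True)
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length
    return 0


contigs = [2, 3, 4, 5, 6]
print(n50(contigs))                  # relative to the assembly length (20)
print(n50(contigs, genome_size=30))  # relative to a larger reference
```

Passing the reference size makes N50 values comparable across assemblies of the same genome, since an assembly that omits sequence cannot inflate its score by shrinking the denominator.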
