Comparative Study
Genome Res. 2002 Jan;12(1):177-89.
doi: 10.1101/gr.208902.

ARACHNE: a whole-genome shotgun assembler

Serafim Batzoglou et al. Genome Res. 2002 Jan.

Abstract

We describe a new computer system, called ARACHNE, for assembling genome sequence using paired-end whole-genome shotgun reads. ARACHNE has several key features, including an efficient and sensitive procedure for finding read overlaps, a procedure for scoring overlaps that achieves high accuracy by correcting errors before assembly, read merger based on forward-reverse links, and detection of repeat contigs by forward-reverse link inconsistency. To test ARACHNE, we created simulated reads providing approximately 10-fold coverage of the genomes of H. influenzae, S. cerevisiae, and D. melanogaster, as well as human chromosomes 21 and 22. The assemblies of these simulated reads yielded nearly complete coverage of the respective genomes, with a small number of contigs joined into a smaller number of supercontigs (or scaffolds). For example, analysis of the D. melanogaster genome yielded approximately 98% coverage with an N50 contig length of 324 kb and an N50 supercontig length of 5143 kb. The assembly accuracy was high, although not perfect: small errors occurred at a frequency of roughly 1 per 1 Mb (typically deletions of approximately 1 kb), with a very small number of other misassemblies. The assembly was rapid: the Drosophila assembly required only 21 hours on a single 667 MHz processor and used 8.4 Gb of memory.
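
The abstract reports N50 contig and supercontig lengths without defining the statistic. The sketch below assumes the common definition (the largest length L such that pieces of length at least L contain at least half of the assembled bases); the function name `n50` is illustrative and the paper's exact convention may differ.

```python
def n50(lengths):
    """Largest L such that pieces of length >= L hold at least half of all bases.

    Assumption: the total is the sum of the given lengths (assembled bases),
    not the true genome size.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50 of an empty collection is undefined")
```

For contig lengths 2, 3, 4, 5, and 6 (20 bases in total), the two longest contigs already hold 11 bases, so the N50 is 5.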

Figures

Figure 1
Correcting errors in reads. A portion of a multiple alignment between five reads is shown. In the highlighted column of the alignment, a base T of quality 30 is aligned only to bases C, some of which are of quality greater than 30. The base T is changed to a base C of quality 0.
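
The correction in this caption can be sketched for a single alignment column. This is a simplified rule inferred from the caption alone, not ARACHNE's full criterion: the base is changed only when every other aligned base agrees on one alternative and at least one of them has strictly higher quality. The name `correct_base` and the tuple layout are assumptions.

```python
def correct_base(target, others):
    """Correct one base using the other bases aligned in the same column.

    target: (base, quality) of the read being examined.
    others: list of (base, quality) from the other reads in that column.
    Returns the possibly corrected (base, quality); a corrected base gets
    quality 0, as in the caption.
    """
    base, quality = target
    if not others:
        return target
    other_bases = {b for b, _ in others}
    if len(other_bases) != 1:
        return target  # the other reads disagree among themselves
    (consensus,) = other_bases
    if consensus == base:
        return target  # nothing to correct
    if max(q for _, q in others) <= quality:
        return target  # no supporting base is more trustworthy than the target
    return (consensus, 0)
```

With a T of quality 30 aligned to four C bases of qualities 25, 40, 35, and 20, the function returns `("C", 0)`.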
Figure 2
Using paired pairs of overlaps to merge reads. (A) A paired pair of overlaps. The top two reads are end sequences from one insert, and the bottom two reads are end sequences from another. The two overlaps must not imply too large a discrepancy between the insert lengths. (B) Initially, the top two pairs of reads are merged. Then the third pair of reads (from the top) is merged in, based on having an overlap with one of the top two left reads, an overlap with one of the top two right reads, and consistent insert lengths. The bottom pair is similarly merged.
Figure 3
Contig assembly. (A) How merging reads across the boundary of a repeat may result in a misassembly. Regions A, B, C, and D are unique regions, and region R is a repeat occurring twice in the genome. Reads x and y overlap in region R. Thus, regions A and D are wrongly joined after merging reads x and y. (B) A potential repeat boundary. Read r overlaps both reads x and y, but reads x and y do not overlap each other; they disagree in their rightmost ends. Here, a repeat R starting inside reads x and y and including the full read r is shown. In practice, sequencing errors rather than repeats often cause such patterns of overlap. (C) Contigs are created by merging reads up to the potential boundaries of repeats. A potential repeat boundary is any place where a read may be extended with two nonoverlapping reads. Two regions of the genome covered with reads are shown here. One region (A-R-D) is covered with solid line reads and the second region (C-R-B) with dotted line reads. The two regions meet in the repeat R creating five contigs: these are the unique contigs corresponding to unique sequences A, B, C, and D, and the repeat contig corresponding to the repeat R, where reads from both copies of R are overcollapsed into one contig. According to the algorithm used to construct contigs, the contig corresponding to R would have exactly the reads that are fully included in the boundaries of R. All the other reads would be assigned to contigs A, B, C, and D. (D) Sequencing errors. Read r dominates read y because the neighbors of y are all neighbors of r. This is caused by a sequencing error on y, which is marked in the figure. Note that if y represented correct sequence, it would likely be extended to the right by some read that did not overlap r, and thus r would not dominate y.
Figure 4
Detection of repeat contigs. Contig R is linked to contigs A and B to the right. The distances estimated between R and A and R and B are such that A and B cannot be positioned without substantial overlap between them. If there is no corresponding detected overlap between A and B (if their reads do not overlap), then R is probably a repeat linking to two unique regions to the right.
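
The test described in this caption reduces to interval arithmetic. The sketch below places A and B on a line measured from the right end of R and asks whether they would have to overlap substantially. The function names and the `tolerance` threshold are hypothetical, since the caption gives no numeric cutoff.

```python
def implied_overlap(gap_a, len_a, gap_b, len_b):
    """Overlap (in bases) implied between A and B when both lie right of R.

    gap_a, gap_b: estimated distances from the right end of R to A and B.
    A negative result means the two placements leave a gap instead.
    """
    start_a, end_a = gap_a, gap_a + len_a
    start_b, end_b = gap_b, gap_b + len_b
    return min(end_a, end_b) - max(start_a, start_b)


def looks_like_repeat(gap_a, len_a, gap_b, len_b, reads_overlap, tolerance=500):
    """Flag R as a probable repeat contig.

    reads_overlap: True if an overlap between reads of A and B was detected.
    tolerance: hypothetical slack (bases) for noise in the distance estimates.
    """
    substantial = implied_overlap(gap_a, len_a, gap_b, len_b) > tolerance
    return substantial and not reads_overlap
```

For example, two 5 kb contigs estimated at 1 kb and 2 kb to the right of R would have to share 4 kb; if their reads show no such overlap, R is flagged.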
Figure 5
Supercontig creation and gap filling. (A) A supercontig is constructed by successively linking pairs of contigs that share at least two forward-reverse links. Here, three contigs are joined into one supercontig. (B) ARACHNE attempts to fill gaps by using paths of contigs. The first gap in the supercontig shown here is filled with one contig, and the second gap is filled by a path consisting of two contigs.
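
The linking rule in panel A (join contigs that share at least two forward-reverse links) can be sketched with a union-find structure. This only groups contigs; it ignores the ordering, orientation, and join priorities that a real scaffolder needs, and the name `build_supercontigs` and the input layout are assumptions.

```python
from collections import Counter


def build_supercontigs(contigs, links, min_links=2):
    """Group contigs connected by at least `min_links` forward-reverse links.

    contigs: iterable of contig names.
    links: iterable of (contig_a, contig_b), one entry per linking read pair;
           both names are assumed to appear in `contigs`.
    Returns a sorted list of sorted groups.
    """
    contigs = list(contigs)
    parent = {c: c for c in contigs}

    def find(c):
        # Follow parent pointers to the representative, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Count links per unordered contig pair, skipping links within one contig.
    counts = Counter(frozenset(pair) for pair in links if pair[0] != pair[1])
    for pair, n in counts.items():
        if n >= min_links:
            a, b = pair
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())
```

With two links between A and B, two between B and C, and a single link between C and D, the result is `[["A", "B", "C"], ["D"]]`.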
Figure 6
Types of misassemblies. (A) Three types of simple minor misassemblies are shown: insertions, deletions, and hanging ends. In all three cases, a contiguous segment (of a contig or the genome) of length less than 10 kb does not align in the expected location (with the genome or contig). This segment could be aligned at some alternate location in most cases, although we do not do this in practice. Compound minor misassemblies (e.g., contigs having two insertions) are reported as multiple misassembly events. (B) Two types of major misassemblies are shown. In the first type, two pieces of a contig align to distant parts of the genome (if one piece is very short, we instead report a hanging end, as in A). In the second type, adjacent contigs in a supercontig are aligned to distant parts of the genome. In practice, what we typically encounter is a hybrid between these two types: a contig that lies in the middle of a supercontig is split as in the first type. We call this hybrid the standard major misassembly.
Figure 7
Coverage in assemblies of 10-fold simulated reads. (A) Coverage of the genome with contigs. Contigs of sizes >250 kb cover 50%–70% of the genome. (B) Coverage of the genome with supercontigs. Supercontigs of size >1 Mb cover at least 65% of the genome in all test examples.
Figure 8
Partial alignments in the alignment module. Three partial alignments of length k = 6 between a pair of reads coalesce to yield a single full alignment of length k = 19. Vertical bars denote matching bases, whereas x's denote mismatches. This illustrates the commonly occurring situation where an extended k-mer hit is a full alignment between two reads (k = 6 is used in the figure for simplicity).
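
One way to picture the coalescing step is to merge k-mer hits that lie on the same diagonal of the two reads. The sketch below is an illustration under stated assumptions, not the ARACHNE alignment module: `shared_kmer_hits`, `coalesce`, and the `max_gap` parameter (how many unmatched bases may separate two hits that are still merged) are all hypothetical.

```python
def shared_kmer_hits(a, b, k):
    """Return (i, j) pairs such that a[i:i+k] == b[j:j+k]."""
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)
    hits = []
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], []):
            hits.append((i, j))
    return hits


def coalesce(hits, k, max_gap=0):
    """Merge hits on the same diagonal into extended hits.

    Returns sorted (start_in_a, start_in_b, length) tuples. Two hits are
    merged when they overlap, touch, or are at most `max_gap` bases apart.
    """
    by_diagonal = {}
    for i, j in hits:
        by_diagonal.setdefault(i - j, []).append(i)
    merged = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        run_start, run_end = starts[0], starts[0] + k
        for i in starts[1:]:
            if i <= run_end + max_gap:
                run_end = max(run_end, i + k)
            else:
                merged.append((run_start, run_start - diagonal, run_end - run_start))
                run_start, run_end = i, i + k
        merged.append((run_start, run_start - diagonal, run_end - run_start))
    return sorted(merged)
```

As a toy case chosen to mirror the numbers in the caption, three 6-mer hits starting at positions 0, 7, and 13 on the same diagonal (one mismatched base after the first hit) merge into a single extended hit of length 19 when `max_gap=1`.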
Figure 9
Detection of chimeric reads. Reads l1, l2, l3, r1, r2, and r3, and the absence of a read n (having long overlaps on both sides of a point x) suggest that read c may be chimeric, consisting of the juxtaposition of two disparate genomic segments: one corresponding to the part of c before x, and one corresponding to the part of c after x. We call x the point of chimerism of c. Note that reads l3 and r3 extend slightly beyond x, as often happens for real chimeric reads.
Figure 10
Contig assembly. If (a,b) and (a,c) overlap, then (b,c) are expected to overlap. Moreover, one can calculate that shift(b,c) ≈ shift(a,c) − shift(a,b). We detect a repeat boundary toward the right of read a if there is no overlap (b,c), nor any path of reads x1, …, xk such that (b,x1), (x1,x2), …, (xk,c) are all overlaps and shift(b,x1) + … + shift(xk,c) ≈ shift(a,c) − shift(a,b).
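
The shift-consistency test in this caption can be sketched as a bounded path search over the overlap graph. The data layout (a dict mapping each read to its right-hand overlaps and shifts), the `tolerance` for "≈", and the `max_steps` bound are assumptions made for illustration.

```python
def has_consistent_path(overlaps, b, c, target, tolerance=10, max_steps=5):
    """Is there a path of overlaps from b to c whose shifts sum to about `target`?

    overlaps: dict mapping a read to a list of (neighbor, shift) pairs.
    A direct overlap (b, c) counts as a path with no intermediate reads.
    """
    stack = [(b, 0, 0, frozenset([b]))]
    while stack:
        read, total, steps, seen = stack.pop()
        for nxt, shift in overlaps.get(read, []):
            new_total = total + shift
            if nxt == c:
                if abs(new_total - target) <= tolerance:
                    return True
                continue
            if nxt not in seen and steps + 1 < max_steps:
                stack.append((nxt, new_total, steps + 1, seen | {nxt}))
    return False


def repeat_boundary_after(shift_ab, shift_ac, overlaps, b, c, **kwargs):
    """Report a potential repeat boundary to the right of read a."""
    expected = shift_ac - shift_ab  # shift(b,c) ≈ shift(a,c) − shift(a,b)
    return not has_consistent_path(overlaps, b, c, expected, **kwargs)
```

If shift(a,b) = 100 and shift(a,c) = 200, a path b → x → c with shifts 40 and 62 sums to 102, which is within the default tolerance of the expected 100, so no repeat boundary is reported.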
Figure 11
Consistency of forward-reverse links. (A) The distance d(A,B) (length of gap or negated length of overlap) between two linked contigs A and B can be estimated using the forward-reverse linked reads between them. (B) The distance d(B,C) between two contigs B,C that are linked to the same contig A, can be estimated from their respective distances to the linked contig.
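
The caption does not give formulas for these estimates, so the sketch below is plain geometry under stated assumptions: the forward read starts at offset `start_in_a` in A, its mate ends at offset `end_in_b` in B, the insert length is known, and B and C both lie to the right of A with B first. All function and parameter names are hypothetical.

```python
from statistics import mean


def gap_from_link(insert_len, len_a, start_in_a, end_in_b):
    """Gap implied by one link; a negative value is a negated overlap length.

    The insert covers the tail of A, the gap, and the head of B:
    insert_len = (len_a - start_in_a) + gap + end_in_b.
    """
    return insert_len - (len_a - start_in_a) - end_in_b


def estimate_gap(len_a, links):
    """Average the per-link estimates; links are (insert_len, start_in_a, end_in_b)."""
    return mean(gap_from_link(ins, len_a, start, end) for ins, start, end in links)


def gap_between_siblings(d_ab, len_b, d_ac):
    """d(B,C) for contigs B and C both linked to A, assuming B precedes C."""
    return d_ac - d_ab - len_b
```

For a 10 kb contig A, a 4000-base insert whose forward read starts at offset 8500 in A and whose mate ends at offset 1200 in B implies a gap of 4000 − 1500 − 1200 = 1300 bases.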
Figure 12
Filling gaps in supercontigs. (A) Contigs A and B are connected by a path p of contigs X1, …, Xk. The distance dp(A,B) between A and B (along the path p) is the length of the sequence in the path that does not overlap A or B. (B) Contigs Y1 and Y2 share forward-reverse links with the supercontig S. These links position them in the vicinity of the gap between A and B. Therefore, Y1 and Y2 will be used as possible stepping points in the path closing the gap from A to B.
