Comparative Study

Genome Res. 2002 Jan;12(1):177-89.

doi: 10.1101/gr.208902.

ARACHNE: a whole-genome shotgun assembler

Serafim Batzoglou¹, David B Jaffe, Ken Stanley, Jonathan Butler, Sante Gnerre, Evan Mauceli, Bonnie Berger, Jill P Mesirov, Eric S Lander

PMID: 11779843
PMCID: PMC155255

Serafim Batzoglou et al. Genome Res. 2002 Jan.

Affiliation

¹ Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA.

Abstract

We describe a new computer system, called ARACHNE, for assembling genome sequence using paired-end whole-genome shotgun reads. ARACHNE has several key features, including an efficient and sensitive procedure for finding read overlaps, a procedure for scoring overlaps that achieves high accuracy by correcting errors before assembly, read merger based on forward-reverse links, and detection of repeat contigs by forward-reverse link inconsistency. To test ARACHNE, we created simulated reads providing approximately 10-fold coverage of the genomes of H. influenzae, S. cerevisiae, and D. melanogaster, as well as human chromosomes 21 and 22. The assemblies of these simulated reads yielded nearly complete coverage of the respective genomes, with a small number of contigs joined into a smaller number of supercontigs (or scaffolds). For example, analysis of the D. melanogaster genome yielded approximately 98% coverage with an N50 contig length of 324 kb and an N50 supercontig length of 5143 kb. The assembly accuracy was high, although not perfect: small errors occurred at a frequency of roughly 1 per 1 Mb (typically deletions of approximately 1 kb), with a very small number of other misassemblies. The assembly was rapid: the Drosophila assembly required only 21 hours on a single 667 MHz processor and used 8.4 GB of memory.
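The N50 figures quoted above can be illustrated with a short sketch. It assumes the common definition (the largest length L such that contigs of length at least L contain at least half of the assembled bases); the abstract does not say whether the paper normalizes by assembly size or by genome size, so the function name `n50` and that choice are assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or supercontig) lengths.

    Assumption: half of the *assembled* bases is used as the threshold,
    not half of the true genome size.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total reaches half is the N50.
        if running >= half:
            return length
    raise ValueError("n50 requires at least one contig length")
```

For example, `n50([2, 2, 2, 3, 3, 4, 8, 8])` looks for the point where the longest contigs first account for 16 of the 32 total bases.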

Figures

**Figure 1**
Correcting errors in reads. A portion of a multiple alignment between five reads is shown. In the highlighted column of the alignment, a base T of quality 30 is aligned only to bases C, some of which are of quality greater than 30. The base T is changed to a base C of quality 0.
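The correction in Figure 1 can be sketched for a single alignment column. The rule below is only what the caption describes (a base that disagrees with every other aligned base, at least one of which has higher quality, is replaced by the consensus base at quality 0); the function name `correct_base` and the exact acceptance criterion are assumptions, not ARACHNE's actual implementation.

```python
def correct_base(column, target):
    """Decide the base call for one read in one alignment column.

    column: list of (base, quality) pairs, one per aligned read.
    target: index of the read being checked.
    Returns the (base, quality) pair to use for the target read.
    """
    base, qual = column[target]
    others = [bq for i, bq in enumerate(column) if i != target]
    other_bases = {b for b, _ in others}
    # Only correct when all other reads agree on a single base.
    if len(other_bases) != 1:
        return base, qual
    consensus = next(iter(other_bases))
    if consensus == base:
        return base, qual
    # Require at least one supporting base of higher quality (assumed rule).
    if any(q > qual for _, q in others):
        return consensus, 0  # corrected base gets quality 0, as in the caption
    return base, qual
```

Applied to a column such as `[("C", 40), ("C", 25), ("T", 30), ("C", 35), ("C", 20)]` with `target=2`, the T of quality 30 becomes a C of quality 0.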

**Figure 2**
Using paired pairs of overlaps to merge reads. (A) A paired pair of overlaps. The top two reads are end sequences from one insert, and the bottom two reads are end sequences from another. The two overlaps must not imply too large a discrepancy between the insert lengths. (B) Initially, the top two pairs of reads are merged. Then the third pair of reads (from the top) is merged in, based on having an overlap with one of the top two left reads, an overlap with one of the top two right reads, and consistent insert lengths. The bottom pair is similarly merged.
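The insert-length condition in panel A can be sketched as follows. The sketch assumes each overlap is summarized by a shift measured in genome orientation (how far the second insert's end lies to the right of the first insert's corresponding end); the names `is_paired_pair` and `tolerance` are hypothetical, and the caption gives no numeric threshold.

```python
def is_paired_pair(left_shift, right_shift, len1, len2, tolerance):
    """Check that two overlaps are consistent with the two insert lengths.

    left_shift:  offset of insert 2's left end relative to insert 1's left end.
    right_shift: offset of insert 2's right end relative to insert 1's right end.
    len1, len2:  expected insert lengths.
    tolerance:   largest acceptable discrepancy (assumed parameter).
    """
    # If insert 1 spans [p1, p1 + len1] and insert 2 spans [p2, p2 + len2],
    # then right_shift - left_shift should equal len2 - len1.
    implied_difference = right_shift - left_shift
    expected_difference = len2 - len1
    return abs(implied_difference - expected_difference) <= tolerance
```

Two 4-kb inserts whose left ends are shifted by 120 bases and whose right ends are shifted by 170 bases would pass with a tolerance of 200, while a right-end shift of 2120 would not.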

**Figure 3**
Contig assembly. (A) How merging reads across the boundary of a repeat may result in a misassembly. Regions A, B, C, and D are unique regions, and region R is a repeat occurring twice in the genome. Reads x and y overlap in region R. Thus, regions A and D are wrongly joined after merging reads x and y. (B) A potential repeat boundary. Read r overlaps both reads x and y, but reads x and y do not overlap each other; they disagree in their rightmost ends. Here, a repeat R starting inside reads x and y and including the full read r is shown. In practice, sequencing errors rather than repeats often cause such patterns of overlap. (C) Contigs are created by merging reads up to the potential boundaries of repeats. A potential repeat boundary is any place where a read may be extended with two nonoverlapping reads. Two regions of the genome covered with reads are shown here. One region (A-R-D) is covered with solid line reads and the second region (C-R-B) with dotted line reads. The two regions meet in the repeat R creating five contigs: these are the unique contigs corresponding to unique sequences A, B, C, and D, and the repeat contig corresponding to the repeat R, where reads from both copies of R are overcollapsed into one contig. According to the algorithm used to construct contigs, the contig corresponding to R would have exactly the reads that are fully included in the boundaries of R. All the other reads would be assigned to contigs A, B, C, and D. (D) Sequencing errors. Read r dominates read y because the neighbors of y are all neighbors of r. This is caused by a sequencing error on y, which is marked in the figure. Note that if y represented correct sequence, it would likely be extended to the right by some read that did not overlap r, and thus r would not dominate y.
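Two of the tests described in Figure 3 can be sketched on a small overlap graph: the potential repeat boundary of panels B and C, and the domination relation of panel D. The sketch assumes a symmetric overlap map (`overlaps[u]` is the set of reads overlapping `u`) and a separate, hypothetical `right_extensions` map listing the reads that extend each read to the right; neither data structure is taken from the paper.

```python
from itertools import combinations


def has_potential_repeat_boundary(overlaps, right_extensions, read):
    """True if `read` can be extended to the right by two reads
    that do not overlap each other (panels B and C)."""
    extensions = right_extensions.get(read, [])
    return any(b not in overlaps[a] for a, b in combinations(extensions, 2))


def dominates(overlaps, r, y):
    """True if every neighbor of y (other than r itself) is also
    a neighbor of r (panel D)."""
    return (overlaps[y] - {r}) <= overlaps[r]
```

In the panel B configuration, `overlaps = {"r": {"x", "y"}, "x": {"r"}, "y": {"r"}}` with `right_extensions = {"r": ["x", "y"]}` reports a potential boundary at `r`, because `x` and `y` both extend `r` but do not overlap each other.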

**Figure 4**
Detection of repeat contigs. Contig R is linked to contigs A and B to the right. The distances estimated between R and A and R and B are such that A and B cannot be positioned without substantial overlap between them. If there is no corresponding detected overlap between A and B (if their reads do not overlap), then R is probably a repeat linking to two unique regions to the right.
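The inconsistency test in Figure 4 can be sketched with interval arithmetic. Contigs A and B are placed to the right of R using the link-based distance estimates, and R is flagged when the placements overlap substantially but no read overlap between A and B was found. The function name and the `min_overlap` threshold are assumptions; the caption says only "substantial overlap".

```python
def looks_like_repeat(d_ra, len_a, d_rb, len_b, reads_overlap, min_overlap=500):
    """Flag contig R as a probable repeat.

    d_ra, d_rb:    estimated distances from R's right end to A and to B.
    len_a, len_b:  contig lengths.
    reads_overlap: True if an overlap between reads of A and B was detected.
    min_overlap:   what counts as "substantial" (assumed value).
    """
    # Place A on [d_ra, d_ra + len_a] and B on [d_rb, d_rb + len_b].
    implied_overlap = min(d_ra + len_a, d_rb + len_b) - max(d_ra, d_rb)
    return implied_overlap >= min_overlap and not reads_overlap
```

A negative `implied_overlap` means the two placements leave a gap, in which case the links are consistent and R is not flagged.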

**Figure 5**
Supercontig creation and gap filling. (A) A supercontig is constructed by successively linking pairs of contigs that share at least two forward-reverse links. Here, three contigs are joined into one supercontig. (B) ARACHNE attempts to fill gaps by using paths of contigs. The first gap in the supercontig shown here is filled with one contig, and the second gap is filled by a path consisting of two contigs.
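The linking rule in panel A (join contigs that share at least two forward-reverse links) can be sketched as a grouping step. This is a simplification: it groups contigs with a union-find structure and ignores ordering, orientation, and the successive, priority-driven joining that an assembler would need. The function name and input format are hypothetical.

```python
from collections import Counter


def group_into_supercontigs(contigs, links, min_links=2):
    """Group contigs connected by at least `min_links` forward-reverse links.

    contigs: iterable of contig names.
    links:   iterable of (contig, contig) pairs, one per forward-reverse link.
    Returns a sorted list of sorted name lists, one per supercontig.
    """
    contigs = list(contigs)
    parent = {c: c for c in contigs}

    def find(c):
        # Follow parent pointers to the group representative (path halving).
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Count links per unordered contig pair.
    counts = Counter(frozenset(pair) for pair in links)
    for pair, n in counts.items():
        if n >= min_links and len(pair) == 2:
            a, b = pair
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(g) for g in groups.values())
```

With links A-B (twice), B-C (twice), and C-D (once), contigs A, B, and C form one supercontig and D stays on its own.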

**Figure 6**
Types of misassemblies. (A) Three types of simple minor misassemblies are shown: insertions, deletions, and hanging ends. In all three cases, a contiguous segment (of a contig or the genome) of length less than 10 kb does not align in the expected location (with the genome or contig). This segment could be aligned at some alternate location in most cases, although we do not do this in practice. Compound minor misassemblies (e.g., contigs having two insertions) are reported as multiple misassembly events. (B) Two types of major misassemblies are shown. In the first type, two pieces of a contig align to distant parts of the genome (if one piece is very short, we instead report a hanging end, as in A). In the second type, adjacent contigs in a supercontig are aligned to distant parts of the genome. In practice, what we typically encounter is a hybrid between these two types: a contig that lies in the middle of a supercontig is split as in the first type. We call this hybrid the standard major misassembly.

**Figure 7**
Coverage in assemblies of 10-fold simulated reads. (A) Coverage of the genome with contigs. Contigs of sizes >250 kb cover 50%–70% of the genome. (B) Coverage of the genome with supercontigs. Supercontigs of size >1 Mb cover at least 65% of the genome in all test examples.

**Figure 8**
Partial alignments in the alignment module. Three partial alignments of length k = 6 between a pair of reads coalesce to yield a single full alignment of length k = 19. Vertical bars denote matching bases, whereas x's denote mismatches. This illustrates the commonly occurring situation where an extended k-mer hit is a full alignment between two reads (k = 6 is used in the figure for simplicity).
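The coalescing of k-mer hits in Figure 8 can be sketched by grouping shared k-mers by diagonal (the offset between their positions in the two reads) and merging hits on the same diagonal that overlap or lie within a small gap. This is an illustrative stand-in, not ARACHNE's alignment module: the function name, the dictionary-based k-mer index, and the `max_gap` parameter (which lets a single mismatch be bridged) are all assumptions.

```python
from collections import defaultdict


def merged_kmer_hits(read_a, read_b, k, max_gap=1):
    """Return (diagonal, start_in_a, end_in_a) segments built from shared k-mers.

    Hits on the same diagonal are merged when the next hit starts no more
    than `max_gap` bases after the current segment ends.
    """
    # Index every k-mer position in read_b.
    positions = defaultdict(list)
    for j in range(len(read_b) - k + 1):
        positions[read_b[j:j + k]].append(j)

    # Collect hit start positions in read_a, grouped by diagonal i - j.
    by_diagonal = defaultdict(list)
    for i in range(len(read_a) - k + 1):
        for j in positions.get(read_a[i:i + k], ()):
            by_diagonal[i - j].append(i)

    segments = []
    for diagonal, starts in sorted(by_diagonal.items()):
        starts.sort()
        seg_start, seg_end = starts[0], starts[0] + k
        for i in starts[1:]:
            if i <= seg_end + max_gap:
                seg_end = max(seg_end, i + k)  # extend the current segment
            else:
                segments.append((diagonal, seg_start, seg_end))
                seg_start, seg_end = i, i + k
        segments.append((diagonal, seg_start, seg_end))
    return segments
```

For two 13-base reads that differ only at position 6, with k = 6, the two exact hits on diagonal 0 merge into one segment covering the whole read; with `max_gap=0` they stay separate.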

**Figure 9**
Detection of chimeric reads. Reads l₁, l₂, l₃, r₁, r₂, and r₃, and the absence of a read n (having long overlaps on both sides of a point x) suggest that read c may be chimeric, consisting of the juxtaposition of two disparate genomic segments: one corresponding to the part of c before x, and one corresponding to the part of c after x. We call x the *point of chimerism* of c. Note that reads l₃ and r₃ extend slightly beyond x, as often happens for real chimeric reads.

**Figure 10**
Contig assembly. If *(a,b)* and *(a,c)* overlap, then *(b,c)* are expected to overlap. Moreover, one can calculate that *shift(b,c)* ≈ *shift(a,c)* − *shift(a,b)*. We detect a repeat boundary toward the right of read a, if there is no overlap *(b,c)*, nor any path of reads *x₁, …, x_k* such that *(b,x₁), (x₁,x₂), …, (x_k,c)* are all overlaps, and *shift(b,x₁)* + … + *shift(x_k,c)* ≈ *shift(a,c)* − *shift(a,b)*.
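The shift-consistency test in Figure 10 can be sketched as a bounded path search. Overlaps are stored as a dictionary from ordered read pairs to shifts; the names `consistent_path_exists`, `repeat_boundary_to_right`, `tolerance`, and `max_steps` are hypothetical, and bounding the path length is a simplification that the caption does not mention.

```python
def consistent_path_exists(shifts, b, c, expected, tolerance, max_steps=3):
    """True if some path b -> x1 -> ... -> c of at most `max_steps` overlaps
    has total shift within `tolerance` of `expected`."""
    stack = [(b, 0, 0)]  # (current read, accumulated shift, overlaps used)
    while stack:
        node, total, steps = stack.pop()
        for (u, v), s in shifts.items():
            if u != node:
                continue
            new_total = total + s
            if v == c:
                if abs(new_total - expected) <= tolerance:
                    return True
            elif steps + 1 < max_steps:
                stack.append((v, new_total, steps + 1))
    return False


def repeat_boundary_to_right(shifts, a, b, c, tolerance, max_steps=3):
    """Apply the caption's test for reads b and c that both overlap a."""
    if (b, c) in shifts:
        return False  # b and c overlap directly, so no boundary is called
    expected = shifts[(a, c)] - shifts[(a, b)]
    return not consistent_path_exists(shifts, b, c, expected, tolerance, max_steps)
```

With shift(a,b) = 100 and shift(a,c) = 250, the expected shift(b,c) is 150, so a path b → x → c with shifts 80 and 75 (total 155) is accepted at a tolerance of 10.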

**Figure 11**
Consistency of forward-reverse links. (A) The distance *d(A,B)* (length of gap or negated length of overlap) between two linked contigs A and B can be estimated using the forward-reverse linked reads between them. (B) The distance *d(B,C)* between two contigs B and C that are linked to the same contig A can be estimated from their respective distances to the linked contig.
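The two estimates in Figure 11 can be sketched with simple arithmetic. The formulas below are assumptions chosen for illustration: each link contributes the insert length minus the portion of the insert lying inside each contig, the estimates are averaged, and B and C are both taken to lie on the same side of A with B nearer. The paper's actual estimator (for example, how it weights links) is not given in the caption.

```python
from statistics import mean


def estimate_gap(links):
    """Estimate d(A,B) from forward-reverse links.

    links: iterable of (insert_length, bases_in_a, bases_in_b), where
    bases_in_a and bases_in_b are the parts of the insert inside A and B.
    A negative result corresponds to an overlap between the contigs.
    """
    return mean(length - in_a - in_b for length, in_a, in_b in links)


def implied_distance(d_ab, len_b, d_ac):
    """Estimate d(B,C) when B and C both lie to the right of A and B is nearer:
    the gap from A to C minus the gap from A to B minus B itself."""
    return d_ac - d_ab - len_b
```

Two links of 4000 and 4100 bases with 3500 bases of each insert inside the contigs give gap estimates of 500 and 600, for a mean of 550.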

**Figure 12**
Filling gaps in supercontigs. (A) Contigs A and B are connected by a path p of contigs *X₁, …, X_k*. The distance *d_p(A,B)* between A and B (along the path p) is the length of the sequence in the path that does not overlap A or B. (B) Contigs Y₁ and Y₂ share forward-reverse links with the supercontig S. These links position them in the vicinity of the gap between A and B. Therefore, Y₁ and Y₂ will be used as possible stepping points in the path closing the gap from A to B.
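The path distance *d_p(A,B)* of panel A can be sketched from contig lengths and pairwise overlaps. The sketch assumes that consecutive elements of A, X₁, …, X_k, B overlap by known amounts and that no path contig is so short that its overlaps with both neighbors cover the same bases; the function names and the `tolerance` parameter are hypothetical.

```python
def path_distance(path_lengths, overlaps):
    """Length of path sequence that overlaps neither A nor B.

    path_lengths: lengths of X1, ..., Xk.
    overlaps:     overlap lengths for (A,X1), (X1,X2), ..., (Xk,B),
                  so there must be k + 1 of them.
    """
    if len(overlaps) != len(path_lengths) + 1:
        raise ValueError("expected one more overlap than path contigs")
    # Each overlap removes bases that are already counted in a neighbor.
    return sum(path_lengths) - sum(overlaps)


def path_closes_gap(path_lengths, overlaps, link_distance, tolerance):
    """Accept the path if its distance agrees with the link-based estimate."""
    return abs(path_distance(path_lengths, overlaps) - link_distance) <= tolerance
```

A path of two contigs of 1000 and 800 bases with overlaps of 200, 150, and 100 bases spans 1350 bases of new sequence between A and B.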
