SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing

Juliane C Dohm¹, Claudio Lottaz, Tatiana Borodina, Heinz Himmelbauer

Affiliations

PMID: 17908823
PMCID: PMC2045152
DOI: 10.1101/gr.6435207

SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing

Juliane C Dohm et al. Genome Res. 2007 Nov.

. 2007 Nov;17(11):1697-706.

doi: 10.1101/gr.6435207. Epub 2007 Oct 1.

Authors

Juliane C Dohm¹, Claudio Lottaz, Tatiana Borodina, Heinz Himmelbauer

Affiliation

¹ Max-Planck-Institute for Molecular Genetics, 14195 Berlin-Dahlem, Germany.

PMID: 17908823
PMCID: PMC2045152
DOI: 10.1101/gr.6435207

Abstract

The latest revolution in the DNA sequencing field has been brought about by the development of automated sequencers that are capable of generating giga base pair data sets quickly and at low cost. Applications of such technologies seem to be limited to resequencing and transcript discovery, due to the shortness of the generated reads. In order to extend the fields of application to de novo sequencing, we developed the SHARCGS algorithm to assemble short-read (25-40-mer) data with high accuracy and speed. The efficiency of SHARCGS was tested on BAC inserts from three eukaryotic species, on two yeast chromosomes, and on two bacterial genomes (Haemophilus influenzae, Escherichia coli). We show that 30-mer-based BAC assemblies have N50 sizes >20 kbp for Drosophila and Arabidopsis and >4 kbp for human in simulations taking missing reads and wrong base calls into account. We assembled 949,974 contigs with length >50 bp, and only one single contig could not be aligned error-free against the reference sequences. We generated 36-mer reads for the genome of Helicobacter acinonychis on the Illumina 1G sequencing instrument and assembled 937 contigs covering 98% of the genome with an N50 size of 3.7 kbp. With the exception of five contigs that differ in 1-4 positions relative to the reference sequence, all contigs matched the genome error-free. Thus, SHARCGS is a suitable tool for fully exploiting novel sequencing technologies by assembling sequence contigs de novo with high confidence and by outperforming existing assembly algorithms in terms of speed and accuracy.

PubMed Disclaimer

Figures

**Figure 1.**
Distributions of N50 values in relation to different read length illustrated using box plots. An N50 length of x kbp indicates that 50% of the source sequence is covered by contigs of x kbp or larger. The data was collected from assemblies under perfect assumptions (all possible reads are present exactly once, no sequencing errors), based on 20 BACs with an average size of 110 kbp for each species. The boxes stretch from the first to the third quartile of the respective distribution, thus covering 50% of the corresponding data. The boxes are split with a line at the median. For sake of clarity, the median value is given for each box plot. Whiskers indicated with dashed lines are 1.5 times longer than the box but do not stretch beyond minima and maxima. They are used to define the outliers, data points outside the range of the whiskers, which are marked as small circles.

**Figure 2.**
Detection of ambiguities during contig elongation despite missing read information. (A) Target sequence to assemble. Repetitive segments are shown in gray, unique sequence segments are shown in white and indicated with numbers. The positions of reads A–F and S are also shown in panel A. These reads are used to explain the principles of error-free contig extension in panels B–E. (B) Assembly under idealized assumptions (all reads present). Read S to be elongated to the *right*. An ambiguity is found, both reads B and D could extend read S; neither read D nor B will be assembled. (*C, D*) Assembly under real assumptions, but without robust contig extension. (C) Read S to be elongated to the *right*, read B is missing so that the ambiguity cannot be detected; read D would be assembled to S. (D) Naively built contig S₁ will be elongated to the far *right* end (not shown), then the elongation to the *left* starts, one crucial read is missing again (read C) and the ambiguity cannot be detected; read A would be assembled to contig S₁, wrongly connecting sequence segment 1 with sequence segment 3. (E) Assembly under real assumptions with SHARCGS’ robust contig extension. Different overlaps are checked before assembling D, and if any ambiguity is found for smaller overlaps the assembly of D will be rejected. Since both reads E and F could be assembled to the read S, read D will not be assembled to read S.

**Figure 3.**
Read filtering and selection prior to assembly to avoid inclusion of reads containing sequencing errors. (A) Proportion of correct reads that remain in the data set by applying two- to fourfold redundancy filtering (i.e., by counting reads present at least twice, threefold, or fourfold in the data set) depending on BAC pool size (average insert size per BAC 110 kbp). (B) Proportion of false reads that remain in the data set for assembly after application of two- to fourfold redundancy filtering. Data were simulated with 0.6% error rate.

**Figure 4.**
Dependency of N50 on BAC pool size for three different species (*A–C*). The data shown were collected from assemblies under realistic assumptions, with an error rate of 0.6% per base call and reads generated from randomly chosen positions. We mimic the BAC pool size by generating the proportional number of reads, i.e., for pool size n, we generate 5 million/n reads per BAC, corresponding to a coverage of 1363/n. BAC pool sizes (1–16) and corresponding coverages are indicated at the *bottom* of the figure. For each species we used 20 BACs with average length 110 kbp and simulated 30-mer data sets three times per BAC. Thus, for each pool size distribution, 60 assembly results are shown using box plots. Boxes stretch from the first to the third quartile of the respective distribution and cover 50% of the corresponding data. The median inside each box is indicated with a line. Whiskers (dashed lines) are 1.5 times longer than the box but do not stretch beyond minima and maxima and are used to define the outliers (small circles).

**Figure 5.**
Summary of decrease in N50 size due to pooling (for corresponding coverages, see Fig. 4). The plot compares N50 values for a given pool size with the optimal N50 achieved for pool size 1. The lines show the average fraction of N50 per pool size divided by the optimal (“best”) N50. For each pool size, we average over 20 BACs and three simulations each, using the same data as in Fig. 4.

**Figure 6.**
Description of the core assembly algorithm. (A) Pseudocode overview of the steps during assembly of a single contig. The parameter o_min controls the stringency of the algorithm, and r denotes the read length. (B) Illustration of the elongation step. Contig C is to be elongated to the *right*. Read R is a candidate for elongation found in the data set of reads, because its prefix (gray) matches the end of C perfectly. The suffix of read R (white) is the potential extension E for contig C. The length of the check region M is the sum of read length r, and the length of the extension E. Substrings of M and its reverse complement are used to search for matching read prefixes in the data set. Only if all of these reads match M exactly is C extended by E.

See this image and copyright information in PMC

References

1. Adams M.D., Celniker S.E., Holt R.A., Evans C.A., Gocayne J.D., Amanatides P.G., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., Celniker S.E., Holt R.A., Evans C.A., Gocayne J.D., Amanatides P.G., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., Holt R.A., Evans C.A., Gocayne J.D., Amanatides P.G., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., Evans C.A., Gocayne J.D., Amanatides P.G., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., Gocayne J.D., Amanatides P.G., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., Amanatides P.G., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., Li P.W., Hoskins R.A., Galle R.F., Hoskins R.A., Galle R.F., Galle R.F. The genome sequence of Drosophila melanogaster. Science. 2000;287:2185–2195. - PubMed
1. Aparicio S., Chapman J., Stupka E., Putnam N., Chia J.M., Dehal P., Christoffels A., Rash S., Hoon S., Smit A., Chapman J., Stupka E., Putnam N., Chia J.M., Dehal P., Christoffels A., Rash S., Hoon S., Smit A., Stupka E., Putnam N., Chia J.M., Dehal P., Christoffels A., Rash S., Hoon S., Smit A., Putnam N., Chia J.M., Dehal P., Christoffels A., Rash S., Hoon S., Smit A., Chia J.M., Dehal P., Christoffels A., Rash S., Hoon S., Smit A., Dehal P., Christoffels A., Rash S., Hoon S., Smit A., Christoffels A., Rash S., Hoon S., Smit A., Rash S., Hoon S., Smit A., Hoon S., Smit A., Smit A., et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science. 2002;297:1301–1310. - PubMed
1. Batzoglou S., Jaffe D.B., Stanley K., Butler J., Gnerre S., Mauceli E., Berger B., Mesirov J.P., Lander E.S., Jaffe D.B., Stanley K., Butler J., Gnerre S., Mauceli E., Berger B., Mesirov J.P., Lander E.S., Stanley K., Butler J., Gnerre S., Mauceli E., Berger B., Mesirov J.P., Lander E.S., Butler J., Gnerre S., Mauceli E., Berger B., Mesirov J.P., Lander E.S., Gnerre S., Mauceli E., Berger B., Mesirov J.P., Lander E.S., Mauceli E., Berger B., Mesirov J.P., Lander E.S., Berger B., Mesirov J.P., Lander E.S., Mesirov J.P., Lander E.S., Lander E.S. ARACHNE: A whole-genome shotgun assembler. Genome Res. 2002;12:177–189. doi: 10.1101/gr.208902. - DOI - PMC - PubMed
1. Bentley D.R. Whole-genome re-sequencing. Curr. Opin. Genet. Dev. 2006;16:545–552. - PubMed
1. Choi S.D., Creelman R., Mullet J., Wing R.A., Creelman R., Mullet J., Wing R.A., Mullet J., Wing R.A., Wing R.A. Construction and characterization of a bacterial artificial chromosome library from Arabidopsis thaliana. Weeds World. 1995;2:17–20.

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database
Molecular Biology Databases
- FlyBase
- NIAID Data Ecosystem - Find datasets on Infectious and Immune-mediated Diseases
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing

Affiliation

SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing

Authors

Affiliation

Abstract

Figures

References

MeSH terms

LinkOut - more resources

Full Text Sources

Other Literature Sources

Molecular Biology Databases

Miscellaneous