Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2007 Nov;17(11):1697-706.
doi: 10.1101/gr.6435207. Epub 2007 Oct 1.

SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing

Affiliations

SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing

Juliane C Dohm et al. Genome Res. 2007 Nov.

Abstract

The latest revolution in the DNA sequencing field has been brought about by the development of automated sequencers that are capable of generating giga base pair data sets quickly and at low cost. Applications of such technologies seem to be limited to resequencing and transcript discovery, due to the shortness of the generated reads. In order to extend the fields of application to de novo sequencing, we developed the SHARCGS algorithm to assemble short-read (25-40-mer) data with high accuracy and speed. The efficiency of SHARCGS was tested on BAC inserts from three eukaryotic species, on two yeast chromosomes, and on two bacterial genomes (Haemophilus influenzae, Escherichia coli). We show that 30-mer-based BAC assemblies have N50 sizes >20 kbp for Drosophila and Arabidopsis and >4 kbp for human in simulations taking missing reads and wrong base calls into account. We assembled 949,974 contigs with length >50 bp, and only one single contig could not be aligned error-free against the reference sequences. We generated 36-mer reads for the genome of Helicobacter acinonychis on the Illumina 1G sequencing instrument and assembled 937 contigs covering 98% of the genome with an N50 size of 3.7 kbp. With the exception of five contigs that differ in 1-4 positions relative to the reference sequence, all contigs matched the genome error-free. Thus, SHARCGS is a suitable tool for fully exploiting novel sequencing technologies by assembling sequence contigs de novo with high confidence and by outperforming existing assembly algorithms in terms of speed and accuracy.

PubMed Disclaimer

Figures

Figure 1.
Figure 1.
Distributions of N50 values in relation to different read length illustrated using box plots. An N50 length of x kbp indicates that 50% of the source sequence is covered by contigs of x kbp or larger. The data was collected from assemblies under perfect assumptions (all possible reads are present exactly once, no sequencing errors), based on 20 BACs with an average size of 110 kbp for each species. The boxes stretch from the first to the third quartile of the respective distribution, thus covering 50% of the corresponding data. The boxes are split with a line at the median. For sake of clarity, the median value is given for each box plot. Whiskers indicated with dashed lines are 1.5 times longer than the box but do not stretch beyond minima and maxima. They are used to define the outliers, data points outside the range of the whiskers, which are marked as small circles.
Figure 2.
Figure 2.
Detection of ambiguities during contig elongation despite missing read information. (A) Target sequence to assemble. Repetitive segments are shown in gray, unique sequence segments are shown in white and indicated with numbers. The positions of reads A–F and S are also shown in panel A. These reads are used to explain the principles of error-free contig extension in panels BE. (B) Assembly under idealized assumptions (all reads present). Read S to be elongated to the right. An ambiguity is found, both reads B and D could extend read S; neither read D nor B will be assembled. (C, D) Assembly under real assumptions, but without robust contig extension. (C) Read S to be elongated to the right, read B is missing so that the ambiguity cannot be detected; read D would be assembled to S. (D) Naively built contig S1 will be elongated to the far right end (not shown), then the elongation to the left starts, one crucial read is missing again (read C) and the ambiguity cannot be detected; read A would be assembled to contig S1, wrongly connecting sequence segment 1 with sequence segment 3. (E) Assembly under real assumptions with SHARCGS’ robust contig extension. Different overlaps are checked before assembling D, and if any ambiguity is found for smaller overlaps the assembly of D will be rejected. Since both reads E and F could be assembled to the read S, read D will not be assembled to read S.
Figure 3.
Figure 3.
Read filtering and selection prior to assembly to avoid inclusion of reads containing sequencing errors. (A) Proportion of correct reads that remain in the data set by applying two- to fourfold redundancy filtering (i.e., by counting reads present at least twice, threefold, or fourfold in the data set) depending on BAC pool size (average insert size per BAC 110 kbp). (B) Proportion of false reads that remain in the data set for assembly after application of two- to fourfold redundancy filtering. Data were simulated with 0.6% error rate.
Figure 4.
Figure 4.
Dependency of N50 on BAC pool size for three different species (A–C). The data shown were collected from assemblies under realistic assumptions, with an error rate of 0.6% per base call and reads generated from randomly chosen positions. We mimic the BAC pool size by generating the proportional number of reads, i.e., for pool size n, we generate 5 million/n reads per BAC, corresponding to a coverage of 1363/n. BAC pool sizes (1–16) and corresponding coverages are indicated at the bottom of the figure. For each species we used 20 BACs with average length 110 kbp and simulated 30-mer data sets three times per BAC. Thus, for each pool size distribution, 60 assembly results are shown using box plots. Boxes stretch from the first to the third quartile of the respective distribution and cover 50% of the corresponding data. The median inside each box is indicated with a line. Whiskers (dashed lines) are 1.5 times longer than the box but do not stretch beyond minima and maxima and are used to define the outliers (small circles).
Figure 5.
Figure 5.
Summary of decrease in N50 size due to pooling (for corresponding coverages, see Fig. 4). The plot compares N50 values for a given pool size with the optimal N50 achieved for pool size 1. The lines show the average fraction of N50 per pool size divided by the optimal (“best”) N50. For each pool size, we average over 20 BACs and three simulations each, using the same data as in Fig. 4.
Figure 6.
Figure 6.
Description of the core assembly algorithm. (A) Pseudocode overview of the steps during assembly of a single contig. The parameter omin controls the stringency of the algorithm, and r denotes the read length. (B) Illustration of the elongation step. Contig C is to be elongated to the right. Read R is a candidate for elongation found in the data set of reads, because its prefix (gray) matches the end of C perfectly. The suffix of read R (white) is the potential extension E for contig C. The length of the check region M is the sum of read length r, and the length of the extension E. Substrings of M and its reverse complement are used to search for matching read prefixes in the data set. Only if all of these reads match M exactly is C extended by E.

Similar articles

Cited by

References

    1. Adams M.D., Celniker S.E., Holt R.A., Evans C.A., Gocayne J.D., Amanatides P.G., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., Celniker S.E., Holt R.A., Evans C.A., Gocayne J.D., Amanatides P.G., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., Holt R.A., Evans C.A., Gocayne J.D., Amanatides P.G., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., Evans C.A., Gocayne J.D., Amanatides P.G., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., Gocayne J.D., Amanatides P.G., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., Amanatides P.G., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., Li P.W., Hoskins R.A., Galle R.F., Hoskins R.A., Galle R.F., Galle R.F. The genome sequence of Drosophila melanogaster. Science. 2000;287:2185–2195. - PubMed
    1. Aparicio S., Chapman J., Stupka E., Putnam N., Chia J.M., Dehal P., Christoffels A., Rash S., Hoon S., Smit A., Chapman J., Stupka E., Putnam N., Chia J.M., Dehal P., Christoffels A., Rash S., Hoon S., Smit A., Stupka E., Putnam N., Chia J.M., Dehal P., Christoffels A., Rash S., Hoon S., Smit A., Putnam N., Chia J.M., Dehal P., Christoffels A., Rash S., Hoon S., Smit A., Chia J.M., Dehal P., Christoffels A., Rash S., Hoon S., Smit A., Dehal P., Christoffels A., Rash S., Hoon S., Smit A., Christoffels A., Rash S., Hoon S., Smit A., Rash S., Hoon S., Smit A., Hoon S., Smit A., Smit A., et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science. 2002;297:1301–1310. - PubMed
    1. Batzoglou S., Jaffe D.B., Stanley K., Butler J., Gnerre S., Mauceli E., Berger B., Mesirov J.P., Lander E.S., Jaffe D.B., Stanley K., Butler J., Gnerre S., Mauceli E., Berger B., Mesirov J.P., Lander E.S., Stanley K., Butler J., Gnerre S., Mauceli E., Berger B., Mesirov J.P., Lander E.S., Butler J., Gnerre S., Mauceli E., Berger B., Mesirov J.P., Lander E.S., Gnerre S., Mauceli E., Berger B., Mesirov J.P., Lander E.S., Mauceli E., Berger B., Mesirov J.P., Lander E.S., Berger B., Mesirov J.P., Lander E.S., Mesirov J.P., Lander E.S., Lander E.S. ARACHNE: A whole-genome shotgun assembler. Genome Res. 2002;12:177–189. doi: 10.1101/gr.208902. - DOI - PMC - PubMed
    1. Bentley D.R. Whole-genome re-sequencing. Curr. Opin. Genet. Dev. 2006;16:545–552. - PubMed
    1. Choi S.D., Creelman R., Mullet J., Wing R.A., Creelman R., Mullet J., Wing R.A., Mullet J., Wing R.A., Wing R.A. Construction and characterization of a bacterial artificial chromosome library from Arabidopsis thaliana. Weeds World. 1995;2:17–20.