J Comput Biol. 2010 Nov;17(11):1519-33.
doi: 10.1089/cmb.2009.0238. Epub 2010 Oct 20.

Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies

Sébastien Boisvert et al. J Comput Biol. 2010 Nov.

Abstract

An accurate genome sequence of a desired species is now a prerequisite for genome research. An important step in obtaining a high-quality genome sequence is to correctly assemble short reads into longer sequences that accurately represent contiguous genomic regions. Current sequencing technologies continue to offer increases in throughput and corresponding reductions in cost and time. Unfortunately, the benefit of obtaining a large number of reads is complicated by sequencing errors, with different biases observed on each platform. Although software is available to assemble reads for each individual system, no procedure has been proposed for high-quality simultaneous assembly based on reads from a mix of different technologies. In this paper, we describe a parallel short-read assembler, called Ray, which has been developed to assemble reads obtained from a combination of sequencing platforms. We compared its performance to other assemblers on simulated and real datasets. We used a combination of Roche/454 and Illumina reads to assemble three different genomes. We showed that mixing sequencing technologies systematically reduces the number of contigs and the number of errors. Because of its open nature, this new tool will hopefully serve as a basis for developing an assembler of universal utility (availability: http://deNovoAssembler.sf.Net/). For online Supplementary Material, see www.liebertonline.com.

Figures

FIG. 1.
A subgraph of a de Bruijn graph. This figure shows part of a de Bruijn graph. In this example, short reads are not enough to solve the assembly problem. Suppose that the true genome sequence is of the form [formula image]. If the length of the reads (or paired reads) is smaller than the [formula image] subsequence, no hint will help an assembly algorithm to differentiate the true sequence from the following one: [formula image]. On the other hand, if there is a read that starts before z1 and ends after z2, the branching problem can be solved.
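
The branching described in this caption can be reproduced on a toy sequence. The sketch below is illustrative only and is not taken from Ray: the function names, the 18-base sequence, and k = 3 are assumptions. It builds a de Bruijn graph in which each k-mer points to the k-mers that follow it, then lists the vertices with more than one successor.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        # Consecutive k-mers overlap by k - 1 bases.
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def branching_vertices(graph):
    """Return the k-mers that have more than one successor."""
    return sorted(v for v, successors in graph.items() if len(successors) > 1)


# Hypothetical toy genome: the repeat GCGTA occurs twice with different flanks.
genome = "AT" + "GCGTA" + "CCTT" + "GCGTA" + "GG"
graph = de_bruijn_edges([genome], k=3)
print(branching_vertices(graph))  # ['GTA']
```

The last k-mer of the repeat, GTA, is followed by TAC in one copy and by TAG in the other, so a walk that reaches it cannot choose a successor from the graph alone. A read that covers the whole repeat together with both of its flanks carries the missing information.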
FIG. 2.
Coverage distributions. This figure shows the k-mer coverage distributions (k = 21) for the A. baylyi ADP1 dataset with Roche/454 reads, Illumina reads, and both combined. The minimum coverage and the peak coverage are identified for each of the three distributions. The peak coverage of Roche/454+Illumina is greater than the sum of the peak coverage of Roche/454 and the peak coverage of Illumina, which suggests that the mixed approach allows one to recover low-coverage regions.
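
A k-mer coverage distribution of this kind can be sketched in a few lines. The code below is a minimal illustration, not Ray's implementation. It assumes that "minimum coverage" means the first valley of the histogram (where the low-coverage error k-mers stop dominating) and that "peak coverage" means the most frequent coverage at or after that valley. Both definitions, the function names, and the toy histogram are assumptions.

```python
from collections import Counter


def kmer_coverage_distribution(reads, k):
    """Return a histogram: coverage value -> number of distinct k-mers seen that often."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(counts.values())


def minimum_and_peak(distribution):
    """Locate the first valley and the highest point after it (assumed definitions)."""
    max_cov = max(distribution)
    freq = [distribution.get(c, 0) for c in range(max_cov + 1)]
    minimum = 1
    # Walk right while the histogram keeps falling.
    while minimum < max_cov and freq[minimum + 1] <= freq[minimum]:
        minimum += 1
    peak = max(range(minimum, max_cov + 1), key=lambda c: freq[c])
    return minimum, peak


# Hypothetical histogram: many error k-mers at coverage 1, true peak at coverage 6.
toy = {1: 100, 2: 40, 3: 10, 4: 15, 5: 30, 6: 50, 7: 35, 8: 12}
print(minimum_and_peak(toy))  # (3, 6)
```

Pooling reads from two platforms before counting adds their k-mer counts together, which is why the combined distribution is shifted to the right of either single-platform distribution.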
FIG. 3.
The Ray algorithm. Ray is a greedy algorithm on a de Bruijn graph. The extension of seeds is carried out by the subroutine GrowSeed. Each seed is extended using Rules 1 and 2. Afterwards, each extended seed is extended in the opposite direction using the reverse-complement path of the extended seed. Given two seeds s1 and s2, the reachability of s1 from s2 is not equivalent to the reachability of s2 from s1. Owing to this property of reachability between seeds, a final merging step is necessary to remove sequences that appear twice in the assembly.
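
The two mechanical ingredients named in this caption, reverse complementation and greedy extension of a path, can be sketched as follows. This is not GrowSeed, and it does not implement Rules 1 and 2, which are not given here. As a stand-in, the hypothetical `extend_unambiguously` simply follows edges while the current vertex has exactly one successor. The function names and the toy graph are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def extend_unambiguously(graph, seed):
    """Greedily extend a path of k-mers while the next step is unambiguous.

    Stand-in rule (an assumption, not Ray's Rules 1 and 2): stop at any vertex
    with zero or several successors, or when the walk would revisit a vertex.
    """
    path = list(seed)
    seen = set(path)
    while True:
        successors = graph.get(path[-1], set())
        if len(successors) != 1:
            break
        (nxt,) = successors
        if nxt in seen:
            break
        path.append(nxt)
        seen.add(nxt)
    return path


# Hypothetical toy graph: GCG branches, so the extension stops there.
toy_graph = {"ATG": {"TGC"}, "TGC": {"GCG"}, "GCG": {"CGT", "CGA"}}
print(extend_unambiguously(toy_graph, ["ATG"]))  # ['ATG', 'TGC', 'GCG']
print(reverse_complement("ATGC"))                # GCAT
```

To extend in the opposite direction, the same forward routine can be applied to the reverse-complement path, as the caption describes. Because two extensions that start from different seeds may cover the same region, a deduplication step is still needed afterwards.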
