Bioinformatics. 2012 Jun 15;28(12):i188-96.
doi: 10.1093/bioinformatics/bts219.

SEQuel: improving the accuracy of genome assemblies

Roy Ronen et al. Bioinformatics.

Abstract

Motivation: Assemblies of next-generation sequencing (NGS) data, although accurate, still contain a substantial number of errors that need to be corrected after the assembly process. We develop SEQuel, a tool that corrects errors (i.e. insertions, deletions and substitution errors) in the assembled contigs. Fundamental to the algorithm behind SEQuel is the positional de Bruijn graph, a graph structure that models k-mers within reads while incorporating the approximate positions of reads into the model.

Results: SEQuel reduced the number of small insertions and deletions in the assemblies of standard multi-cell Escherichia coli data by almost half, and corrected between 30% and 94% of the substitution errors. Further, we show SEQuel is imperative to improving single-cell assembly, which is inherently more challenging due to higher error rates and non-uniform coverage; over half of the small indels and substitution errors in the single-cell assemblies were corrected. We apply SEQuel to the recently assembled Deltaproteobacterium SAR324 genome, which is the first bacterial genome with a comprehensive single-cell genome assembly, and make over 800 changes (insertions, deletions and substitutions) to refine this assembly.

Availability: SEQuel can be used as a post-processing step in combination with any NGS assembler and is freely available at http://bix.ucsd.edu/SEQuel/.
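
As a rough illustration of the positional de Bruijn graph idea described in the abstract, the sketch below pairs each k-mer with the approximate position of its read and glues equal k-mers whose positions are close. It is not the SEQuel implementation: the function names, the `(start, sequence)` read format, the toy reads and the single-linkage gluing rule with tolerance `delta` (the Δ of Fig. 2) are all assumptions made for the example.

```python
from collections import defaultdict

def positional_kmers(reads, k):
    """Yield (kmer, position) for every k-mer of every (start, sequence) read."""
    for start, seq in reads:
        for i in range(len(seq) - k + 1):
            yield seq[i:i + k], start + i

def glue(reads, k, delta):
    """Glue equal k-mers whose sorted positions differ by at most delta.

    Returns {(kmer, smallest_position_in_group): multiplicity}.
    The gluing rule is an assumption for illustration only.
    """
    by_kmer = defaultdict(list)
    for kmer, pos in positional_kmers(reads, k):
        by_kmer[kmer].append(pos)

    glued = {}
    for kmer, positions in by_kmer.items():
        positions.sort()
        rep, count, prev = positions[0], 1, positions[0]
        for pos in positions[1:]:
            if pos - prev <= delta:
                count += 1          # same vertex: positions are adjacent
            else:
                glued[(kmer, rep)] = count
                rep, count = pos, 1  # start a new vertex for a distant copy
            prev = pos
        glued[(kmer, rep)] = count
    return glued

# Hypothetical reads with approximate alignment starts (toy data).
reads = [(100, "AGCCTA"), (102, "GCCTAG"), (500, "TGCCTC")]
glued = glue(reads, k=4, delta=1)
print(glued[("GCCT", 101)], glued[("GCCT", 501)])
```

Here the two nearby copies of GCCT (positions 101 and 102) collapse into one vertex, while the copy near position 501 stays separate, which is the separation a plain de Bruijn graph cannot make.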

PubMed Disclaimer

Figures

Fig. 1.
An example of a bulge on eight vertices in a de Bruijn graph (k=4) resulting from a sequencing error. During the process of bulge removal, the correct path (top: CCT-CTA-TAG-AGG-GGA) may be discarded, thus creating a substitution error in the final contig. This may occur if, for example, coverage is taken as a consideration, since the bottom path (CCT-CTT-TTG-TGG-GGA), erroneous in this case, may have higher coverage due to k-mers originating from other parts of the genome
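
The bulge in this caption can be reproduced with a minimal sketch that builds the two paths as a de Bruijn graph on (k − 1)-mers and looks for the vertex where the paths diverge and the vertex where they rejoin. The vertex labels come from the caption; the variable names and the branch/merge test are illustrative assumptions, not the bulge-removal procedure of any particular assembler.

```python
from collections import defaultdict

# The two paths from the caption (k = 4, so vertices are 3-mers).
correct = ["CCT", "CTA", "TAG", "AGG", "GGA"]
erroneous = ["CCT", "CTT", "TTG", "TGG", "GGA"]

graph = defaultdict(set)
for path in (correct, erroneous):
    for u, v in zip(path, path[1:]):
        # Consecutive vertices must overlap by k - 2 = 2 characters.
        assert u[1:] == v[:-1]
        graph[u].add(v)

vertices = set(correct) | set(erroneous)
# A bulge opens where a vertex has two successors and closes
# where a vertex has two predecessors.
branch_points = [v for v in vertices if len(graph.get(v, ())) > 1]
merge_points = [v for v in vertices
                if sum(v in succ for succ in graph.values()) > 1]

print(len(vertices), branch_points, merge_points)  # 8 ['CCT'] ['GGA']
```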
Fig. 2.
An example illustrating the positional de Bruijn graph (k=4, Δ=1) and the de Bruijn graph on a set of aligned reads, with their corresponding sets of k-mers and positional k-mers. There is a single sequencing error in the reads (shown in red). In the de Bruijn graph, the (k − 1)-mer GCC appears as a single vertex, whereas the positional de Bruijn graph separates the occurrences of GCC into two vertices. This additional information incorporated into the graph further constrains the gluing process and reduces complexity. Further, the positional k-mers (GCCT, 111) and (GCCT, 975) have multiplicity 1 and 4, respectively, but the k-mer GCCT has multiplicity 5. In the de Bruijn graph, this pooled count increases the weight of the incorrect path, and thus the likelihood of an error in the contig produced by the de Bruijn graph. Lastly, we note that in this example no vertex gluing operations occur, but in more complex instances vertex gluing will occur when equal k-mers align at adjacent positions.
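
The multiplicities quoted in this caption can be checked with a small counting sketch. The observation list below is toy data constructed only to match the counts stated in the caption; it is not the set of reads shown in the figure, and the variable names are assumptions.

```python
from collections import Counter

# Toy observations (k-mer, approximate position), built to match the caption:
# GCCT seen once near position 111 and four times near position 975.
observations = [("GCCT", 111)] + [("GCCT", 975)] * 4

# Plain de Bruijn graph: occurrences are pooled by sequence alone.
kmer_counts = Counter(kmer for kmer, _ in observations)

# Positional de Bruijn graph: occurrences are kept apart by position.
positional_counts = Counter(observations)

print(kmer_counts["GCCT"])                 # 5
print(positional_counts[("GCCT", 111)])    # 1
print(positional_counts[("GCCT", 975)])    # 4
```

Pooling gives the k-mer a weight of 5 regardless of where it came from, whereas the positional counts show that only one observation supports it at position 111.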
Fig. 3.
Illustration of the change in the number of short (≤50 bp) indels (a) and substitution errors (b) relative to the reference genome before and after the use of SEQuel. Standard reads were assembled using Euler-SR and Velvet. The assembly without SEQuel and with SEQuel is shown in blue and red, respectively
Fig. 4.
Illustration of the change in the total number of short (≤50 bp) indels (a) and substitution errors (b) in assemblies before and after the use of SEQuel. Paired-end reads from a single-cell sample were assembled using Euler-SR and Velvet-SC. The assembly without SEQuel and with SEQuel is shown in blue and red, respectively
Fig. 5.
The first illustration of the connection between assembly errors, and whirls and bulges in the de Bruijn graph. The alignment of a 1975 bp contig from the assembly with Velvet and k=31 (contig number 170157) shows two insertions in the alignment, with respective lengths of 1 bp and 15 bp. The de Bruijn graph constructed from the set of reads permissively aligned to this contig contains bulges and whirls at regions corresponding to the insertions in the contig.
Fig. 6.
The second illustration of the connection between assembly errors, and whirls and bulges in the de Bruijn graph. The alignment of a 725 bp contig from the assembly with Velvet and k=31 (contig number 10362) shows two deletions in the contig, with respective lengths of 20 bp and 7 bp. The regions in the de Bruijn graph corresponding to the deletions in the alignment are complex and contain bulges and whirls that likely lead to assembly errors.
