Bioinformatics. 2012 Jun 15;28(12):i188-96.

doi: 10.1093/bioinformatics/bts219.

SEQuel: improving the accuracy of genome assemblies

Roy Ronen¹, Christina Boucher, Hamidreza Chitsaz, Pavel Pevzner

PMID: 22689760
PMCID: PMC3371851
DOI: 10.1093/bioinformatics/bts219

Roy Ronen et al. Bioinformatics. 2012.

Affiliation

¹ Bioinformatics Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA.

Abstract

Motivation: Assemblies of next-generation sequencing (NGS) data, although accurate, still contain a substantial number of errors that need to be corrected after the assembly process. We develop SEQuel, a tool that corrects errors (i.e. insertions, deletions and substitution errors) in the assembled contigs. Fundamental to the algorithm behind SEQuel is the positional de Bruijn graph, a graph structure that models k-mers within reads while incorporating the approximate positions of reads into the model.

Results: SEQuel reduced the number of small insertions and deletions in the assemblies of standard multi-cell Escherichia coli data by almost half, and corrected between 30% and 94% of the substitution errors. Further, we show that SEQuel is essential for improving single-cell assembly, which is inherently more challenging due to higher error rates and non-uniform coverage; over half of the small indels and substitution errors in the single-cell assemblies were corrected. We apply SEQuel to the recently assembled Deltaproteobacterium SAR324 genome, the first bacterial genome with a comprehensive single-cell genome assembly, and make over 800 changes (insertions, deletions and substitutions) to refine this assembly.

Availability: SEQuel can be used as a post-processing step in combination with any NGS assembler and is freely available at http://bix.ucsd.edu/SEQuel/.

Figures

**Fig. 1.**
An example of a bulge on eight vertices in a de Bruijn graph (k=4) resulting from a sequencing error. During bulge removal, the correct path (top: CCT-CTA-TAG-AGG-GGA) may be discarded, creating a substitution error in the final contig. This can happen if, for example, coverage is taken into consideration, since the bottom path (CCT-CTT-TTG-TGG-GGA), erroneous in this case, may have higher coverage due to k-mers originating from other parts of the genome.
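
The bulge described in this caption can be reproduced with a minimal sketch. It assumes the two paths spell the sequences CCTAGGA (top) and CCTTGGA (bottom), which is inferred from the vertex labels in the caption; the function name `de_bruijn_edges` is illustrative and is not taken from SEQuel.

```python
from collections import defaultdict

def de_bruijn_edges(sequences, k):
    """Map each (k-1)-mer vertex to the set of (k-1)-mer vertices it points to."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Assumed sequences spelled by the two paths in the caption.
correct = "CCTAGGA"    # CCT-CTA-TAG-AGG-GGA
erroneous = "CCTTGGA"  # CCT-CTT-TTG-TGG-GGA

graph = de_bruijn_edges([correct, erroneous], k=4)
vertices = set(graph) | {v for targets in graph.values() for v in targets}
into_gga = sorted(src for src, targets in graph.items() if "GGA" in targets)

print(len(vertices))         # number of vertices in the bulge
print(sorted(graph["CCT"]))  # the two branches leaving CCT
print(into_gga)              # the two branches rejoining at GGA
```

The paths diverge at CCT and rejoin at GGA, so a bulge-removal step has to pick one branch. If it picks by coverage alone, nothing in this plain graph distinguishes k-mers that truly belong at this locus from equal k-mers elsewhere in the genome.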

**Fig. 2.**
An example illustrating the positional de Bruijn graph (k=4, Δ=1) and the de Bruijn graph on a set of aligned reads, with their corresponding sets of k-mers and positional k-mers. There is a single sequencing error in the reads (shown in red). In the de Bruijn graph, the (k − 1)-mer GCC appears as a single vertex, whereas the positional de Bruijn graph separates the occurrences of GCC into two vertices. This additional information further constrains the gluing process and reduces complexity. Further, the positional k-mers (GCCT, 111) and (GCCT, 975) have multiplicity 1 and 4, respectively, but the k-mer GCCT has multiplicity 5. In the de Bruijn graph, this pooled multiplicity increases the weight of the incorrect path, and thus the likelihood of an error in the contig it produces. Lastly, we note that no vertex gluing operations occur in this example, but in more complex instances vertex gluing will occur when equal k-mers align at adjacent positions.
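
The multiplicities quoted in this caption can be illustrated with a small sketch. The reads and their alignment positions below are hypothetical, chosen only so that GCCT occurs once at position 111 and four times at position 975; the figure's actual reads are not given here. The `glue` function is a simplified greedy stand-in for gluing equal k-mers within Δ of each other and should not be read as SEQuel's implementation.

```python
from collections import Counter

def positional_kmers(aligned_reads, k):
    """Yield (k-mer, position) for every k-mer of every (start, sequence) read."""
    for start, seq in aligned_reads:
        for i in range(len(seq) - k + 1):
            yield seq[i:i + k], start + i

def glue(counts, delta):
    """Merge equal k-mers whose position is within delta of the previous one."""
    glued = Counter()
    last_seen = {}  # k-mer -> (anchor position, most recent position)
    for (kmer, pos), n in sorted(counts.items()):
        last = last_seen.get(kmer)
        if last is not None and pos - last[1] <= delta:
            anchor = last[0]   # close enough: glue onto the existing vertex
        else:
            anchor = pos       # too far away: start a new vertex
        last_seen[kmer] = (anchor, pos)
        glued[(kmer, anchor)] += n
    return glued

# Hypothetical aligned reads as (start position, sequence) pairs.
reads = [(110, "TGCCTA")] + [(975, "GCCTG")] * 4

positional = Counter(positional_kmers(reads, k=4))
plain = Counter(kmer for kmer, _ in positional.elements())
glued = glue(positional, delta=1)

print(plain["GCCT"])               # pooled multiplicity in the plain graph
print(positional[("GCCT", 111)])   # multiplicity at the first locus
print(positional[("GCCT", 975)])   # multiplicity at the second locus
```

With Δ=1 the two GCCT occurrences stay separate because their positions are far apart, matching the caption's remark that no gluing occurs in this example. Equal k-mers at adjacent positions, such as 975 and 976, would be merged into one vertex by `glue`.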

**Fig. 3.**
Illustration of the change in the number of short (≤50 bp) indels (a) and substitution errors (b) relative to the reference genome before and after the use of SEQuel. Standard reads were assembled using Euler-SR and Velvet. Assemblies without and with SEQuel are shown in blue and red, respectively.

**Fig. 4.**
Illustration of the change in the total number of short (≤50 bp) indels (a) and substitution errors (b) in assemblies before and after the use of SEQuel. Paired-end reads from a single-cell sample were assembled using Euler-SR and Velvet-SC. Assemblies without and with SEQuel are shown in blue and red, respectively.

**Fig. 5.**
The first illustration of the connection between assembly errors and whirls and bulges in the de Bruijn graph. The alignment of a 1975 bp contig from the Velvet assembly with k=31 (contig number 170157) shows two insertions, of lengths 1 bp and 15 bp. The de Bruijn graph constructed from the set of reads permissively aligned to this contig contains bulges and whirls at the regions corresponding to the insertions in the contig.

**Fig. 6.**
The second illustration of the connection between assembly errors and whirls and bulges in the de Bruijn graph. The alignment of a 725 bp contig from the Velvet assembly with k=31 (contig number 10362) shows two deletions in the contig, of lengths 20 bp and 7 bp. The regions of the de Bruijn graph corresponding to the deletions in the alignment are complex and contain bulges and whirls that likely lead to assembly errors.
