Kraken: a set of tools for quality control and analysis of high-throughput sequence data

Matthew P A Davis¹, Stijn van Dongen, Cei Abreu-Goodger, Nenad Bartonicek, Anton J Enright

PMID: 23816787
PMCID: PMC3991327
DOI: 10.1016/j.ymeth.2013.06.027

Matthew P A Davis et al. Methods. 2013 Sep 1;63(1):41-9. doi: 10.1016/j.ymeth.2013.06.027. Epub 2013 Jun 29.

Affiliation

¹ EMBL - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

Abstract

New sequencing technologies pose significant challenges in terms of data complexity and magnitude. It is essential that efficient software is developed with performance that scales with this growth in sequence information. Here we present a comprehensive and integrated set of tools for the analysis of data from large scale sequencing experiments. It supports adapter detection and removal, demultiplexing of barcodes, paired-end data, a range of read architectures and the efficient removal of sequence redundancy. Sequences can be trimmed and filtered based on length, quality and complexity. Quality control plots track sequence length, composition and summary statistics with respect to genomic annotation. Several use cases have been integrated into a single streamlined pipeline, including both mRNA and small RNA sequencing experiments. This pipeline interfaces with existing tools for genomic mapping and differential expression analysis.

Keywords: Adapter trimming; Algorithms; NGS; Next-generation sequencing; Pipelines; RNAseq; Sequencing; Tools.

Figures

**Fig. 1**
Kraken suite. The Kraken tools are modular by nature, each addressing a discrete task. They are integrated into a workflow to process small RNA and paired-end sequencing experiments. Parallel lines represent stages of the pipeline that can utilise multiple processors and analyse sequencing lanes concurrently.

**Fig. 2**
Sample geometries. The position of adapter sequences, barcode sequences and sequence inserts relative to the library fragments in the sample libraries analysed in this manuscript.

**Fig. 3**
Analysis of a barcoded small RNA cloning experiment. Examples of the plots produced by the SequenceImp pipeline to summarise the analysis of small RNA data. (A) Two plots taken from the *reaper* stage of the SequenceImp pipeline. These plots describe the ACTG-barcoded sample before Reaper trims and cleans the reads. (B) The length of reads for the ACTG-barcoded sample at the *filter* stage. Trimming in the *reaper* step defines a clear 20–23 nt peak in this sample. At the *filter* step reads can be selected for downstream analysis based on length. Solid bars correspond to reads passed to the later stages of the pipeline. Hashed bars represent reads removed because they fall outside the maximum and minimum length criteria. (C) Reads passed from the *filter* step that map to Ensembl annotation at the *align* stage of the pipeline are separated into individual annotation classes.
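The length selection described for panel (B) can be illustrated with a minimal sketch. This is not the SequenceImp implementation; the function name `split_by_length`, the example reads and the 18 to 26 nt bounds are assumptions chosen only for illustration.

```python
from typing import Iterable, List, Tuple


def split_by_length(
    reads: Iterable[str], min_len: int, max_len: int
) -> Tuple[List[str], List[str]]:
    """Partition reads into (kept, removed) using inclusive length bounds."""
    kept: List[str] = []
    removed: List[str] = []
    for read in reads:
        if min_len <= len(read) <= max_len:
            kept.append(read)      # would be drawn as a solid bar
        else:
            removed.append(read)   # would be drawn as a hashed bar
    return kept, removed


# Hypothetical trimmed reads and thresholds, for illustration only.
reads = ["ACGT" * 5, "ACGTACGTACGTACGTACGTAC", "ACGTACGT", "A" * 30]
kept, removed = split_by_length(reads, min_len=18, max_len=26)
```

Here the 20 nt and 22 nt reads are kept, while the 8 nt and 30 nt reads fall outside the bounds and are removed.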

**Fig. 4**
Examples of additional Kraken features. Additional examples of the plots produced by the pipeline when analysing alternative datasets based upon different criteria. (A) A repeat analysis can be performed at the *features* step of the pipeline. This aligns reads to repeat sequences (in this case LINE1, GenBank: M13002.1) and calculates a series of metrics that can be used to identify signals arising from the presence of piRNAs within a sequencing sample. (B) Reaper can be applied with many different filtering and trimming options, here trimming 3′ adapter sequences, low complexity trailing sequences enriched in adenine, poor quality regions and sequences following regions enriched for Ns, and removing reads that are subsequently less than 10 nt in length. The pie chart summarises the reasons for which reads were removed from the sample in their entirety (e.g. falling below the length threshold passed to Reaper). *discarded_length_cutoff*: adapter trimming reduced the read length below the length threshold specified in the Reaper configuration file, *discarded_tri*: the trimming of low complexity regions reduced the length below the threshold, *discarded_QQQ*: trimming of low quality bases reduced the read length below the length threshold. (C) For paired-end sequencing Tally identifies redundant read sequences at the *filter* step. This plot describes the reasons that Tally discards reads from each of the paired samples, while ensuring read pairing.
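The trim-then-discard logic summarised in panel (B) can be sketched as follows. This is a simplified illustration, not Reaper's algorithm: it assumes exact adapter matching with no mismatches, and the adapter string, the `min_overlap` parameter and the function names are hypothetical. Only the `discarded_length_cutoff` label is taken from the caption above.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def trim_3p_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter by exact matching (no mismatches allowed)."""
    # Full adapter present: cut the read at its first occurrence.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Otherwise look for an adapter prefix that runs off the end of the read,
    # trying the longest possible overlap first.
    for overlap in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[: len(read) - overlap]
    return read


def trim_and_filter(
    reads: Iterable[str], adapter: str, min_len: int = 10
) -> Tuple[List[str], Counter]:
    """Trim each read, then discard reads shorter than min_len."""
    kept: List[str] = []
    reasons: Counter = Counter()
    for read in reads:
        trimmed = trim_3p_adapter(read, adapter)
        if len(trimmed) < min_len:
            reasons["discarded_length_cutoff"] += 1
        else:
            kept.append(trimmed)
    return kept, reasons


# Hypothetical adapter and reads, for illustration only.
adapter = "TGGAATTCTCGG"
reads = [
    "ACGTACGTACGTACGTACGTAC" + adapter,  # full adapter, 22 nt insert
    "ACGTAC" + adapter,                  # insert too short after trimming
    "ACGTACGTACGTACGT" + adapter[:5],    # partial adapter at the read end
    "ACGTACGTACGTACGTACGT",              # no adapter present
]
kept, reasons = trim_and_filter(reads, adapter, min_len=10)
```

The counts in `reasons` are the kind of tally a pie chart of discard reasons would be drawn from. Handling of low complexity and low quality regions is omitted here.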

**Fig. 5**
Read trimming and filtering benchmarking. (A) Run-time for a test benchmark dataset of 1, 5, 10 and 25 million reads for Reaper, Btrim, Cutadapt, FASTX and AdapterRemoval. For each size the total run-time in seconds for each method is given. In all cases input was provided in compressed FASTQ format and output was compressed on the fly. The same adapter and barcode sequences were provided to each method. (B) Memory usage and run-time benchmark for a deduplication task on a FASTQ file with 65 M reads and 2.5 G bases. Results are shown for Tally, Fastx_collapser and a simple custom Perl program employing an associative array, including a Tally run where quality data was tracked for each deduplicated read (using the per-base maximum quality score across all duplicated reads).
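The associative-array approach to deduplication mentioned in panel (B), together with per-base maximum quality tracking, can be sketched with a Python dictionary. This is an illustrative sketch only, neither Tally's implementation nor the Perl program used in the benchmark; the function name `collapse_reads` and the example records are assumptions.

```python
from typing import Dict, Iterable, Tuple


def collapse_reads(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Tuple[int, str]]:
    """Collapse identical sequences, keeping a count and per-base max quality.

    Each record is a (sequence, quality_string) pair of equal length.
    """
    collapsed: Dict[str, Tuple[int, str]] = {}
    for seq, qual in records:
        if seq in collapsed:
            count, best = collapsed[seq]
            # FASTQ quality characters sort in score order, so max() on the
            # characters picks the higher score at each position.
            merged = "".join(max(a, b) for a, b in zip(best, qual))
            collapsed[seq] = (count + 1, merged)
        else:
            collapsed[seq] = (1, qual)
    return collapsed


# Hypothetical records, for illustration only.
records = [("ACGT", "II#I"), ("ACGT", "#III"), ("TTGA", "IIII")]
collapsed = collapse_reads(records)
```

Holding every distinct sequence in one in-memory table is simple, but memory grows with the number of unique reads, which is the cost that a memory benchmark for this task measures.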

Grants and funding

BB/01589X/1/Biotechnology and Biological Sciences Research Council/United Kingdom
