Methods. 2013 Sep 1;63(1):41-9.
doi: 10.1016/j.ymeth.2013.06.027. Epub 2013 Jun 29.

Kraken: a set of tools for quality control and analysis of high-throughput sequence data

Matthew P A Davis et al. Methods. 2013.

Abstract

New sequencing technologies pose significant challenges in terms of data complexity and magnitude. It is essential that efficient software is developed with performance that scales with this growth in sequence information. Here we present a comprehensive and integrated set of tools for the analysis of data from large-scale sequencing experiments. The tools support adapter detection and removal, demultiplexing of barcodes, paired-end data, a range of read architectures and the efficient removal of sequence redundancy. Sequences can be trimmed and filtered based on length, quality and complexity. Quality control plots track sequence length, composition and summary statistics with respect to genomic annotation. Several use cases have been integrated into a single streamlined pipeline, including both mRNA and small RNA sequencing experiments. This pipeline interfaces with existing tools for genomic mapping and differential expression analysis.

Keywords: Adapter trimming; Algorithms; NGS; Next-generation sequencing; Pipelines; RNAseq; Sequencing; Tools.

Figures

Fig. 1
Kraken suite. The Kraken tools are modular by nature, each addressing a discrete task. They are integrated into a workflow to process small RNA and paired-end sequencing experiments. Parallel lines represent stages of the pipeline that can utilise multiple processors and analyse sequencing lanes concurrently.
Fig. 2
Sample geometries. The position of adapter sequences, barcode sequences and sequence inserts relative to the library fragments in the sample libraries analysed in this manuscript.
Fig. 3
Analysis of a barcoded small RNA cloning experiment. Examples of the plots produced by the SequenceImp pipeline to summarise the analysis of small RNA. (A) Two plots taken from the reaper stage of the SequenceImp pipeline. These plots describe the ACTG-barcoded sample before Reaper trims and cleans the reads. (B) The length of reads for the ACTG-barcoded sample at the filter stage. Trimming in the reaper step defines a clear 20–23 nt peak in this sample. At the filter step, reads can be selected for downstream analysis based on length. Solid bars correspond to reads passed to the later stages of the pipeline. Hashed bars represent reads removed because they fall outside the maximum and minimum length criteria. (C) Reads passed from the filter step that map to Ensembl annotation at the align stage of the pipeline are separated into individual annotation classes.
Fig. 4
Examples of additional Kraken features. Additional examples of the plots produced by the pipeline when analysing alternative datasets based upon different criteria. (A) A repeat analysis can be performed at the features step of the pipeline. This aligns reads to repeat sequences (in this case LINE1 (GenBank: M13002.1)) and calculates a series of metrics that can be used to identify signals arising from the presence of piRNAs within a sequencing sample. (B) Reaper can be applied with many different filtering and trimming options. Here it trims 3′ adapter sequences, low complexity trailing sequences enriched in adenine, poor quality regions and sequences following regions enriched for Ns, and removes reads that are subsequently less than 10 nt in length. The pie chart summarises the reasons for which reads were removed from the sample in their entirety (e.g. because they fall below the length threshold passed to Reaper). discarded_length_cutoff: adapter trimming reduced the read length below the length threshold specified in the Reaper configuration file; discarded_tri: trimming of low complexity regions reduced the read length below the threshold; discarded_QQQ: trimming of low quality bases reduced the read length below the threshold. (C) For paired-end sequencing, Tally identifies redundant read sequences at the filter step. This plot describes the reasons that Tally discards reads from each of the paired samples, while ensuring read pairing.
Fig. 5
Read trimming and filtering benchmarking. (A) Run-time for a test benchmark dataset of 1, 5, 10 and 25 million reads for Reaper, Btrim, Cutadapt, FASTX and AdapterRemoval. For each dataset size the total run-time in seconds is given for each method. In all cases input was provided in compressed FASTQ format and output was compressed on the fly. The same adapter sequence and barcode sequences were provided to each method. (B) Memory usage and run-time benchmark for a deduplication task on a FASTQ file with 65 M reads and 2.5 G bases. Results are shown for Tally, Fastx_collapser and a simple custom Perl program employing an associative array, including a Tally run where quality data was tracked for each deduplicated read (using the per-base maximum quality score across all duplicated reads).
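
The deduplication task in Fig. 5B can be illustrated with a minimal Python sketch of the associative-array approach, extended to keep the per-base maximum quality score across duplicated reads. This is not Tally's implementation and not the Perl program used in the benchmark; the function name `dedup_max_quality`, the input shape (pairs of sequence and quality string) and the Phred+33 quality encoding are all assumptions made for illustration.

```python
def dedup_max_quality(records):
    """Collapse identical read sequences.

    records: iterable of (sequence, quality_string) pairs. Qualities are
    assumed to be Phred+33 characters, so comparing characters compares
    scores (hypothetical input format, chosen for this sketch).

    Returns a list of (sequence, count, best_quality_string) tuples in
    order of first appearance.
    """
    seen = {}  # sequence -> [count, list of best quality characters]
    for seq, qual in records:
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        entry = seen.get(seq)
        if entry is None:
            # First time this sequence is observed.
            seen[seq] = [1, list(qual)]
        else:
            entry[0] += 1
            best = entry[1]
            # Keep the higher quality character at each base position.
            for i, q in enumerate(qual):
                if q > best[i]:
                    best[i] = q
    return [(seq, count, "".join(best)) for seq, (count, best) in seen.items()]


reads = [("ACGT", "IIII"), ("ACGT", "#J#J"), ("TTTT", "####")]
result = dedup_max_quality(reads)
```

A dictionary keyed on the full read sequence holds every distinct read in memory at once, which is the cost the memory benchmark in panel B measures for the simple associative-array program.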
