Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2017 Aug 30:5:e3720.
doi: 10.7717/peerj.3720. eCollection 2017.

Atropos: specific, sensitive, and speedy trimming of sequencing reads

Affiliations

Atropos: specific, sensitive, and speedy trimming of sequencing reads

John P Didion et al. PeerJ. .

Abstract

A key step in the transformation of raw sequencing reads into biological insights is the trimming of adapter sequences and low-quality bases. Read trimming has been shown to increase the quality and reliability while decreasing the computational requirements of downstream analyses. Many read trimming software tools are available; however, no tool simultaneously provides the accuracy, computational efficiency, and feature set required to handle the types and volumes of data generated in modern sequencing-based experiments. Here we introduce Atropos and show that it trims reads with high sensitivity and specificity while maintaining leading-edge speed. Compared to other state-of-the-art read trimming tools, Atropos achieves significant increases in trimming accuracy while remaining competitive in execution times. Furthermore, Atropos maintains high accuracy even when trimming data with elevated rates of sequencing errors. The accuracy, high performance, and broad feature set offered by Atropos makes it an appropriate choice for the pre-processing of Illumina, ABI SOLiD, and other current-generation short-read sequencing datasets. Atropos is open source and free software written in Python (3.3+) and available at https://github.com/jdidion/atropos.

Keywords: Adapter; Cutadapt; Illumina; NGS; Preprocessing; Read; Sequencing; Trimming.

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1
Figure 1. Adapter detection and trimming.
(A) When a fragment (or insert; green) is shorter than the read length, the read sequence will contain partial to full-length adapter sequences (blue and purple). (B, C) Methods for detecting adapter contamination using semi-global alignment. Adapter-match (B) identifies the best alignment between each adapter and the end of its corresponding read. Insert-match (C) first identifies the best alignment between read 1 and the reverse-complement (rc) of read 2; if a valid alignment is found, then adapters are matched to the remaining overhangs. (D) If a match is found, the overlapping inserts can be used for mutual error correction. The consensus base is the one with the highest quality, or, if the bases have equal quality, the one from the read with highest mean quality. (E) If insert-match fails (for example, with an adapter dimer) adapter-match is performed. Reads that are too short after trimming are discarded.
Figure 2
Figure 2. Trimming execution time for simulated data.
Execution time on simulated datasets for programs running on (A) a desktop computer with four parallel processes (threads), and (B) a cluster node with 4, 8, and 16 threads. Each program was executed multiple times, and Atropos was run with combinations of alignment algorithm (insert-match or adapter-match) and output mode (worker-compression, writer-compression, or parallel-write). The mean execution times for each program are shown with 95% CI.
Figure 3
Figure 3. Trimming execution time for real data.
Execution time on (A) WGBS and (B) mRNA-Seq datasets for programs running on a cluster node with 4, 8, and 16 threads. Each program was executed multiple times, and Atropos was run with the insert-match algorithm and parallel-write output mode. The mean execution times for each program are shown with 95% CI.
Figure 4
Figure 4. Atropos trimming best improves mapping of real WGBS sequencing reads.
Reads were adapter-trimmed with all four programs (A) without additional quality trimming (Q = 0) and (B) with quality trimming at a threshold of Q = 20. We mapped both untrimmed and trimmed reads to the genome. For each MAPQ cutoff M ∈ {0,5,…,60} on the x-axis, the number of trimmed reads with MAPQ >= M less the number of untrimmed reads with MAPQ >= M is shown on the y-axis for each program.
Figure 5
Figure 5. Atropos trimming results in the greatest increase in mRNA-seq reads mapped to GENCODE regions.
Reads were adapter-trimmed with all four programs without additional quality trimming. We mapped both untrimmed and trimmed reads to the genome using STAR. When parameter outSAMmultNmax = 2, STAR produces only four MAPQ values: 255 = unique alignment, 3 = two alignments with similar but unequal score; 1 = two alignments with equal score; and 0 = unmapped. For each MAPQ cutoff M ∈ {0,1,3,255} on the x-axis, the number of trimmed reads that align to GENCODE regions with MAPQ >= M less the number of untrimmed reads with MAPQ >= M is shown on the y-axis for each program.

References

    1. Andrews S. FastQC: a quality control tool for high throughput sequence data. Version: 0.11.5http://www.bioinformatics.babraham.ac.uk/projects/fastqc 2010
    1. Bock C. Analysing and interpreting DNA methylation data. Nature Reviews Genetics. 2012;13(10):705–719. doi: 10.1038/nrg3273. - DOI - PubMed
    1. Boettiger C. An introduction to docker for reproducible research, with examples from the R environment. ACM SIGOPS Operating Systems Review. 2015;49(1):71–79. doi: 10.1145/2723872.272388. - DOI
    1. Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM. An extensive evaluation of read trimming effects on Illumina NGS data analysis. PLOS ONE. 2013;8(12):e85024. doi: 10.1371/journal.pone.0085024. - DOI - PMC - PubMed
    1. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics. 2011;43(5):491–498. doi: 10.1038/ng.806. - DOI - PMC - PubMed

LinkOut - more resources