Comparative Study
PLoS One. 2013 Dec 23;8(12):e85024.
doi: 10.1371/journal.pone.0085024. eCollection 2013.

An extensive evaluation of read trimming effects on Illumina NGS data analysis

Cristian Del Fabbro et al. PLoS One. 2013.

Abstract

Next Generation Sequencing is having an extremely strong impact on biological and medical research and diagnostics, with applications ranging from gene expression quantification to genotyping and genome reconstruction. Sequencing data is often provided as raw reads, which are processed prior to analysis. One of the most used preprocessing procedures is read trimming, which aims at removing low quality portions while preserving the longest high quality part of an NGS read. In the current work, we evaluate nine different trimming algorithms in four datasets and three common NGS-based applications (RNA-Seq, SNP calling and genome assembly). Trimming is shown to increase the quality and reliability of the analysis, with concurrent gains in terms of execution time and computational resources needed.
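
The idea of keeping the longest high quality part of a read can be illustrated with a minimal sketch. It is not the algorithm of any of the nine evaluated tools; it assumes Phred+33 encoded quality strings and a single threshold Q, and the function names (`phred_scores`, `longest_high_quality_window`, `trim_read`) are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def longest_high_quality_window(qualities, threshold=20):
    """Return (start, end) of the longest run of scores >= threshold.

    The end index is exclusive. An empty window (0, 0) is returned when
    no base reaches the threshold.
    """
    best_start, best_len = 0, 0
    run_start = None
    for i, q in enumerate(qualities):
        if q >= threshold:
            if run_start is None:
                run_start = i
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            # A low quality base ends the current run.
            run_start = None
    return best_start, best_start + best_len


def trim_read(sequence, quality_string, threshold=20, offset=33):
    """Trim a read and its quality string to the longest high quality window."""
    scores = phred_scores(quality_string, offset)
    start, end = longest_high_quality_window(scores, threshold)
    return sequence[start:end], quality_string[start:end]
```

For example, `trim_read("ACGTACGTAC", "##IIIII#II")` keeps the five bases covered by the first run of `I` characters. Real trimmers typically use more tolerant criteria (sliding windows, running sums, or two thresholds as in ConDeTri), so this strict run-based rule is only a lower bound on what they retain.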

Conflict of interest statement

Competing Interests: Dr. Simone Scalabrin, one of the authors of the manuscript, is currently affiliated with IGA Technology Services, which financially supports the study. This does not alter the authors' adherence to all PLOS ONE policies on sharing data and materials.

Figures

Figure 1. Barplots indicating the performance of nine read trimming tools at different quality thresholds on a Homo sapiens RNA-Seq dataset.
For ConDeTri, two basic parameters are necessary, and combinations of both are reported (which explains the non-monotonic appearance of the barplots). Red bars indicate the percentage of reads aligning in the trimmed dataset. Blue bars indicate the number of reads surviving trimming.
Figure 2. Fraction of reads mapped vs. number of reads in the quality trimmed Homo sapiens RNA-Seq dataset.
Each symbol corresponds to a quality threshold. Peak Q parameters for each tool are reported.
Figure 3. Comparative assessment of variant detection based on Prunus persica reads aligned on the reference peach genome.
Several read trimming method/threshold combinations are tested. The Average Percentage of Minor Allele Call (APOMAC) or of Non-reference Allele Call (APONAC) are reported, together with the total number of high-confidence SNPs.
Figure 4. Number of covered nucleotides in the Prunus persica genome (total size: 227M bases) above minimum coverage thresholds.
The analysis was performed on untrimmed reads and after trimming with 9 tools at Q=20 (for ConDeTri, default parameters HQ=25 and LQ=10 were used).
Figure 5. Comparative assessment of genome assembly metrics based on Prunus persica reads.
Several read trimming method/threshold combinations are tested. Yellow bars report the N50 (relative to the untrimmed dataset N50). Blue bars report the accuracy of the assembly (% of the assembled nucleotides that could be aligned on the reference Prunus persica genome). Red bars report the recall of the assembly (% of the reference Prunus persica genome covered by the assembly).
Figure 6. Computational requirements necessary for full Prunus persica genome assembly (RAM peak and time) for different combinations of read trimming tools and thresholds.
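
The N50 metric reported in Figure 5 can be sketched as follows. This uses the common definition of N50 (the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly size); the source does not state the exact procedure used, and the function names `n50` and `relative_n50` are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


def relative_n50(trimmed_lengths, untrimmed_lengths):
    """N50 of a trimmed assembly relative to the untrimmed assembly N50."""
    return n50(trimmed_lengths) / n50(untrimmed_lengths)
```

A value of `relative_n50` above 1 would indicate that the assembly built from trimmed reads has a larger N50 than the one built from untrimmed reads.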
