BMC Bioinformatics. 2019 May 3;20(1):226.

doi: 10.1186/s12859-019-2799-0.

FastqPuri: high-performance preprocessing of RNA-seq data

Paula Pérez-Rubio¹, Claudio Lottaz¹, Julia C Engelmann²

Affiliations

¹ Statistical Bioinformatics, Institute of Functional Genomics, University of Regensburg, Am BioPark 9, Regensburg, 93053, Germany.
² Department of Marine Microbiology and Biogeochemistry, NIOZ Royal Netherlands Institute for Sea Research and Utrecht University, P.O. Box 59, Den Burg, 1790 AB, The Netherlands. julia.engelmann@nioz.nl.

PMID: 31053060
PMCID: PMC6500068
DOI: 10.1186/s12859-019-2799-0

Paula Pérez-Rubio et al. BMC Bioinformatics. 2019.


Abstract

Background: RNA sequencing (RNA-seq) has become the standard means of analyzing gene and transcript expression at high throughput. Sequence alignment used to be a time-consuming step, but fast alignment methods, and even more so transcript counting methods that avoid mapping and quantify gene and transcript expression by evaluating whether a read is compatible with a transcript, have sped up data analysis significantly. The most time-consuming step in the analysis of RNA-seq data is now preprocessing the raw sequence data: running quality control and filtering out adapters, contamination and low-quality sequence before transcript or gene quantification. Many researchers chain different tools for this purpose, but comprehensive, flexible and fast software that covers all preprocessing steps is currently missing.

Results: Here we present FastqPuri, a light-weight and highly efficient preprocessing tool for fastq data. FastqPuri provides sequence quality reports at the sample and dataset level, with new plots that facilitate decisions about subsequent quality filtering. Moreover, FastqPuri efficiently removes adapter sequences and sequences from biological contamination from the data. It accepts both single- and paired-end data in uncompressed or compressed fastq files. FastqPuri can be run stand-alone or within pipelines. We benchmarked FastqPuri against existing tools and found that FastqPuri is superior in terms of speed, memory usage, versatility and comprehensiveness.

Conclusions: FastqPuri is a new tool that covers all aspects of short read sequence data preprocessing. It was designed for RNA-seq data, to meet the need for fast preprocessing of fastq data before transcript and gene counting, but it is suitable for any short read sequencing data for which high sequence quality is needed, such as for genome assembly or SNV (single nucleotide variant) detection. FastqPuri is particularly flexible in filtering undesired biological sequences: it offers two approaches that optimize speed and memory usage depending on the total size of the potentially contaminating sequences. FastqPuri is available at https://github.com/jengelmann/FastqPuri. It is implemented in C and R and licensed under GPL v3.

Keywords: Preprocessing; Quality control; RNA-seq; Sequence data; fastq.
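
The abstract mentions two approaches for filtering contaminating sequences, and the Fig. 1 caption names the corresponding executables makeTree and makeBloom. As a rough illustration of the general Bloom filter idea only, the following Python sketch stores the k-mers of a contaminant sequence in a bit array and scores a read by the fraction of its k-mers found there. It is not FastqPuri's C implementation: the class and function names, the k-mer length, the filter size and the number of hash functions are all hypothetical choices made for this example, and reverse-complement matching is ignored.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array queried through several salted hashes."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        # n_bits and n_hashes are illustrative defaults, not tuned values.
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting a single hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield all overlapping substrings of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_filter(contaminant, k=12):
    """Insert every k-mer of the contaminant sequence into a Bloom filter."""
    bloom = BloomFilter()
    for kmer in kmers(contaminant, k):
        bloom.add(kmer)
    return bloom


def contamination_score(read, bloom, k=12):
    """Fraction of the read's k-mers that are present in the filter."""
    hits = total = 0
    for kmer in kmers(read, k):
        total += 1
        hits += kmer in bloom
    return hits / total if total else 0.0
```

A read would then be discarded when its score exceeds a user-chosen threshold. A Bloom filter keeps memory bounded at the cost of occasional false positives, which is why such a structure is attractive when the set of potential contaminants is large.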

Conflict of interest statement

Ethics approval and consent to participate

Human cell sampling has been approved by the ethics committee of the University Medical Center Göttingen (Ethikkommission der Universitätsmedizin Göttingen), reference number 16/5/18An. All human participants granted written, informed consent.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
Workflow for preprocessing fastq files with **FastqPuri**. Qreport generates a quality report in html format for each sample, while Sreport generates one summary quality report for all samples. Depending on the size of the sequence file with potential contaminations, makeTree or makeBloom generates a data structure for filtering contaminations. trimFilter (or trimFilterPE for paired-end data) filters and trims reads containing adapters or adapter remnants, biological contaminations and low quality bases. On the filtered reads, Qreport and Sreport can be run again to ensure that the filtered data meets the user’s expectations. Legend: yellow: fastq files, red: **FastqPuri** executables, green: **FastqPuri** quality reports in html format

**Fig. 2**
Graphics shown in Qreport. a Data set overview and basic statistics. b Per base sequence quality box plots. The blue line corresponds to the mean quality value. c Cycle average quality, per tile, per lane. d Nucleotide content per position. e Proportion of low quality bases, per tile, per lane. f Fraction of low quality bases {A, C, G, T} per position, per tile and per lane. g Proportion of bases with quality scores below different thresholds, for all tiles, all lanes. h Number of reads with m low quality bases

**Fig. 3**
Run times (user plus CPU time in seconds) of **FastqPuri**’s Qreport versus other tools for three different datasets. The datasets represent different quality encodings (Phred+33 and Phred+64) as well as different sequence name formats. Timings for SolexaQA++ on Illumina 1.3+ data are not shown because the smallest value was around 10 min and all other values became invisibly small on that scale

**Fig. 4**
Memory usage (in MB) of **FastqPuri**’s Qreport versus other tools for three different datasets. The datasets represent different quality encodings (Phred+33 and Phred+64) as well as different sequence name formats

**Fig. 5**
Run times (user plus CPU time in seconds) of **FastqPuri**’s trimFilter and trimFilterPE to remove adapter sequences versus fastp and trimmomatic

**Fig. 6**
Run times (user plus CPU time in seconds) and memory usage (in GB) of **FastqPuri**’s trimFilter and RNA-QC-Chain to remove reads from human rRNA transcripts
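
The captions above refer to Phred+33 and Phred+64 quality encodings and to counts of low quality bases per read. As a small illustration of what these terms mean, the following Python sketch decodes a fastq quality string under either offset and counts bases below a threshold. The function names and the default threshold are assumptions made for this example and are not taken from FastqPuri.

```python
def phred_scores(quality, offset=33):
    """Convert a fastq quality string to integer Phred scores.

    offset is 33 for Phred+33 and 64 for Phred+64 encoded files.
    """
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Base-calling error probability implied by Phred score q."""
    return 10 ** (-q / 10)


def count_low_quality(quality, min_q=27, offset=33):
    """Number of bases whose Phred score is below min_q.

    min_q=27 is an arbitrary illustrative threshold.
    """
    return sum(q < min_q for q in phred_scores(quality, offset))
```

For example, the characters `I`, `5` and `#` decode to scores 40, 20 and 2 under Phred+33, while `h`, `T` and `B` give the same scores under Phred+64. Decoding a file with the wrong offset shifts every score by 31, which is why a tool has to know or detect the encoding before filtering.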




Grants and funding

031A428A/Bundesministerium für Bildung und Forschung
