Imeta. 2023 May 8;2(2):e107.
doi: 10.1002/imt2.107. eCollection 2023 May.

Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp

Shifu Chen. Imeta. 2023.

Abstract

With the continuous evolution of sequencing technology and the expansion of sequencing applications, a large amount of sequencing data is generated and processed every day. One consequence of this data explosion is the increasing cost and complexity of data processing. The preprocessing of FASTQ data, which means removing adapter contamination, filtering low-quality reads, and correcting wrongly represented bases, is an indispensable but resource-intensive part of sequencing data analysis. Therefore, although many software applications have been developed to solve this problem, bioinformatics scientists and engineers are still pursuing faster, simpler, and more energy-efficient software. Several years ago, the author developed fastp, an ultrafast all-in-one FASTQ data preprocessor with many modern features. This software has been well received by many bioinformatics users and has been continuously maintained and updated. Since the first publication on fastp, it has been greatly improved, making it even faster and more powerful. For instance, the duplication evaluation module has been improved, and a new deduplication module has been added. This study introduces the new features of fastp and demonstrates how it was designed and implemented.

Keywords: FASTQ; adapter; duplication; filtering; preprocessing; quality control.

Conflict of interest statement

The author declares no conflict of interest.

Figures

Figure 1
A selection of the interactive statistical plots produced by fastp. (A) The per‐cycle quality curves, and (B) the per‐cycle base content curves. (C) The distribution of evaluated insert size. For a small portion of reads the insert size remains unknown because the paired reads do not overlap, usually because the fragments are too long. (D) The statistics of overrepresented sequences, including their per‐cycle distribution.
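The per‐cycle curves in panels (A) and (B) can be illustrated with a minimal Python sketch. This is not fastp's implementation: the function name `per_cycle_stats`, the input shape (pairs of sequence and quality strings), and the Phred+33 quality encoding are all assumptions made for illustration.

```python
from collections import Counter


def per_cycle_stats(reads, phred_offset=33):
    """Return (mean quality per cycle, base fractions per cycle).

    `reads` is an iterable of (sequence, quality_string) pairs.
    Phred+33 encoding is assumed; this is an illustrative sketch only.
    """
    qual_sums, depths, base_counts = [], [], []
    for seq, qual in reads:
        for cycle, (base, qchar) in enumerate(zip(seq, qual)):
            # Grow the accumulators the first time a cycle is seen.
            if cycle == len(depths):
                qual_sums.append(0)
                depths.append(0)
                base_counts.append(Counter())
            qual_sums[cycle] += ord(qchar) - phred_offset
            depths[cycle] += 1
            base_counts[cycle][base] += 1
    mean_quality = [s / n for s, n in zip(qual_sums, depths)]
    base_content = [
        {base: count / n for base, count in counter.items()}
        for counter, n in zip(base_counts, depths)
    ]
    return mean_quality, base_content
```

Plotting `mean_quality` against the cycle index gives a curve of the kind shown in panel (A), and plotting each base's fraction from `base_content` gives curves of the kind shown in panel (B).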
Figure 2
Paired‐end data processing workflow of fastp. The workflow can be divided into a decompressor, a preprocessor, and a compressor. The paired input FASTQ files are decompressed individually into read packs, and each pack consists of a fixed number of read records. Each worker thread picks the odd or even read packs one by one, processes the reads, collects statistics, and outputs the clean data to the compressor in the same order.
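The pack‐based scheduling described in this caption can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not fastp's code: the names `make_packs`, `process_pack`, and `process_in_order` are hypothetical, the length filter stands in for the real preprocessing, and decompression and compression are omitted.

```python
import threading
from itertools import islice


def make_packs(read_pairs, pack_size):
    """Group read pairs into packs of a fixed number of records."""
    it = iter(read_pairs)
    packs = []
    while True:
        pack = list(islice(it, pack_size))
        if not pack:
            return packs
        packs.append(pack)


def process_pack(pack, min_len=4):
    """Placeholder preprocessing (assumption): keep a pair only if
    both reads are at least `min_len` bases long."""
    return [(r1, r2) for r1, r2 in pack if len(r1) >= min_len and len(r2) >= min_len]


def process_in_order(packs, n_workers=2):
    """Worker w handles packs w, w + n_workers, ... (with two workers,
    the even and the odd packs). Results are stored by pack index, so
    the output keeps the input order."""
    results = [None] * len(packs)

    def worker(worker_id):
        for i in range(worker_id, len(packs), n_workers):
            results[i] = process_pack(packs[i])

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Flatten the per-pack results in pack order.
    return [pair for pack in results for pair in pack]
```

Because each worker writes only to the slots of its own packs, no lock is needed in this sketch, and the surviving read pairs come out in their original order regardless of which thread finishes first.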
Figure 3
How fastp determines whether a read is unique or duplicated.
