PLoS One. 2013;8(4):e60234.
doi: 10.1371/journal.pone.0060234. Epub 2013 Apr 2.

QC-Chain: fast and holistic quality control method for next-generation sequencing data

Qian Zhou et al. PLoS One. 2013.

Abstract

Next-generation sequencing (NGS) technologies have been widely used in the life sciences. However, several kinds of sequencing artifacts, including low-quality reads and contaminating reads, are quite common in raw sequencing data and compromise downstream analysis. Quality control (QC) is therefore essential for raw NGS data. Although a few NGS quality-control tools are publicly available, they have two limitations. First, their processing speed cannot keep up with the rapid growth in data volume. Second, none of them can identify contaminating sources de novo; they rely heavily on prior knowledge of the contaminating species, which is usually not available in advance. Here we report QC-Chain, a fast, accurate and holistic NGS data quality-control method. It combines user-friendly tools for (1) quality assessment and trimming of raw reads using Parallel-QC, a fast read-processing tool; and (2) identification, quantification and filtration of unknown contamination to obtain high-quality clean reads. It is optimized for parallel computation, so its processing speed is significantly higher than that of other QC methods. Experiments on simulated and real NGS data showed that reads with low sequencing quality could be identified and filtered, and that possible contaminating sources could be identified and quantified de novo, accurately and quickly. Comparison between raw and processed reads also showed that the results of subsequent analyses (genome assembly, gene prediction, gene annotation, etc.) based on processed reads improved significantly in completeness and accuracy. With regard to processing speed, QC-Chain achieves a 7-8 fold speed-up through parallel computation compared with traditional methods. Therefore, QC-Chain is a fast and useful quality-control tool for read-quality processing and de novo contamination filtration of NGS reads, which could significantly facilitate downstream analysis. QC-Chain is publicly available at: http://www.computationalbioenergy.org/qc-chain.html.
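
To make the read-trimming step in the abstract concrete, here is a minimal sketch of 3'-end quality trimming followed by length filtering. It is not Parallel-QC's implementation. The Phred+33 encoding, the thresholds `min_q` and `min_len`, and all function names are assumptions chosen for illustration, and the sketch runs serially rather than in parallel.

```python
from typing import Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(text: str) -> Iterator[Record]:
    """Yield (header, sequence, quality) from plain 4-line FASTQ records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        # lines[i + 2] is the '+' separator and is skipped.
        yield lines[i], lines[i + 1], lines[i + 3]


def trim_3prime(seq: str, qual: str, min_q: int = 20) -> Tuple[str, str]:
    """Drop bases from the 3' end until one has Phred quality >= min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - PHRED_OFFSET < min_q:
        end -= 1
    return seq[:end], qual[:end]


def quality_filter(text: str, min_q: int = 20, min_len: int = 5) -> List[Record]:
    """Trim every read, then keep only reads that are still long enough."""
    kept: List[Record] = []
    for header, seq, qual in parse_fastq(text):
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len:
            kept.append((header, seq, qual))
    return kept


# Hypothetical two-read input: 'I' encodes Q40, '#' encodes Q2.
EXAMPLE = """@read1
ACGTACGTAC
+
IIIIIIII##
@read2
ACGTACGTAC
+
##########
"""

if __name__ == "__main__":
    for record in quality_filter(EXAMPLE):
        print(record)
```

In this toy input, the first read loses its two low-quality trailing bases and the second read is discarded because nothing remains after trimming. Because each read is processed independently, this kind of step can be split across workers, which is the property a parallel tool would rely on.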

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. The overall workflow of QC-Chain.
Figure 2. Evaluation of read quality on real NGS data by QC-Chain using Parallel-QC.
(A) Summary of sequencing-quality evaluation. (B) Comparison of running time of Parallel-QC, FASTX_Toolkit and PRINSEQ. S1, S2: human saliva DNA samples; A1, A2: in-house sequenced algae DNA samples. All sequences can be downloaded from http://computationalbioenergy.org/qc-chain.html.
Figure 3. Possible source species identified from simulated genomic data by rDNA-reads based method of QC-Chain.
(A) 18S reads distribution identified by rDNA-reads based method. (B) 16S reads distribution identified by rDNA-reads based method. (C) Quantitative distribution of the reads identified by random-reads based method.
Figure 4. Possible source species identified from simulated metagenomic data by rDNA-reads based method of QC-Chain.
(A) 18S reads distribution identified by rDNA-reads based method. (B) Quantitative distribution of the reads identified by random-reads based method.
Figure 5. Comparison of the results from clean, total and control reads of simulated metagenomic data.
(A) GC distribution pattern. (B) Rarefaction curve that could discriminate the richness of different bacterial species; Y-axis: species count, X-axis: number of reads. (C) Functional categories based on the COG database. Abundance of each category was compared pair-wise between the three datasets (*p<0.05; **p<0.01; NS: not significant).
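
A GC distribution pattern like the one in panel (A) can be approximated by binning per-read GC percentages. The sketch below is an illustration only and does not reproduce the authors' procedure. The 10-point bin width, the handling of ambiguous bases, and the function names are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def gc_fraction(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (A, C, G, T) in one read."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # read contains only ambiguous bases such as N
    return (seq.count("G") + seq.count("C")) / called


def gc_histogram(reads: Iterable[str], bin_width: int = 10) -> Dict[int, int]:
    """Count reads per GC-percentage bin, keyed by the bin's lower bound."""
    counts: Counter = Counter()
    for read in reads:
        pct = gc_fraction(read) * 100
        # Reads at exactly 100% GC fall into the last bin.
        bin_start = min(int(pct // bin_width) * bin_width, 100 - bin_width)
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Hypothetical reads with 0%, 50% and 100% GC content.
    print(gc_histogram(["ATAT", "ATGC", "GGCC"]))
```

Comparing such histograms for the clean, total and control read sets is one simple way to see whether contaminating reads shift the GC profile.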
