BMC Bioinformatics. 2016 Feb 2;17:56.
doi: 10.1186/s12859-016-0915-y.

ClinQC: a tool for quality control and cleaning of Sanger and NGS data in clinical research

Ram Vinay Pandey et al. BMC Bioinformatics.

Abstract

Background: Traditional Sanger sequencing has been the gold standard for single-gene genetic testing in the clinic, but it is cumbersome and expensive for testing several genes in heterogeneous diseases such as cancer. Next Generation Sequencing (NGS) technologies, which produce data at unprecedented speed and in a cost-effective manner, have overcome this limitation of Sanger sequencing. For efficient and affordable genetic testing, NGS has therefore been used as a complementary method alongside Sanger sequencing to identify and confirm disease-causing mutations in clinical research. However, identifying potential disease-causing mutations with high sensitivity and specificity requires high-quality sequencing data. Integrated software tools are still lacking that can analyze Sanger and NGS data together, eliminate platform-specific sequencing errors and low-quality reads, and support the analysis of data sets from several samples or patients in a single run.

Results: We have developed ClinQC, a flexible and user-friendly pipeline for format conversion, quality control, trimming and filtering of raw sequencing data generated by Sanger sequencing and three NGS platforms: Illumina, 454 and Ion Torrent. First, ClinQC converts input read files from their native formats to a common FASTQ format and removes adapters and PCR primers. Next, it splits bar-coded samples, filters out duplicates, contamination and low-quality sequences, and generates a QC report. ClinQC outputs high-quality reads in FASTQ format with Sanger quality encoding, which can be used directly in downstream analysis. It can analyze data from hundreds of samples or patients in a single run and generates unified output files for both Sanger and NGS data. We expect the tool to be very useful for quality control and format conversion of Sanger and NGS data, facilitating improved downstream analysis and mutation screening.

Conclusions: ClinQC is a powerful and easy-to-use pipeline for quality control and trimming in clinical research. It is written in Python with multiprocessing capability, runs on all major operating systems and is available at https://sourceforge.net/projects/clinqc.
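
The trimming and filtering stage described in the Results can be illustrated with a minimal Python sketch. This is not ClinQC's code: the abstract does not specify its algorithm or thresholds, so the function names (`read_fastq`, `trim_3prime`, `passes_filter`, `clean_reads`), the 3' end trimming rule and the cut-offs (`min_q=20`, `min_len=5`, `min_mean_q=20.0`) are all assumptions chosen for illustration. The only detail taken from the abstract is that reads are held in FASTQ format with Sanger (Phred+33) quality encoding.

```python
import io
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    seq: str
    qual: str  # Sanger (Phred+33) encoded quality string


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a simple four-line-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the parser
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield FastqRecord(header[1:], seq, qual)


def trim_3prime(rec: FastqRecord, min_q: int = 20) -> FastqRecord:
    """Remove trailing bases whose Phred score is below min_q (assumed rule)."""
    end = len(rec.qual)
    while end > 0 and ord(rec.qual[end - 1]) - 33 < min_q:
        end -= 1
    return rec._replace(seq=rec.seq[:end], qual=rec.qual[:end])


def passes_filter(rec: FastqRecord, min_len: int = 5,
                  min_mean_q: float = 20.0) -> bool:
    """Keep reads that are long enough and have a sufficient mean quality."""
    if not rec.seq or len(rec.seq) < min_len:
        return False
    mean_q = sum(ord(c) - 33 for c in rec.qual) / len(rec.qual)
    return mean_q >= min_mean_q


def clean_reads(records: Iterable[FastqRecord]) -> Iterator[FastqRecord]:
    """Trim each read, then drop those that fail the filter."""
    for rec in records:
        trimmed = trim_3prime(rec)
        if passes_filter(trimmed):
            yield trimmed


# Two toy reads: 'I' encodes Phred 40 and '#' encodes Phred 2.
example = io.StringIO("@r1\nACGTACGT\n+\nIIIIII##\n@r2\nACGT\n+\n####\n")
kept = list(clean_reads(read_fastq(example)))
print([(r.name, r.seq) for r in kept])  # [('r1', 'ACGTAC')]
```

In this toy input the first read loses its two low-quality trailing bases and is kept, while the second read is trimmed to nothing and discarded. A real pipeline would also need adapter and primer removal, which this sketch leaves out.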


Figures

Fig. 1
The workflow of the ClinQC pipeline. ClinQC can be run with a single command. The flow of analysis is depicted from top to bottom. The BASE CALLING step (violet) applies only to Sanger data analysis; the DEMULTIPLEXING and DUPLICATE & CONTAMINATION FILTERING steps (yellow) apply only to NGS data analysis; all other steps (green) apply to both analysis flows. ClinQC generates three final outputs.
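
The two NGS-only steps named in this caption, demultiplexing and duplicate filtering, can be sketched as follows. The approach here is an assumption made for illustration, not ClinQC's documented method: barcodes are matched exactly at the 5' end of the read, duplicates are defined as reads with identical sequences, and the names `demultiplex`, `drop_duplicates` and the `"unassigned"` bin are hypothetical. Contamination filtering is not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def demultiplex(reads: Iterable[Read],
                barcodes: Dict[str, str]) -> Dict[str, List[Read]]:
    """Assign each read to a sample by an exact 5' barcode match.

    `barcodes` maps barcode sequence -> sample name. The barcode is
    clipped from the read; unmatched reads go to 'unassigned'.
    """
    by_sample: Dict[str, List[Read]] = defaultdict(list)
    for name, seq, qual in reads:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                n = len(barcode)
                by_sample[sample].append((name, seq[n:], qual[n:]))
                break
        else:  # no barcode matched this read
            by_sample["unassigned"].append((name, seq, qual))
    return dict(by_sample)


def drop_duplicates(reads: Iterable[Read]) -> List[Read]:
    """Keep the first read seen for each distinct sequence."""
    seen = set()
    unique: List[Read] = []
    for name, seq, qual in reads:
        if seq not in seen:
            seen.add(seq)
            unique.append((name, seq, qual))
    return unique


reads = [
    ("a", "AAACGT", "IIIIII"),
    ("b", "AAACGT", "IIIIII"),  # same sequence as read "a"
    ("c", "CCCGGT", "IIIIII"),
    ("d", "GGGTTT", "IIIIII"),  # carries no known barcode
]
samples = demultiplex(reads, {"AAA": "patient1", "CCC": "patient2"})
deduplicated = {s: drop_duplicates(r) for s, r in samples.items()}
print(deduplicated["patient1"])  # [('a', 'CGT', 'III')]
```

Exact matching keeps the sketch short; tolerating sequencing errors in the barcode would require a mismatch-aware comparison.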
Fig. 2
The format conversion workflow of ClinQC. ClinQC takes raw reads in the native file format of their sequencing platform and returns unified FASTQ files with Sanger (PHRED) quality encoding.
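
One common case of converting to Sanger quality encoding is re-encoding older Illumina FASTQ quality strings. Sanger FASTQ stores each Phred score as the ASCII character with code score + 33, whereas Illumina pipeline versions 1.3 to 1.7 used score + 64. The sketch below shows that arithmetic only; the function name `phred64_to_phred33` is hypothetical, and how ClinQC itself detects and converts encodings is not described in this abstract.

```python
def phred64_to_phred33(qual: str) -> str:
    """Re-encode a Phred+64 quality string as Sanger Phred+33.

    Raises ValueError if a character cannot be a Phred+64 score,
    which usually means the input was already Phred+33.
    """
    converted = []
    for ch in qual:
        score = ord(ch) - 64  # decode the Phred+64 character
        if score < 0:
            raise ValueError(f"{ch!r} is not a valid Phred+64 character")
        converted.append(chr(score + 33))  # encode as Phred+33
    return "".join(converted)


# '@' is score 0, 'B' is score 2 and 'h' is score 40 in Phred+64.
print(phred64_to_phred33("@Bh"))  # !#I
```

The scores themselves are unchanged; only the ASCII offset differs, so the conversion shifts every character down by 31 code points.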
Fig. 3
ClinQC final output. (a) QC summary table generated for each run, which includes experimental, patient, sequencing and QC information, with one row per sample/patient; (b) QC report generated by FASTQC before (left) and after (right) quality control for each sample/patient, linked in the summary table; (c) FASTQ files with high-quality reads for each sample/patient, linked in the summary table.
Fig. 4
ClinQC quality control report generated by FASTQC. (a) Per base sequence quality before quality control and (b) per base sequence quality after quality control. ClinQC generates several useful QC plots for each patient’s FASTQ file before and after quality control. This makes it possible to compare directly the data quality and the number of reads before and after quality control.
