mSystems. 2018 Apr 24;3(3):e00202-17.
doi: 10.1128/mSystems.00202-17. eCollection 2018 May-Jun.

SHI7 Is a Self-Learning Pipeline for Multipurpose Short-Read DNA Quality Control

Gabriel A Al-Ghalith et al. mSystems. 2018.

Abstract

Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads into longer or higher-quality contigs. Many tools exist for each step, but choosing the appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the information about how their data were generated that they need to make informed choices, such as the expected overlap for paired-end sequences or the type of adaptors used. This increasing complexity and nuance demand a pipeline that combines existing steps in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced "shizen"), which aims to simplify quality control of short-read data for the end user by predicting the presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control.

IMPORTANCE Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, paired-endedness, etc.). Quality control protocols typically require applying this background knowledge to select and execute numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.

Keywords: QC; algorithm; bioinformatics; metagenomics; microbiome; pipeline; quality control; sequencing; short read.

Figures

FIG 1
Linear schematic of the basic quality control procedure for marker gene (microbiome) data. The process flows from removing known technical artifacts, to assembling short contiguous regions, to trimming remaining contamination poststitching and creating a final set (or optionally, single pooled file) of sequences in the desired format (FASTA or FASTQ). Notable exceptions to this procedure exist: for instance, pairs may not be stitchable depending on the insert size for shotgun sequencing.
FIG 2
Histograms of stitched read lengths in a single Human Microbiome Project (HMP) metagenomic sample (a) and a 16S V4 Primate Microbiome Project (PMP) sample (b). (a) A shotgun metagenomic sample produces stitched contigs spanning a range of lengths. The truncation at 185 bp results from enforcing a minimum overlap length of 15 base pairs: in a data set consisting of 100-bp reads, 185 bp is the maximum allowable stitched length (100 + 100 − 15). Because the mean of this distribution is 148.6 and its standard deviation is 20.62, the coefficient of variation (CV) is 0.139, above the 0.1 threshold under which the data would be considered amplicon-like by default; the data are hence considered shotgun reads by SHI7. (b) A 16S amplicon sample produces a distinct histogram marked by high representation of certain contig lengths corresponding to the target gene size, in this case 252 and 253 base pairs, and a much lower CV (mean = 254.4, SD = 15.7; CV = 0.062). Most residual longer reads match PhiX174, an Illumina control contaminant, and are later removed by SHI7 in “learning mode” by filtering out sequences within a mean read length ± SD/2 in amplicon samples.
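The arithmetic in the FIG 2 caption can be reproduced with a short Python sketch. This is an illustration only, not SHI7's actual code: the function names are hypothetical, the use of the population standard deviation is an assumption (the caption does not say which estimator is used), and the 0.1 cutoff is the default threshold quoted in the caption.

```python
from statistics import mean, pstdev

# Default CV cutoff quoted in the FIG 2 caption; below it, data are amplicon-like.
AMPLICON_CV_THRESHOLD = 0.1


def coefficient_of_variation(lengths):
    """Return SD / mean for a sequence of stitched read lengths.

    Assumption: population SD (pstdev); the caption does not specify.
    """
    return pstdev(lengths) / mean(lengths)


def classify_from_stats(mean_len, sd_len, threshold=AMPLICON_CV_THRESHOLD):
    """Label a sample 'amplicon' or 'shotgun' from summary statistics."""
    cv = sd_len / mean_len
    label = "amplicon" if cv < threshold else "shotgun"
    return label, cv


def max_stitched_length(read_len, min_overlap):
    """Longest contig two reads of equal length can form given a minimum overlap."""
    return 2 * read_len - min_overlap


def half_sd_window(mean_len, sd_len):
    """Return the (low, high) bounds of the mean ± SD/2 length window."""
    return mean_len - sd_len / 2, mean_len + sd_len / 2
```

With the values from the caption, `classify_from_stats(148.6, 20.62)` labels panel (a) as shotgun and `classify_from_stats(254.4, 15.7)` labels panel (b) as amplicon, while `max_stitched_length(100, 15)` gives the 185-bp truncation point. `half_sd_window` only computes the bounds of the length window; how SHI7 applies that window when filtering is not specified beyond the caption.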
FIG 3
Comparison of illustrative BLAST alignments before and after SHI7 quality control on the same reads of an HMP shotgun sample. Panel a (top) shows the SHI7 QC read (right) achieving a different best-scoring alignment than the non-QC read (left) despite the former’s slightly lower identity (SHI7 alignment, 94% and E value of 1e−55; non-QC, 96% and E value of 2e−35). The same reference as in the non-QC alignment also appears for the SHI7 QC read with the same identity (96%) and 90% coverage, but in third place. Panel b (bottom) shows a different alignment; here the SHI7 QC read (right) finds the same best match as the non-QC read (left), but at higher identity and lower E value (SHI7, 96% and E value of 8e−72; non-QC, 94% and E value of 1e−32). The case demonstrated by panel a occurs less frequently than panel b for this test data but may have additional important implications for pipelines relying on “best-match” read mapping.
