Software for pre-processing Illumina next-generation sequencing short read sequences
- PMID: 24955109
- PMCID: PMC4064128
- DOI: 10.1186/1751-0473-9-8
Abstract
Background: When compared to Sanger sequencing technology, next-generation sequencing (NGS) technologies are hindered by shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of their downstream analyses, e.g. de novo and reference-based assembly, by introducing sequencing artifacts and errors that may contribute to incorrect interpretation of data. Although many tools have been developed for quality control and pre-processing of NGS data, none of them provide flexible and comprehensive trimming options in conjunction with parallel processing to expedite pre-processing of large NGS datasets.
Methods: We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open-source software package written in Perl that provides a set of algorithms commonly used for pre-processing NGS short read sequences. We compared the features and performance of ngsShoRT with those of existing tools: CutAdapt, NGS QC Toolkit, and Trimmomatic. We also compared the effects of using pre-processed short read sequences generated by different algorithms on de novo and reference-based assembly for three different genomes: Caenorhabditis elegans, Saccharomyces cerevisiae S288c, and Escherichia coli O157:H7.
Results: Several combinations of ngsShoRT algorithms were tested on publicly available Illumina GA II, HiSeq 2000, and MiSeq eukaryotic and bacterial genomic short read sequences, with a focus on removing sequencing artifacts and low-quality reads and/or bases. Our results show that across three organisms and three sequencing platforms, trimming improved the mean quality scores of the trimmed sequences. Using trimmed sequences for de novo and reference-based assembly improved assembly quality as well as assembler performance. In general, ngsShoRT outperformed comparable trimming tools in terms of trimming speed and improvement of de novo and reference-based assembly, as measured by assembly contiguity and correctness.
Conclusions: Trimming of short read sequences can improve the quality of de novo and reference-based assembly as well as assembler performance. The parallel processing capability of ngsShoRT reduces trimming time and improves memory efficiency when dealing with large datasets. We recommend combining sequencing artifact removal with quality-score-based read filtering and base trimming as the most consistent method for improving sequence quality and downstream assemblies. ngsShoRT source code, user guide, and tutorial are available at http://research.bioinformatics.udel.edu/genomics/ngsShoRT/. ngsShoRT can be incorporated as a pre-processing step in genome and transcriptome assembly projects.
Keywords: De novo assembly; Illumina; Next-generation sequencing; Perl; Reference-based assembly; Trimming.
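To make the recommended combination of quality-score-based base trimming and read filtering concrete, the following is a minimal Python sketch of the general technique. It is not ngsShoRT's implementation (ngsShoRT is written in Perl, and its exact algorithms and options are not described in this abstract). The function names, the thresholds (min_q, min_mean_q, min_len), and the Phred+33 quality encoding are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

# Assumption: qualities use the Phred+33 (Sanger / Illumina 1.8+) encoding.
PHRED_OFFSET = 33

# A read is modelled as (header, sequence, quality string); hypothetical type.
Read = Tuple[str, str, str]


def phred_scores(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq: str, qual: str, min_q: int = 20) -> Tuple[str, str]:
    """Remove bases from the 3' end while their score is below min_q."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def preprocess(
    read: Read,
    min_q: int = 20,          # illustrative per-base trimming threshold
    min_mean_q: float = 20.0,  # illustrative mean-quality filter threshold
    min_len: int = 4,          # illustrative minimum read length
) -> Optional[Read]:
    """Trim the 3' end, then drop reads that are too short or too poor.

    Returns the trimmed read, or None if the read is filtered out.
    """
    header, seq, qual = read
    seq, qual = trim_3prime(seq, qual, min_q)
    if not seq or len(seq) < min_len:
        return None
    scores = phred_scores(qual)
    if sum(scores) / len(scores) < min_mean_q:
        return None
    return header, seq, qual


if __name__ == "__main__":
    reads = [
        ("@r1", "ACGTACGT", "IIIIII##"),  # good read with a poor 3' tail
        ("@r2", "ACGT", "####"),          # uniformly poor read
        ("@r3", "ACGTAC", "5#5#5I"),      # good last base, poor mean quality
    ]
    for read in reads:
        print(read[0], preprocess(read))
```

In this sketch, trimming runs before filtering so that a read with a good core and a degraded 3' tail is kept in shortened form rather than discarded, while a read whose quality is poor throughout is removed. Sequencing artifact removal (for example, adapter clipping) would be an additional step before these two and is not shown here.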