BMC Bioinformatics. 2014;15 Suppl 11(Suppl 11):S10.
doi: 10.1186/1471-2105-15-S11-S10. Epub 2014 Oct 21.

SeqAssist: a novel toolkit for preliminary analysis of next-generation sequencing data

Yan Peng et al. BMC Bioinformatics. 2014.

Abstract

Background: While next-generation sequencing (NGS) technologies are advancing rapidly, the development of efficient and user-friendly tools for preliminary analysis of massive NGS data lags behind. To fill this gap, keep pace with technological advancement, and accelerate data-to-results turnaround, we developed a novel software package named SeqAssist ("Sequencing Assistant" or SA).

Results: SeqAssist takes NGS-generated FASTQ files as input, employs the BWA-MEM aligner for sequence alignment, and aims to provide a quick overview and basic statistics of NGS data. It consists of three separate workflows: (1) the SA_RunStats workflow generates basic statistics about an NGS dataset, including the numbers of raw, cleaned, redundant and unique reads, the redundancy rate, and a list of unique sequences with length and read count; (2) the SA_Run2Ref workflow estimates the breadth, depth and evenness of genome-wide coverage of the NGS dataset at nucleotide resolution; and (3) the SA_Run2Run workflow compares two NGS datasets to determine the redundancy (overlapping rate) between the two NGS runs. Statistics produced by SeqAssist or derived from SeqAssist output files are designed to tell the user whether, over what percentage, how many times and how evenly a genomic locus (i.e., gene, scaffold, chromosome or genome) is covered by sequencing reads, and how redundant the sequencing reads are within a single run or between two runs. These statistics can guide the user in evaluating the quality of a DNA library prepared for RNA-Seq or genome (re-)sequencing and in deciding the number of sequencing runs required for the library. We have tested SeqAssist using a synthetic dataset and demonstrated its main features using multiple NGS datasets generated from genome re-sequencing experiments.
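The kind of per-run and between-run statistics described above can be illustrated with a minimal Python sketch. This is not SeqAssist's implementation: the function names `run_stats` and `run_overlap` are hypothetical, and the definitions used here are assumptions that the abstract does not spell out (cleaning only removes reads containing N, every copy of a sequence beyond the first counts as redundant, and two reads "overlap" when their sequences are identical, whereas SeqAssist itself relies on BWA-MEM alignment).

```python
from collections import Counter


def _clean(reads):
    """Assumption: cleaning only drops reads that contain the base N."""
    return [r.upper() for r in reads if "N" not in r.upper()]


def run_stats(reads):
    """Summarize one run: raw, cleaned, unique and redundant read counts."""
    cleaned = _clean(reads)
    counts = Counter(cleaned)  # unique sequence -> read count
    unique = len(counts)
    # Assumption: every copy beyond the first of a sequence is redundant.
    redundant = len(cleaned) - unique
    rate = redundant / len(cleaned) if cleaned else 0.0
    # Unique sequences with length and read count, most frequent first.
    table = [(seq, len(seq), n) for seq, n in counts.most_common()]
    return {
        "raw": len(reads),
        "cleaned": len(cleaned),
        "unique": unique,
        "redundant": redundant,
        "redundancy_rate": rate,
        "unique_table": table,
    }


def run_overlap(reads_a, reads_b):
    """Fraction of each run's cleaned reads whose sequence occurs in the other run.

    Assumption: exact sequence identity stands in for the matching criterion.
    """
    a, b = Counter(_clean(reads_a)), Counter(_clean(reads_b))
    shared = a.keys() & b.keys()
    total_a, total_b = sum(a.values()), sum(b.values())
    in_a = sum(a[s] for s in shared)
    in_b = sum(b[s] for s in shared)
    return {
        "overlap_rate_a": in_a / total_a if total_a else 0.0,
        "overlap_rate_b": in_b / total_b if total_b else 0.0,
    }
```

Under these assumptions, a high redundancy rate within a run or a high overlap rate between runs would suggest that additional sequencing of the same library adds little new information.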

Conclusions: SeqAssist is a useful and informative tool that can serve as a valuable "assistant" to a broad range of investigators who conduct genome re-sequencing, RNA-Seq, or de novo genome sequencing and assembly experiments.

Figures

Figure 1
SeqAssist (SA) workflows: (a) SA_RunStats, (b) SA_Run2Ref, and (c) SA_Run2Run. The output of each workflow is described in detail in the Implementation section.
Figure 2
Distribution of scaffold coverage breadth and depth generated in the output files of the SA_Run2Ref workflow for two genome re-sequencing datasets produced for the same ECT gDNA library and their combination: (a) ECT, (b) ECT_rerun, and (c) ECT + ECT_rerun. See Table 1 for more information about the sequencing runs. Breadth and depth bins are open at the lower end and closed at the higher end, and breadth is expressed as percentage. For instance, (0.3, 0.4] stands for 30% < breadth ≤ 40%, and (0, 1] stands for 0 < depth ≤ 1.
Figure 3
Change in genome coverage breadth, depth and evenness as more sequencing runs for the same TCO library were pooled and used as the input of SA_Run2Ref. See Table 2 for the sequencing runs pooled to form reads collections.
Figure 4
Change in the distribution of scaffold coverage breadth and depth as more sequencing runs for the same TCO library were pooled and used as the input of SA_Run2Ref. Shown are distributions for three reads collections: (a) LF1, (b) LF1-5, and (c) LF1-5SF1-5. See Table 2 for the sequencing runs pooled to form reads collections. Breadth and depth bins are open at the lower end and closed at the higher end, and breadth is expressed as percentage. For instance, (0.3, 0.4] stands for 30% < breadth ≤ 40%, and (0, 1] stands for 0 < depth ≤ 1.
Figure 5
Breakdown of cleaned reads from two sequencing runs (ECT and ECT_rerun) into overlapping and non-overlapping reads based on the output from SA_Run2Run (see Table 3 for more information). "N" represents reads containing N that were removed during preprocessing.
Figure 6
Memory usage recorded every 5 seconds when running the human genome re-sequencing data through the three SeqAssist workflows.
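
The bin notation used in the Figure 2 and Figure 4 captions, with bins open at the lower end and closed at the higher end, can be made concrete with a short Python sketch. The helper names `breadth_and_depth` and `bin_label` are hypothetical, and two choices are assumptions rather than anything stated in the captions: depth is averaged over all positions of a scaffold (not only the covered ones), and a value of exactly zero is given its own label because it falls outside every (lo, hi] bin. Evenness is left out because its definition is not given here.

```python
import math


def breadth_and_depth(per_base_depth):
    """Return (breadth, mean depth) for one scaffold's per-base read depths.

    Breadth is the fraction of positions covered by at least one read.
    Assumption: mean depth is taken over all positions, covered or not.
    """
    n = len(per_base_depth)
    if n == 0:
        return 0.0, 0.0
    covered = sum(1 for d in per_base_depth if d > 0)
    return covered / n, sum(per_base_depth) / n


def bin_label(value, width):
    """Label the bin (lo, hi] that contains value, for bins of equal width."""
    if value <= 0:
        return "0"  # assumption: zero coverage is reported separately
    # Round before ceil so float noise (0.4 / 0.1 == 4.000000000000001)
    # does not push a boundary value into the next bin.
    k = math.ceil(round(value / width, 9))
    return f"({(k - 1) * width:g}, {k * width:g}]"
```

For example, a breadth of exactly 0.4 is labeled `(0.3, 0.4]` because the upper end of each bin is closed, while a depth of 1 is labeled `(0, 1]`.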

