J Comput Biol. 2017 Aug;24(8):746-755.
doi: 10.1089/cmb.2017.0021. Epub 2017 Apr 17.

Zseq: An Approach for Preprocessing Next-Generation Sequencing Data

Abedalrhman Alkhateeb et al. J Comput Biol. 2017 Aug.

Abstract

Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, introduces artifacts, so preprocessing of the sequences is mandatory before downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq measures the complexity of each sequence by counting its unique k-mers and using that count as the sequence's score; it also takes into account other factors such as ambiguous nucleotides or a high GC-content percentage in k-mers. Zseq then sweeps through the sequences again and filters out those whose z-score is less than a user-defined threshold. The Zseq algorithm provides a better mapping rate and significantly reduces the number of ambiguous bases in comparison with other methods. The filtered reads were evaluated by aligning them and assembling the transcripts using the reference genome, as well as by de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover, de novo assembled transcripts from the reads filtered by Zseq have longer genomic sequences than those from other tested methods. A procedure for estimating the cutoff threshold using labeling rules is also introduced, with promising results.

Keywords: RNA-SEQ analysis; machine learning; next-generation sequencing; preprocessing.
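
The scoring and filtering steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the default k, the normalization (distinct k-mers divided by the number of k-mer positions), and the rule of skipping any k-mer that contains N are all assumptions made for illustration, and the GC-content adjustment mentioned in the abstract is omitted.

```python
from statistics import mean, pstdev


def uniqueness_score(read, k=4):
    """Normalized count of distinct k-mers in a read (hypothetical scoring).

    Assumption: k-mers containing the ambiguous base N do not count,
    and the count is divided by the number of k-mer positions.
    """
    positions = len(read) - k + 1
    if positions <= 0:
        return 0.0
    kmers = {read[i:i + k] for i in range(positions)}
    informative = {km for km in kmers if "N" not in km}
    return len(informative) / positions


def zscore_filter(reads, k=4, threshold=-1.0):
    """Keep reads whose score z-score is at least `threshold`.

    First pass: score every read. Second pass: convert each score to a
    z-score against the sample mean and standard deviation, then drop
    reads that fall below the user-defined threshold.
    """
    scores = [uniqueness_score(r.upper(), k) for r in reads]
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All reads score identically, so none stands out as low complexity.
        return list(reads)
    return [r for r, s in zip(reads, scores) if (s - mu) / sigma >= threshold]


if __name__ == "__main__":
    sample = [
        "ACGTACGGTCAAGCTTGACC",
        "AAAAAAAAAAAAAAAAAAAA",  # low-complexity read
        "ACGTTGCAAGGCTTACGATC",
        "GATTACAGGCTCAATGCCGT",
    ]
    for read in zscore_filter(sample, k=4, threshold=-1.0):
        print(read)
```

Both passes touch each read once, so for a fixed k the running time grows linearly with the total length of the input, which is in line with the abstract's description of Zseq as a linear method.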

Conflict of interest statement

No competing financial interests exist.

Figures

FIG. 1. Schematic representation of the process for filtering reads using the Zseq method.
FIG. 2. Distribution of the normalized uniqueness scores for all reads in sample (SRR202054) (formula image).
FIG. 3. Distribution of the z-scores of the normalized uniqueness scores corresponding to each read for sample (SRR202054).
FIG. 4. Percentage of GC content for all filtered reads using the Zseq histogram with formula image and formula image.
FIG. 5. Percentage of GC content for all filtered reads using the DUST histogram with formula image and formula image.
FIG. 6. Biologically meaningful human genomic sequences found using BLAST. De novo assembled transcripts using original reads.
FIG. 7. Biologically meaningful human genomic sequences found using BLAST. De novo assembled transcripts using reads filtered by DUST.
FIG. 8. Biologically meaningful human genomic sequences found using BLAST. De novo assembled transcripts using reads filtered by Zseq.
FIG. 9. An example of two transcripts, one with separable FPKM values (a) and another with inseparable FPKM values (b).
