Sci Rep. 2019 Mar 26;9(1):5133.
doi: 10.1038/s41598-019-41502-9.

Improving in-silico normalization using read weights

Dilip A Durai et al. Sci Rep. 2019.

Abstract

Specialized de novo assemblers for diverse data types have been developed and are in widespread use for the analysis of single-cell genomics, metagenomics and RNA-seq data. However, assembly of the large sequencing datasets produced by modern technologies is challenging and computationally intensive. In-silico read normalization has been suggested as a computational strategy to reduce redundancy in read datasets, which leads to significant speedups and memory savings in assembly pipelines. Previously, we presented ORNA, an approach based on set multi-cover optimization, in which reads are removed without losing the k-mer connectivity information used in assembly graphs. Here we propose two extensions to ORNA, named ORNA-Q and ORNA-K, which formulate in-silico read normalization as a weighted set multi-cover optimization problem. These formulations use either the base quality scores obtained from sequencers (ORNA-Q) or the k-mer abundances of reads (ORNA-K) to improve normalization further. We devise efficient heuristic algorithms for solving both formulations. In applications to human RNA-seq data, ORNA-Q and ORNA-K are shown to assemble at least as many full-length transcripts as other normalization methods at similar or higher read reduction values. The algorithms are implemented in the latest version of ORNA (v2.0, https://github.com/SchulzLab/ORNA).
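To make the weighted multi-cover idea concrete, the following is a minimal Python sketch of a quality-aware greedy heuristic in the spirit of ORNA-Q. It is not the ORNA v2.0 implementation. The function names (`kmers`, `mean_phred`, `normalize`), the Phred+33 encoding, the use of mean read quality as the read weight, and the log-scaled coverage target per k-mer (suggested only by the "log base parameter b" mentioned in the Figure 3 caption) are all assumptions made for illustration.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mean_phred(qual, offset=33):
    """Mean Phred score of a non-empty quality string (Phred+33 is an assumption)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def normalize(reads, k=5, base=1.7):
    """Greedy sketch of quality-weighted read normalization (hypothetical helper).

    reads: list of (sequence, quality_string) tuples.
    Returns the indices of the reads that are kept, in input order.
    """
    # Abundance of a k-mer = number of reads that contain it.
    abundance = Counter(km for seq, _ in reads for km in kmers(seq, k))

    # Assumed coverage target: log-scaled abundance, but at least one read.
    need = {km: max(1, math.ceil(math.log(n, base)))
            for km, n in abundance.items()}

    # Visit high-quality reads first; the stable sort keeps input order on ties.
    order = sorted(range(len(reads)), key=lambda i: -mean_phred(reads[i][1]))

    covered = Counter()
    kept = []
    for i in order:
        kms = kmers(reads[i][0], k)
        # Keep the read only if it still helps some k-mer reach its target.
        if any(covered[km] < need[km] for km in kms):
            kept.append(i)
            covered.update(kms)
    return sorted(kept)


reads = [
    ("ACGTACGT", "IIIIIIII"),  # high quality
    ("ACGTACGT", "!!!!!!!!"),  # lowest quality, redundant
    ("ACGTACGT", "55555555"),  # medium quality
    ("TTTTTTTT", "########"),  # low quality, but carries a unique k-mer
]
print(normalize(reads, k=5, base=2))
```

In this toy input, the lowest-quality copy of the repeated read is the one discarded, while the low-quality read carrying a unique k-mer survives. An abundance-driven variant in the spirit of ORNA-K would swap the sort key for a per-read abundance score.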

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Distribution of the average read quality score (a) and the average read abundance score (b) in the brain RNA-seq data. Position-wise distribution of the average read quality score (c) and the average read abundance score (d) in the brain dataset. Reads in both datasets were divided into bins of 1 million (x-axis). Each bin was then treated as a partial dataset, and the scores ($\bar{Q}_x$ and $\bar{K}_x$) were calculated for each bin (y-axis).
Figure 2
Comparison of ORNA-Q (a) and ORNA-K (b) against ORNA applied to different read orderings (x-axis) for the brain dataset. Order 1 denotes the original dataset ordering. Orders 2–4 were obtained by random reshuffling of the reads. The average scores of the reads in the reduced dataset are shown on the y-axis. All of these orderings result in a similar amount of reduction.
Figure 3
Effect of varying the log base parameter b (x-axis) on the average read weight (y-axis), (a) $\bar{Q}(R)$ and (b) $\bar{K}(R)$, of the normalized brain datasets. The black and grey bars represent normalization using ORNA-Q/-K and ORNA, respectively.
Figure 4
Comparison of assemblies generated from normalized datasets. The % of reads reduced (x-axis) by a normalization algorithm is compared against % of complete (y-axis: an assembly performance measure). Each point on a line corresponds to a different parametrization of the algorithms. Panels (a) and (b) show TransABySS assemblies (k = 21) of the normalized brain and HeLa data, respectively.
