G3 (Bethesda). 2017 Dec 4;7(12):3839-3848.

doi: 10.1534/g3.117.300271.

ARSDA: A New Approach for Storing, Transmitting and Analyzing Transcriptomic Data

Xuhua Xia¹,²

Affiliations

¹ Department of Biology, University of Ottawa, Ontario K1N 6N5, Canada xxia@uottawa.ca.
² Ottawa Institute of Systems Biology, Ontario K1H 8M5, Canada xxia@uottawa.ca.

PMID: 29079682
PMCID: PMC5714481
DOI: 10.1534/g3.117.300271

Xuhua Xia. G3 (Bethesda). 2017.

Abstract

Two major stumbling blocks exist in high-throughput sequencing (HTS) data analysis. The first is the sheer file size, typically in gigabytes when uncompressed, which causes problems in storage, transmission, and analysis. However, these files do not need to be so large and can be reduced without loss of information. Each HTS file, whether in compressed .SRA or plain-text .fastq format, contains numerous identical reads stored as separate entries. For example, among 44,603,541 forward reads in the SRR4011234.sra file (from a Bacillus subtilis transcriptomic study) deposited at NCBI's SRA database, one read has 497,027 identical copies. Instead of storing them as separate entries, one can and should store them as a single entry in the SeqID_NumCopy format (which I dub the FASTA+ format). The second stumbling block is the proper allocation of reads that map equally well to paralogous genes. I illustrate in detail a new method for such allocation. I have developed the ARSDA software, which implements these new approaches. A number of HTS files for model species are being processed and deposited at http://coevol.rdc.uottawa.ca to demonstrate that this approach not only saves a huge amount of storage space and transmission bandwidth, but also dramatically reduces time in downstream data analysis. Instead of matching the 497,027 identical reads separately against the B. subtilis genome, one needs to match the read only once. ARSDA includes functions that take advantage of HTS data in the new sequence format for downstream analyses such as gene expression characterization. I contrasted gene expression results between ARSDA and Cufflinks so that readers can better appreciate the strength of ARSDA. ARSDA is freely available for Windows, Linux, and Macintosh computers at http://dambe.bio.uottawa.ca/ARSDA/ARSDA.aspx.

Keywords: ARSDA; novel storage solution; quantifying expression of paralogous genes; sequence format; transcriptomics.
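
The idea of collapsing identical reads into single SeqID_NumCopy entries can be sketched in a few lines of Python. This is a minimal illustration, not ARSDA's implementation: the `Seq` ID prefix, the function names, and the assumption of strict four-line FASTQ records are all hypothetical choices made for the example. The sketch also discards quality scores, so it corresponds to a FASTA+ style output only.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def read_fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each record, assuming strict 4-line FASTQ."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # line 2 of every 4-line record holds the read
            yield line.strip()


def collapse_reads(sequences: Iterable[str]) -> List[Tuple[str, str]]:
    """Collapse identical reads into (SeqID_NumCopy, sequence) pairs.

    The 'Seq<rank>' ID scheme is an assumption made for this sketch.
    Entries are ordered from the most to the least abundant read.
    """
    counts = Counter(sequences)
    return [
        (f"Seq{rank}_{n}", seq)
        for rank, (seq, n) in enumerate(counts.most_common(), start=1)
    ]


def format_fasta_plus(records: Iterable[Tuple[str, str]]) -> str:
    """Render collapsed records as FASTA-style text, one entry per unique read."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


def copy_number(name: str) -> int:
    """Recover the copy number from a SeqID_NumCopy header."""
    return int(name.rsplit("_", 1)[1])
```

Passing the lines of a FASTQ file through `read_fastq_sequences` and `collapse_reads` gives one record per unique read. A downstream tool then needs to match each unique read against the genome only once and weight the hit by `copy_number`, which is where the time saving described in the abstract would come from.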

Figures

**Figure 1**
User interface in ARSDA. (A) The menu system, with database creation under the “Database” menu, gene expression characterization under the “Analysis” menu, etc. (B) Converting a FASTQ/FASTA file to a FASTQ+/FASTA+ file. (C) Site-specific read quality visualization. (D) Global read quality visualization.

**Figure 2**
Contrasting read quality between two transcriptomic data files (SRR5484239.sra from *M. musculus* and SRR922267.sra from *E. coli*). This contrast does not imply that *E. coli* data are always better than mouse data, as there are also poor-quality *E. coli* data and high-quality mouse data.

**Figure 3**
Allocation of shared reads in a gene family with three paralogous genes A, B, and C, each idealized as three segments: a strongly homologous first segment that is identical in B and C, a conserved, identical middle segment, and a diverged third segment. Reads and the gene segments they match are shown in the same color.

**Figure 4**
Contrast in gene expression (RPKM) between ARSDA and Cufflinks output for the same transcriptomic data in file SRR1536586.sra for *E. coli* wild type.

**Figure 5**
Phylogenetic relationship among paralogous genes *cspA* to *cspI* in *E. coli*, based on coding sequences, with bootstrap values next to internal nodes. Sequences were aligned by MAFFT (Katoh and Toh 2008) with the accurate L-INS-i option and a maximum of 16 iterations. Coding sequences were first translated into amino acid sequences, which were aligned with the BLOSUM62 matrix. Nucleotide sequences were then aligned against the aligned amino acid sequences. Phylogenetic analysis was done with PhyML (Guindon *et al.* 2010). All these analyses were automated in DAMBE (Xia 2013, 2017).

