BMC Genomics. 2022 Feb 4;23(1):97.
doi: 10.1186/s12864-021-08278-7.

Assembly-free rapid differential gene expression analysis in non-model organisms using DNA-protein alignment

Anish M S Shrestha et al. BMC Genomics. 2022.

Abstract

Background: RNA-seq is being increasingly adopted for gene expression studies in a panoply of non-model organisms, with applications spanning the fields of agriculture, aquaculture, ecology, and environmental science. For organisms that lack a well-annotated reference genome or transcriptome, a conventional RNA-seq data analysis workflow requires constructing a de novo transcriptome assembly and annotating it against a high-confidence protein database. The assembly serves as a reference for read mapping, and the annotation is necessary for functional analysis of genes found to be differentially expressed. However, assembly is computationally expensive. It is also prone to errors that impact expression analysis, especially since sequencing depth is typically much lower for expression studies than for transcript discovery.

Results: We propose a shortcut, in which we obtain counts for differential expression analysis by directly aligning RNA-seq reads to the high-confidence proteome that would otherwise have been used for annotation. By avoiding assembly, we drastically cut down computational costs: the running time on a typical dataset improves from the order of tens of hours to under half an hour, and the memory requirement is reduced from the order of tens of Gbytes to tens of Mbytes. We show through experiments on simulated and real data that our pipeline not only reduces computational costs, but also has higher sensitivity and precision than a typical assembly-based pipeline. A Snakemake implementation of our workflow is available at: https://bitbucket.org/project_samar/samar.
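The counting step described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `count_reads_per_protein`, the `(read_id, protein_id, score)` tuple format, and the rule of assigning each read to its single best-scoring protein are all assumptions made for illustration. The actual pipeline may resolve reads that align to several proteins differently.

```python
from collections import Counter


def count_reads_per_protein(alignments):
    """Count reads per protein from DNA-protein alignments.

    `alignments` is an iterable of (read_id, protein_id, score) tuples.
    This tuple format is a hypothetical simplification, not the output
    format of any particular aligner.
    """
    best = {}  # read_id -> (protein_id, score) of the best hit seen so far
    for read_id, protein_id, score in alignments:
        # Assumption: keep only the highest-scoring protein for each read.
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (protein_id, score)
    # One count per read, credited to its best-scoring protein.
    return Counter(protein_id for protein_id, _ in best.values())


# Toy input: read r1 aligns to two proteins, P1 with the higher score.
example = [
    ("r1", "P1", 50.0),
    ("r1", "P2", 42.0),
    ("r2", "P2", 61.0),
    ("r3", "P1", 30.0),
]
counts = count_reads_per_protein(example)
print(dict(counts))
```

A per-protein count table of this kind, built for each sample, is the sort of input that a count-based differential expression tool expects.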

Conclusions: The flip side of RNA-seq becoming accessible to even modestly resourced labs has been that the time, labor, and infrastructure cost of bioinformatics analysis has become a bottleneck. Assembly is one such resource-hungry process, and we show here that it can be avoided for quick and easy, yet more sensitive and precise, differential gene expression analysis in non-model organisms.

Keywords: DNA-protein alignment; Differential gene expression analysis; Non-model organisms; RNA-seq; Transcriptome assembly.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
(Left) Precision-recall curves of our method using the D. melanogaster proteome reference and the Bowtie2-RSEM-DESeq2 pipeline using its transcriptome reference. The three shape markers in each curve correspond to setting the FDR threshold of DESeq2 to 0.01, 0.05, and 0.1. (Right) Log fold change of true positive DE genes estimated by DESeq2 at an FDR threshold of 0.1, compared against the 6 simulated log fold change levels.
Fig. 2
Precision-recall curves of our method and the assembly-based pipeline, when using reference proteomes of close relatives (D. ananassae and D. grimshawi) and a distant relative (A. gambiae). The three shape markers in each curve correspond to FDR thresholds of 0.01, 0.05, and 0.1.
Fig. 3
Precision-recall curves of our method and the assembly-based pipeline, when using reference proteomes of close relatives (D. ananassae and D. grimshawi) and a distant relative (A. gambiae), after removing the alignments produced by Dammit that covered less than 50% of the contig length. The three shape markers in each curve correspond to FDR thresholds of 0.01, 0.05, and 0.1.
Fig. 4
For the three reference proteomes, Venn diagrams showing the intersections among (1) the Baseline set, consisting of DE genes called by the baseline approach of Bowtie2-RSEM-DESeq2 using the D. melanogaster reference transcriptome; (2) Our set, consisting of D. melanogaster genes to which the DE genes called by our approach mapped; and (3) the Assembly-based set, consisting of D. melanogaster genes to which Trinity DE genes mapped. An FDR threshold of 0.01 was used for all three approaches.
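
The marker points on the precision-recall curves in Figs. 1 to 3 each correspond to one FDR threshold. The sketch below shows how such a point could be computed, assuming a dictionary of adjusted p-values per gene and a known set of truly DE genes (as is available for simulated data). The function name `precision_recall`, the input format, and the toy values are hypothetical and are not taken from the paper.

```python
def precision_recall(padj_by_gene, true_de_genes, fdr_threshold):
    """Return (precision, recall) of DE calls at one FDR threshold.

    `padj_by_gene` maps gene ID -> adjusted p-value (None if not tested).
    `true_de_genes` is the set of genes that are truly DE.
    Both inputs are hypothetical formats chosen for this illustration.
    """
    # A gene is called DE when its adjusted p-value is below the threshold.
    called = {
        gene
        for gene, padj in padj_by_gene.items()
        if padj is not None and padj < fdr_threshold
    }
    true_positives = len(called & true_de_genes)
    # Convention used here: precision is 1.0 when no gene is called.
    precision = true_positives / len(called) if called else 1.0
    recall = true_positives / len(true_de_genes) if true_de_genes else 0.0
    return precision, recall


# Toy values, not data from the study.
padj = {"g1": 0.001, "g2": 0.03, "g3": 0.08, "g4": 0.5, "g5": 0.004}
truth = {"g1", "g2", "g4"}

# One (precision, recall) point per FDR threshold used in the figures.
points = {t: precision_recall(padj, truth, t) for t in (0.01, 0.05, 0.1)}
for threshold, (precision, recall) in points.items():
    print(threshold, round(precision, 3), round(recall, 3))
```

Loosening the threshold can only add called genes, so recall never decreases, while precision may move in either direction.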
