Ultrafast functional profiling of RNA-seq data for nonmodel organisms

Peng Liu¹, Jessica Ewald¹, Jose Hector Galvez²,³, Jessica Head¹, Doug Crump⁴, Guillaume Bourque²,³, Niladri Basu¹, Jianguo Xia¹,²

Affiliations

¹ Faculty of Agricultural and Environmental Sciences, McGill University, Montreal, Quebec H9X 3V9, Canada.
² Department of Human Genetics, McGill University, Montreal, Quebec H3A 0C7, Canada.
³ Canadian Center for Computational Genomics, McGill University, Montreal, Quebec H3A 0G1, Canada.
⁴ Environment and Climate Change Canada, National Wildlife Research Centre, Ottawa, Ontario K1A 0H3, Canada.

PMID: 33731361
PMCID: PMC8015844
DOI: 10.1101/gr.269894.120

Peng Liu et al. Genome Res. 2021 Apr;31(4):713-720. doi: 10.1101/gr.269894.120. Epub 2021 Mar 17.

Abstract

Computational time and cost remain a major bottleneck for RNA-seq data analysis of nonmodel organisms without reference genomes. To address this challenge, we have developed Seq2Fun, a novel, all-in-one, ultrafast tool to directly perform functional quantification of RNA-seq reads without transcriptome de novo assembly. The pipeline starts with raw read quality control: sequencing error correction, removing poly(A) tails, and joining overlapped paired-end reads. It then conducts a DNA-to-protein search by translating each read into all possible amino acid fragments and subsequently identifies possible homologous sequences in a well-curated protein database. Finally, the pipeline generates several informative outputs including gene abundance tables, pathway and species hit tables, an HTML report to visualize the results, and an output of clean reads annotated with mapped genes ready for downstream analysis. Seq2Fun does not have any intermediate steps of file writing and loading, making I/O very efficient. Seq2Fun is written in C++ and can run on a personal computer with a limited number of CPUs and memory. It can process >2,000,000 reads/min and is >120 times faster than conventional workflows based on de novo assembly, while maintaining high accuracy in our various test data sets.
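
The translation step described above can be illustrated with a minimal Python sketch. It is not Seq2Fun's C++ implementation: the function names (`six_frame_fragments`, `translate`, `reverse_complement`) and the `min_len` parameter are hypothetical, and the sketch assumes the standard genetic code, with each reading frame split at stop codons to give candidate amino acid fragments.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_fragments(read, min_len=1):
    """Translate a read in all six frames and split each frame at stop codons.

    Returns the amino acid fragments sorted longest first. The min_len
    cutoff is an illustrative assumption, not a documented Seq2Fun setting.
    """
    read = read.upper()
    fragments = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            protein = translate(strand[offset:])
            fragments.extend(f for f in protein.split("*") if len(f) >= min_len)
    return sorted(fragments, key=len, reverse=True)


if __name__ == "__main__":
    # Toy read, chosen only for illustration.
    for fragment in six_frame_fragments("ATGGCCTAA"):
        print(fragment)
```

Sorting the fragments by length makes it easy to pass only the longest candidates to a protein search, which is the general idea the abstract describes; the actual selection rules used by Seq2Fun are not stated here.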

Figures

**Figure 1.**
Overview of the Seq2Fun workflow. Seq2Fun accepts raw RNA-seq reads and generates various expression count tables. There are three main phases: quality control, translated search, and expression quantification. Seq2Fun starts by loading a read pack (n = 10,000 raw RNA-seq reads), followed by trimming, adaptor and poly(A) tail removal, merging of overlapped paired-end reads, and sequencing error correction. Cleaned reads are translated into all possible amino acid sequences, and the longest fragments are searched against an FM-index-based protein database to identify the most likely functional homologs, using either maximum exact match (MEM) or Greedy mode. Each matched read is assigned one or more protein IDs, each protein ID is mapped to its KEGG ortholog ID, and the counts for each KEGG ortholog are summed to produce a KEGG ortholog abundance table, pathway hit table, species hit table, and KEGG ortholog reads table. An HTML report is also generated to summarize and visualize read qualities and results tables. Cleaned reads labeled with mapped KEGG orthologs are also retrieved.
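
The quantification phase in this caption (read to protein ID, protein ID to KEGG ortholog, then summation) can be sketched as follows. This is an illustration under stated assumptions rather than Seq2Fun's own code: the function `ko_abundance`, the input dictionaries, and all read and protein identifiers are hypothetical, and the rule that a read counts once per distinct KEGG ortholog is an assumption about how multi-hit reads might be handled.

```python
from collections import Counter


def ko_abundance(read_hits, protein_to_ko):
    """Sum read counts per KEGG ortholog (KO).

    read_hits:     dict mapping a read ID to the protein IDs it matched.
    protein_to_ko: dict mapping a protein ID to its KO ID.

    Assumption: a read contributes one count to each distinct KO among its
    hits; proteins without a KO mapping are ignored.
    """
    counts = Counter()
    for protein_ids in read_hits.values():
        kos = {protein_to_ko[p] for p in protein_ids if p in protein_to_ko}
        counts.update(kos)
    return counts


if __name__ == "__main__":
    # Toy data: the read and protein IDs are invented, and the KO IDs
    # serve only as placeholder labels.
    read_hits = {
        "read1": ["protA"],
        "read2": ["protA", "protB"],  # both proteins map to the same KO
        "read3": ["protC"],
        "read4": ["protX"],           # no KO mapping, so not counted
    }
    protein_to_ko = {"protA": "K00001", "protB": "K00001", "protC": "K00002"}
    for ko, n in sorted(ko_abundance(read_hits, protein_to_ko).items()):
        print(ko, n)
```

Collapsing protein hits to a set of KOs before counting keeps a read that matches several proteins of the same ortholog from inflating that ortholog's abundance.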

**Figure 2.**
Significant pathways identified from RNA-seq data of double-crested cormorant (DCCO). The pathways are visualized by ridgeline plots. The distribution for each pathway is colored according to the pathway's adjusted P-value. The vertical gray lines indicate the log₂FC values of all genes in the enriched pathways.
