Genome Res. 2021 Apr;31(4):713-720.
doi: 10.1101/gr.269894.120. Epub 2021 Mar 17.

Ultrafast functional profiling of RNA-seq data for nonmodel organisms

Peng Liu et al. Genome Res. 2021 Apr.

Abstract

Computational time and cost remain a major bottleneck for RNA-seq data analysis of nonmodel organisms without reference genomes. To address this challenge, we have developed Seq2Fun, a novel, all-in-one, ultrafast tool to directly perform functional quantification of RNA-seq reads without transcriptome de novo assembly. The pipeline starts with raw read quality control: sequencing error correction, removing poly(A) tails, and joining overlapped paired-end reads. It then conducts a DNA-to-protein search by translating each read into all possible amino acid fragments and subsequently identifies possible homologous sequences in a well-curated protein database. Finally, the pipeline generates several informative outputs including gene abundance tables, pathway and species hit tables, an HTML report to visualize the results, and an output of clean reads annotated with mapped genes ready for downstream analysis. Seq2Fun does not have any intermediate steps of file writing and loading, making I/O very efficient. Seq2Fun is written in C++ and can run on a personal computer with a limited number of CPUs and memory. It can process >2,000,000 reads/min and is >120 times faster than conventional workflows based on de novo assembly, while maintaining high accuracy in our various test data sets.
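The translation step described in the abstract (turning each read into all possible amino acid fragments) can be illustrated with a six-frame translation. The following is a minimal Python sketch, not Seq2Fun's own C++ implementation: it assumes the standard genetic code, splits each frame at stop codons, and returns fragments longest first. The function names and the `min_len` parameter are hypothetical and chosen only for illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered by T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_fragments(read, min_len=1):
    """Translate a read in all six frames and split each frame at stop codons.

    min_len is a hypothetical filter (not taken from the paper) that drops
    fragments too short to be worth searching.
    """
    read = read.upper()
    fragments = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            protein = translate(strand[offset:])
            fragments.extend(f for f in protein.split("*") if len(f) >= min_len)
    # Longest fragments first, since those are the most informative for a search.
    return sorted(fragments, key=len, reverse=True)


if __name__ == "__main__":
    for fragment in six_frame_fragments("ATGGCCTAA"):
        print(fragment)
```

In a real pipeline the fragments would then be passed to a protein database search; here they are simply printed.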

Figures

Figure 1.
Overview of the Seq2Fun workflow. Seq2Fun accepts raw RNA-seq reads and generates various expression count tables. There are three main phases: quality control, translated search, and expression quantification. Seq2Fun starts by loading a read pack (n = 10,000 raw RNA-seq reads), followed by trimming, adaptor and poly(A) tail removal, merging of overlapping paired-end reads, and sequencing error correction. Cleaned reads are translated into all possible amino acid sequences, and the longest fragments are searched against an FM-index-based protein database to identify the most likely functional homologs, using either maximum exact match (MEM) or Greedy mode. Each matched read is assigned one or more protein IDs, each protein ID is mapped to its KEGG ortholog ID, and the counts for each KEGG ortholog are summed to produce a KEGG ortholog abundance table, pathway hit table, species hit table, and KEGG ortholog reads table. An HTML report is also generated to summarize and visualize read quality and the result tables. Cleaned reads labeled with mapped KEGG orthologs are also retrieved.
Figure 2.
Significant pathways identified from RNA-seq data of double-crested cormorant (DCCO). The pathways are visualized by ridgeline plots. The distribution for each pathway is colored according to the pathway's adjusted P-value. The vertical gray lines indicate the log2FC values of all genes in the enriched pathways.
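The quantification phase outlined in the Figure 1 caption (assigning protein IDs to reads, mapping them to KEGG orthologs, and summing per ortholog) can be sketched as a simple counting step. The Python below is an illustrative assumption, not Seq2Fun's actual logic: the function name, the input layout, and the rule of skipping reads whose hits map to more than one ortholog are all hypothetical, and the protein IDs in the example are invented.

```python
from collections import Counter


def ko_abundance(read_hits, protein_to_ko):
    """Count reads per KEGG ortholog (KO).

    read_hits: dict mapping read ID -> list of matched protein IDs.
    protein_to_ko: dict mapping protein ID -> KO ID.

    Assumption for this sketch: a read is counted only when all of its
    mapped proteins agree on a single KO; ambiguous and unmapped reads
    are skipped. The real tool may resolve such reads differently.
    """
    counts = Counter()
    for protein_ids in read_hits.values():
        kos = {protein_to_ko[p] for p in protein_ids if p in protein_to_ko}
        if len(kos) == 1:
            counts[kos.pop()] += 1
    return counts


if __name__ == "__main__":
    # Invented example data; protein IDs are placeholders.
    hits = {
        "read1": ["protA"],
        "read2": ["protA", "protB"],
        "read3": ["protC"],
        "read4": ["protA", "protC"],  # ambiguous: two different KOs
        "read5": ["protZ"],           # no KO mapping available
    }
    mapping = {"protA": "K00001", "protB": "K00001", "protC": "K00002"}
    for ko, n in sorted(ko_abundance(hits, mapping).items()):
        print(ko, n)
```

The resulting table has one row per ortholog, which is the shape a downstream differential expression or pathway analysis would expect.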
