Review
Brief Bioinform. 2022 Mar 10;23(2):bbab563.
doi: 10.1093/bib/bbab563.

A simple guide to de novo transcriptome assembly and annotation

Venket Raghavan et al. Brief Bioinform.

Abstract

A transcriptome constructed from short-read RNA sequencing (RNA-seq) is an easily attainable proxy catalog of protein-coding genes when genome assembly is unnecessary, expensive or difficult. In the absence of a sequenced genome to guide the reconstruction process, the transcriptome must be assembled de novo using only the information available in the RNA-seq reads. Subsequently, the sequences must be annotated in order to identify sequence-intrinsic and evolutionary features in them (for example, protein-coding regions). Although straightforward at first glance, de novo transcriptome assembly and annotation can quickly prove to be challenging undertakings. In addition to familiarizing themselves with the conceptual and technical intricacies of the tasks at hand and the numerous pre- and post-processing steps involved, those interested must also grapple with an overwhelmingly large choice of tools. The lack of standardized workflows, fast pace of development of new tools and techniques and paucity of authoritative literature have served to exacerbate the difficulty of the task even further. Here, we present a comprehensive overview of de novo transcriptome assembly and annotation. We discuss the procedures involved, including pre- and post-processing steps, and present a compendium of corresponding tools.

Keywords: RNA-seq; annotation; assembly; de novo; tools; transcriptome.

Figures

Figure 1
Assembly and annotation workflow. (A) Quality control of the raw reads by filtering for erroneous reads and sequencing artifacts. (B) Sequence assembly including clustering into groups of isoforms and removing redundant sequences (isoforms are transcript variants arising from alternative splicing). (C) Mapping the raw reads to the assembled sequences for either quality control of the assembly or for differential expression analysis. (D) Applying statistical tests for identification of changes in expression levels. (E) Classifying sequences by RNA species and translating into protein sequences before annotation. (F) Annotating sequences on the basis of sequence similarity, identifying sequence features (such as functional domains) and annotating Gene Ontology terms.
Figure 2
Short-read quality control and data cleansing involve procedures such as adapter trimming, removing short reads and erroneous reads containing N-bases, read correction by comparison to other reads, and excluding reads originating from contaminant sources (e.g. pathogens in a host species). In silico read normalization can be a useful pre-processing step for very large data sets (>200M reads) where it can significantly improve assembler performance by selectively reducing the reads in a manner such that the transcriptomic complexity of the original data set is retained.
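The filtering steps named in this caption can be illustrated with a minimal Python sketch. It is not how any particular quality-control tool works: it assumes reads are plain sequence strings, trims only exact adapter matches, and ignores base-quality scores, read correction and contaminant screening. The function names, the adapter string and the `min_len` threshold are hypothetical choices made for illustration.

```python
def trim_adapter(seq, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any.

    Assumption: exact matching only; real trimmers also handle partial
    and mismatched adapter sequences.
    """
    idx = seq.find(adapter)
    return seq if idx == -1 else seq[:idx]


def clean_reads(reads, adapter, min_len=30):
    """Return the reads that survive trimming, length and N-base filters."""
    kept = []
    for seq in reads:
        seq = trim_adapter(seq.upper(), adapter)
        if len(seq) < min_len:
            continue  # too short after adapter trimming
        if "N" in seq:
            continue  # contains an undetermined base
        kept.append(seq)
    return kept


# Hypothetical toy input; the adapter string is illustrative only.
ADAPTER = "AGATCGGAAGAGC"
toy_reads = [
    "ACGTACGT" + ADAPTER,  # adapter is trimmed, remainder is kept
    "ACG" + ADAPTER,       # too short once the adapter is removed
    "ACGTNACGTA",          # contains an N-base
    "acgtacgtac",          # lower case, no adapter, kept
]
cleaned = clean_reads(toy_reads, ADAPTER, min_len=5)
print(cleaned)
```

With these toy reads, only the first and last entries pass all three filters.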
Figure 3
A typical graph-based approach to de novo transcriptome assembly. The basic idea is to establish a catalog of sub-strings from the RNA-seq reads and compose these into a graph (or set of graphs) in which sub-strings are connected if they overlap. This establishes paths through the graph(s) that correspond to the transcripts from which the reads (potentially) originated. (A) Short nucleotide reads 50–250 nt in length are the inputs for the assembly process. If paired-end reads are supplied, the respective mates are merged into a single contiguous read prior to assembly. Highlighted here is a 6 nt portion of a single read (CGTTAG). (B) For each read, all possible sub-sequences of length k are generated (k-mers). The 4-mers (k = 4) originating from the 6 nt fragment from the previous step are indicated here as examples. (C) Subsequently, each k-mer becomes a node (also called a vertex) in the graph, and an edge is established between any two nodes that share an overlap of k − 1 nucleotides. As a simplified example, the edge connecting the first and second 4-mers from the previous step is highlighted here as part of a de Bruijn graph. (D) Finally, different paths through the graph(s) are traversed and recovered as independent sequences. Not all paths through the graph are recovered; the subset of paths that represent valid transcripts is determined algorithmically.
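Steps (B) and (C) of this caption can be reproduced on the highlighted fragment with a short Python sketch. It is a toy illustration under stated assumptions, not the algorithm of any specific assembler: edges are taken to be directed, from a k-mer to any k-mer whose first k − 1 bases equal its last k − 1 bases, and the function names `kmers` and `build_graph` are hypothetical. Path traversal (step D) and sequencing-error handling are left out.

```python
from collections import defaultdict


def kmers(read, k):
    """Return every length-k sub-sequence of the read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Map each k-mer node to the set of k-mers it overlaps by k - 1 bases.

    Assumption: an edge a -> b exists when the last k - 1 bases of a
    equal the first k - 1 bases of b.
    """
    nodes = {km for read in reads for km in kmers(read, k)}
    # Index nodes by their (k - 1)-base prefix for fast edge lookup.
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:-1]].add(node)
    return {node: set(by_prefix.get(node[1:], ())) for node in nodes}


fragment = "CGTTAG"  # the 6 nt portion highlighted in panel (A)
four_mers = kmers(fragment, 4)
graph = build_graph([fragment], 4)

print(four_mers)
for node in four_mers:
    print(node, "->", sorted(graph[node]))
```

For this fragment the 4-mers are CGTT, GTTA and TTAG. The edge from CGTT to GTTA is the one highlighted in panel (C): both share the 3 nt overlap GTT.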
Figure 4
Transcriptome functional annotation comprises techniques to assign human-comprehensible identifiers and functional characteristics to the transcripts. It includes searching for homologs based on sequence similarity in order to identify the assembled sequences (homology transfer), identifying domains and other sequence features (sequence feature annotation) and assigning standardized descriptors for the sequences’ biological properties (Gene Ontology terms).
