BMC Genomics. 2016 Jul 27:17:523.
doi: 10.1186/s12864-016-2923-8.

Comparative performance of transcriptome assembly methods for non-model organisms

Xin Huang et al. BMC Genomics. 2016.

Abstract

Background: The technological revolution in next-generation sequencing has brought unprecedented opportunities to study any organism of interest at the genomic or transcriptomic level. Transcriptome assembly is a crucial first step for studying the molecular basis of phenotypes of interest using RNA-Sequencing (RNA-Seq). However, the optimal strategy for assembling vast amounts of short RNA-Seq reads remains unresolved, especially for organisms without a sequenced genome. This study compared four transcriptome assembly methods: a widely used de novo assembler (Trinity), two transcriptome re-assembly strategies that use proteomic and genomic resources from closely related species (reference-based re-assembly and TransPS), and a genome-guided assembler (Cufflinks).

Results: These four assembly strategies were compared using a comprehensive transcriptomic database of Aedes albopictus, for which a genome sequence has recently been completed. The quality of the various assemblies was assessed by the number of contigs generated, contig length distribution, percent paired-end read mapping, and gene model representation via BLASTX. Our results reveal that de novo assembly generates a similar number of gene models relative to genome-guided assembly with a fragmented reference, but produces the highest level of redundancy and requires the most computational power. Using a closely related reference genome to guide transcriptome assembly can generate biased contig sequences. Increasing the number of reads used in the transcriptome assembly tends to increase the redundancy within the assembly and decrease both median contig length and percent identity between contigs and reference protein sequences.
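Two of the assessment metrics named above, the number of contigs and the contig length distribution, can be illustrated with a minimal sketch. The function names, the summary fields, and the toy FASTA records below are assumptions made for illustration; they are not the scripts or data used in the study.

```python
from statistics import median


def read_contig_lengths(fasta_lines):
    """Yield the length of each sequence found in FASTA-formatted lines."""
    length = None  # None until the first header line is seen
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarize_contigs(fasta_lines):
    """Return contig count plus min, median, and max contig length."""
    lengths = sorted(read_contig_lengths(fasta_lines))
    if not lengths:
        return {"contigs": 0, "min_length": None,
                "median_length": None, "max_length": None}
    return {
        "contigs": len(lengths),
        "min_length": lengths[0],
        "median_length": median(lengths),
        "max_length": lengths[-1],
    }


# Hypothetical toy assembly with three contigs (12, 5, and 2 bases long).
example = """>contig_1
ACGTACGT
ACGT
>contig_2
ACGTA
>contig_3
AC
""".splitlines()

print(summarize_contigs(example))
```

Running the same summary on assemblies built from different read depths would be one way to look for the reported trend of decreasing median contig length, though a real comparison would also need the read-mapping and BLASTX-based metrics described above.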

Conclusions: This study provides general guidance for transcriptome assembly of RNA-Seq data from organisms with or without a sequenced genome. The optimal transcriptome assembly strategy will depend upon the subsequent downstream analyses. However, our results emphasize the efficacy of de novo assembly, which can be as effective as genome-guided assembly when the reference genome assembly is fragmented. If a genome assembly and sufficient computational resources are available, it can be beneficial to combine de novo and genome-guided assemblies. Caution should be taken when using a closely related reference genome to guide transcriptome assembly. The quantity of read pairs used in the transcriptome assembly does not necessarily correlate with the quality of the assembly.

Keywords: Aedes albopictus; De novo assembly; Genome-guided assembly; Next-generation sequencing; Non-model organisms; Reference-based re-assembly; TransPS; Transcriptome assembly.

Figures

Fig. 1
Flow chart for the experimental design of this study. Three datasets, 180 M, 360 M and 600 M, were used for each of the four assembly strategies described in the text. Subsequently, all fourteen assemblies were assessed by the metrics contig number and length distribution, percent paired-end reads mapped back to the assembly and gene model representation. De novo stands for de novo assembly, Ref-based for reference-based re-assembly, TransPS for transcriptome post-scaffolding, and G. guided for genome-guided assembly.
Fig. 2
Percentage of paired-end reads mapping back to the assembly. Datasets (180 M, 360 M, 600 M) and assembly strategies are as described in Fig. 1. GG/Albo refers to genome-guided assembly using the Ae. albopictus reference genome, and GG/Aeg refers to genome-guided assembly using the Ae. aegypti reference genome. +RA refers to genome-guided assembly using the Ae. aegypti genome with reference annotation, and -RA refers to genome-guided assembly using the Ae. aegypti genome without reference annotation.
Fig. 3
Number of gene models identified from the Ae. aegypti reference protein set in all assemblies. Datasets and assembly strategies as in Fig. 2.
Fig. 4
Intersection of Ae. aegypti gene models identified by all assembly strategies using the 180 M dataset. Assembly strategies as in Fig. 2, except that G.guided refers to genome-guided assembly using the Ae. albopictus reference genome.
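
The kind of intersection shown in Fig. 4 can be sketched with plain set operations. This is a hedged illustration only: the function name, the strategy labels used as dictionary keys, and the gene model identifiers are hypothetical placeholders, not values from the study.

```python
def shared_and_unique(models_by_strategy):
    """Return gene models found by every strategy and those unique to each."""
    sets = {name: set(ids) for name, ids in models_by_strategy.items()}
    # Gene models recovered by all strategies.
    shared = set.intersection(*sets.values()) if sets else set()
    unique = {}
    for name, ids in sets.items():
        # Union of gene models found by every other strategy.
        others = set().union(*(s for n, s in sets.items() if n != name))
        unique[name] = ids - others
    return shared, unique


# Hypothetical gene model identifiers for four assembly strategies.
hits = {
    "De novo": ["geneA", "geneB", "geneC"],
    "Ref-based": ["geneA", "geneB"],
    "TransPS": ["geneA", "geneB", "geneD"],
    "G.guided": ["geneA", "geneE"],
}

shared, unique = shared_and_unique(hits)
print(sorted(shared))
print({name: sorted(ids) for name, ids in unique.items()})
```

In practice the identifier lists would come from the top BLASTX hit of each contig against the reference protein set, which this sketch assumes has already been computed.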
