PLoS One. 2017 Sep 20;12(9):e0185020.
doi: 10.1371/journal.pone.0185020. eCollection 2017.

Challenges and advances for transcriptome assembly in non-model species

Arnaud Ungaro et al. PLoS One. 2017.

Abstract

Analyses of high-throughput transcriptome sequences of non-model organisms are based on two main approaches: de novo assembly and genome-guided assembly, which uses mapping to assign reads prior to assembly. Because mapping reads to a highly divergent reference has limits, and such divergence is frequent for non-model species, we evaluate whether blastn outperforms mapping methods for read assignment in these situations (>15% divergence). We demonstrate its high performance using simulated reads of lengths corresponding to those generated by the most common sequencing platforms, over a realistic range of genetic divergence (0% to 30%). Here we focus on gene identification and not on resolving the whole set of transcripts (i.e. the complete transcriptome). For simulated datasets, the transcriptome-guided assembly based on blastn recovers 94.8% of genes irrespective of read length at 0% divergence; however, the read assignment rate decreases with increasing divergence and with decreasing read length. Nevertheless, we still recover 92.6% of genes at 30% divergence irrespective of read length. This analysis also produces a categorization of genes relative to their assignment, and suggests guidelines for data processing prior to analyses of comparative transcriptomics and gene expression, to minimize potential inferential bias associated with incorrect transcript assignment. We also compare the performance of de novo assembly alone with that of de novo assembly combined with a transcriptome-guided assembly based on blastn, both via simulation and empirically, using data from a cyprinid fish species and from an oak species. In every simulated scenario, the transcriptome-guided assembly using blastn outperforms the de novo approach alone, including when the divergence level is beyond the reach of traditional mapping methods.
Combining de novo assembly and a related reference transcriptome for read assignment also addresses the bias/error in contigs caused by dependence on a related reference alone. Empirical data corroborate these findings when assembling transcriptomes from the two non-model organisms: Parachondrostoma toxostoma (fish) and Quercus pubescens (plant). For the fish species, out of the 31,944 genes known from Danio rerio, the guided and de novo assemblies recover 20,605 and 20,032 genes respectively, but the guided assembly performs much better on both the contiguity and completeness metrics. For the oak, out of the 29,971 genes known from Vitis vinifera, the transcriptome-guided and de novo assemblies display similar performance on these metrics, but the new guided approach detects 16,326 genes whereas the de novo assembly detects only 9,385.
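
The read assignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes blastn was run with the standard 12-column tabular output (-outfmt 6), and it assigns each read to the reference gene of its highest-bitscore hit. The function names, the e-value cutoff and the example hits are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def assign_reads(lines, max_evalue=1e-5):
    """Map each read to the subject of its best hit (highest bitscore).

    The e-value cutoff is a hypothetical default, not a value from the paper.
    """
    best = {}  # read id -> (subject id, bitscore)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(FIELDS, line.split("\t")))
        if float(rec["evalue"]) > max_evalue:
            continue  # hit too weak to support an assignment
        bits = float(rec["bitscore"])
        read = rec["qseqid"]
        if read not in best or bits > best[read][1]:
            best[read] = (rec["sseqid"], bits)
    return {read: gene for read, (gene, _) in best.items()}


def group_by_gene(assignment):
    """Collect the reads assigned to each gene, ready for per-gene assembly."""
    groups = defaultdict(list)
    for read, gene in assignment.items():
        groups[gene].append(read)
    return dict(groups)


# Made-up hits: read1 matches two genes, read3 only has a weak hit.
example = [
    "read1\tgeneA\t92.0\t100\t8\t0\t1\t100\t201\t300\t1e-30\t130",
    "read1\tgeneB\t85.0\t100\t15\t0\t1\t100\t11\t110\t1e-20\t95",
    "read2\tgeneB\t99.0\t100\t1\t0\t1\t100\t51\t150\t1e-45\t180",
    "read3\tgeneC\t70.0\t40\t12\t0\t1\t40\t5\t44\t0.5\t30",
]
assignment = assign_reads(example)
groups = group_by_gene(assignment)
```

In this toy input, read1 goes to geneA (its stronger hit) and read3 stays unassigned because its only hit fails the cutoff. Reads left unassigned in this way are the ones a de novo assembly would still need to handle.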

Conflict of interest statement

Competing Interests: AU was supported by Electricité de France, a commercial company. This does not alter our adherence to PLOS ONE policies on sharing data and materials.

Figures

Fig 1. Transcriptome-guided assembly pipeline.
We present the pipeline for transcriptome assembly and contig assignment. (A) The transcriptome-guided assembly per se (green background) combines a read assignment step based on blastn, making use of a local database merging the related reference transcriptome and de novo assembly of the query transcriptome. (B) The de novo assembly (red background) with a blastn annotation of generated contigs to obtain a de novo assembly for the inferred transcriptome. Cylinders represent the database, rectangles represent analytical steps in the processes, and parallelograms correspond to results. Note that the dashed lines represent optional steps in the pipelines.
Fig 2. Predicted recovery rate (rr) using the mixed logistic model as a function of gene length, read length and divergence between target and reference transcriptomes: red lines denote 0% divergence; green 5%; blue 15%; purple 30%.
Solid lines correspond to the median of predictions, conditioned on random variation among genes, with 80% prediction intervals indicated by dashed lines. Read length increases downward, and panels to the right show a magnified view of the upper 5th quantile of rr scores to better visualize differences between low-divergence sequences: 100 base reads (A & B); 150 bases (C & D); 200 bases (E & F); 350 bases (G & H).
Fig 3. Proportion of gene types recovered in divergent simulations by size-class of gene.
(A) Histogram of gene lengths for the Danio rerio transcriptome used for simulating RNA-seq reads. (B) Number of genes from A, by size-class in subsequent windows. 100 base reads are plotted with 0% divergence (C) and 30% divergence (D); 350 base reads with 0% divergence (E) and 30% divergence (F). Gene types are described in the legend.
Fig 4. Efficiency and performance of de novo and transcriptome-guided assembly based on blastn by read length and divergence level.
(A) Number of identified genes; (B) completeness; (C) contiguity. Divergence between target and reference transcriptomes is described in the figure legend.
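
The abstract and captions do not define completeness and contiguity. The sketch below assumes a commonly used pair of definitions: completeness is the fraction of a reference gene covered by all aligned contigs together, and contiguity is the fraction covered by the single best contig. The function names, the input layout and the example coordinates are hypothetical, and the paper's exact scoring may differ.

```python
def merged_length(intervals):
    """Number of bases covered by 1-based inclusive intervals, overlaps counted once."""
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        if end <= current_end:
            continue  # interval lies entirely inside bases already counted
        start = max(start, current_end + 1)  # trim the overlapping prefix
        covered += end - start + 1
        current_end = end
    return covered


def completeness(gene_length, contig_hits):
    """Fraction of the reference gene covered by any contig (assumed definition)."""
    all_intervals = [iv for ivs in contig_hits.values() for iv in ivs]
    return merged_length(all_intervals) / gene_length


def contiguity(gene_length, contig_hits):
    """Fraction of the reference gene covered by the single best contig (assumed definition)."""
    if not contig_hits:
        return 0.0
    return max(merged_length(ivs) for ivs in contig_hits.values()) / gene_length


# Made-up alignments of three contigs onto a 1,000-base reference gene.
hits = {"c1": [(1, 400)], "c2": [(301, 700)], "c3": [(900, 1000)]}
comp = completeness(1000, hits)   # bases 1-700 and 900-1000 are covered
cont = contiguity(1000, hits)     # the best single contig covers 400 bases
```

Under these assumed definitions, a gene assembled into several fragments scores high on completeness but low on contiguity, which is the kind of contrast panels B and C are meant to separate.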
Fig 5. Nonparametric estimation of gene density as a function of contiguity (x-axis) and completeness score (y-axis) obtained from de novo and transcriptome-guided assembly pipelines for 0% divergence.
Colors increasing from yellow to dark red denote increasing gene densities. For clarity of visualization, only the non-perfect fraction of genes is displayed. 100 base reads assembled de novo (A) and with guided assembly (B); 200 base reads assembled de novo (C) and with guided assembly (D). The proportion of non-perfect genes is indicated at the top of each panel.
