Challenges and advances for transcriptome assembly in non-model species

Arnaud Ungaro¹, Nicolas Pech¹, Jean-François Martin², R J Scott McCairns¹,³, Jean-Philippe Mévy⁴, Rémi Chappaz¹, André Gilles¹

Affiliations

¹ UMR 7263, Équipe Évolution Génome Environnement, Aix Marseille Université, CNRS, IRD, IMBE, Marseille, France.
² UMR Centre de Biologie pour la Gestion des Populations, Montpellier SupAgro, Montferrier-sur-Lez, France.
³ ESE, Ecology and Ecosystem Health, INRA, Agrocampus Ouest, Rennes, France.
⁴ UMR 7263, Équipe Diversité Fonctionnement: des molécules aux écosystèmes, Aix Marseille Université, CNRS, IRD, IMBE, Marseille, France.

PMID: 28931057
PMCID: PMC5607178
DOI: 10.1371/journal.pone.0185020

Arnaud Ungaro et al. PLoS One. 2017 Sep 20;12(9):e0185020. doi: 10.1371/journal.pone.0185020. eCollection 2017.

Abstract

Analyses of high-throughput transcriptome sequences of non-model organisms are based on two main approaches: de novo assembly, and genome-guided assembly, which uses mapping to assign reads prior to assembly. Mapping reads to a reference performs poorly when the reference is highly divergent, as is frequently the case for non-model species, so we evaluate whether blastn outperforms mapping methods for read assignment in such situations (>15% divergence). We demonstrate its high performance using simulated reads of lengths corresponding to those generated by the most common sequencing platforms, over a realistic range of genetic divergence (0% to 30%). Here we focus on gene identification and not on resolving the whole set of transcripts (i.e. the complete transcriptome). For simulated datasets, the transcriptome-guided assembly based on blastn recovers 94.8% of genes irrespective of read length at 0% divergence; however, the assignment rate of reads decreases as divergence increases and as read length decreases. Nevertheless, we still recover 92.6% of genes at 30% divergence irrespective of read length. This analysis also produces a categorization of genes relative to their assignment, and suggests guidelines for data processing prior to analyses of comparative transcriptomics and gene expression, to minimize potential inferential bias associated with incorrect transcript assignment. We also compare the performance of de novo assembly alone vs in combination with a transcriptome-guided assembly based on blastn, both via simulation and empirically, using data from a cyprinid fish species and from an oak species. For every simulated scenario, the transcriptome-guided assembly using blastn outperforms the de novo approach alone, including when the divergence level is beyond the reach of traditional mapping methods. Combining de novo assembly and a related reference transcriptome for read assignment also addresses the bias and errors in contigs caused by dependence on a related reference alone. Empirical data corroborate these findings when assembling transcriptomes from the two non-model organisms: Parachondrostoma toxostoma (fish) and Quercus pubescens (plant). For the fish species, out of the 31,944 genes known from Danio rerio, the guided and de novo assemblies recover 20,605 and 20,032 genes respectively, but the guided assembly performs much better on both the contiguity and completeness metrics. For the oak, out of the 29,971 genes known from Vitis vinifera, the transcriptome-guided and de novo assemblies display similar performance on these metrics, but the new guided approach detects 16,326 genes where the de novo assembly detects only 9,385.
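The read assignment step described in the abstract can be illustrated with a minimal sketch. It assumes blastn was run with the standard 12-column tabular output (`-outfmt 6`) and that each read is assigned to its single best hit by bit score. The function names, the e-value cutoff of 1e-5 and the best-hit rule are illustrative assumptions, not the authors' published settings.

```python
from collections import defaultdict

# Column order of the standard BLAST tabular output (-outfmt 6).
FIELDS = ("qseqid sseqid pident length mismatch gapopen "
          "qstart qend sstart send evalue bitscore").split()


def assign_reads(blast_lines, max_evalue=1e-5):
    """Assign each read to the subject sequence of its highest bit-score hit.

    max_evalue is a hypothetical cutoff chosen for illustration only.
    """
    best = {}  # read id -> (subject id, bit score)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(FIELDS, line.split("\t")))
        if float(row["evalue"]) > max_evalue:
            continue  # hit too weak to support an assignment
        read, score = row["qseqid"], float(row["bitscore"])
        if read not in best or score > best[read][1]:
            best[read] = (row["sseqid"], score)
    return {read: subject for read, (subject, _) in best.items()}


def group_by_gene(assignment):
    """Collect the reads assigned to each gene, ready for per-gene assembly."""
    groups = defaultdict(list)
    for read, gene in assignment.items():
        groups[gene].append(read)
    return dict(groups)


def fraction_correct(assignment, truth):
    """For simulated reads with known origin: share of reads assigned correctly."""
    if not truth:
        return 0.0
    hits = sum(1 for read, gene in truth.items() if assignment.get(read) == gene)
    return hits / len(truth)
```

With simulated reads the gene of origin is known, so `fraction_correct` gives a simple way to see how assignment degrades as divergence grows or reads get shorter. Reads left unassigned count as incorrect in this sketch.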

Conflict of interest statement

Competing Interests: AU was supported by Electricité de France, a commercial company. This does not alter our adherence to PLOS ONE policies on sharing data and materials.

Figures

**Fig 1. Transcriptome-guided assembly pipeline.**
We present the pipeline for transcriptome assembly and contig assignment. (A) The transcriptome-guided assembly *per se* (green background) combines a **read assignment step** based on blastn, making use of a local database merging the related reference transcriptome and *de novo* assembly of the query transcriptome. (B) The *de novo* assembly (red background) with a blastn annotation of generated contigs to obtain a *de novo* assembly for the inferred transcriptome. Cylinders represent the database, rectangles represent analytical steps in the processes, and parallelograms correspond to results. Note that the dashed lines represent optional steps in the pipelines.

**Fig 2. Predicted recovery rate (rr) from the mixed logistic model as a function of gene length, read length and divergence between target and reference transcriptomes.**
Red lines denote 0% divergence, green 5%, blue 15% and purple 30%. Solid lines correspond to the median of predictions, conditioned on random variation among genes, with 80% prediction intervals indicated by dashed lines. Read length increases downward, and panels to the right represent a magnified view of the upper 5th quantile of rr scores to better visualize differences between low-divergence sequences: 100 base reads (A & B); 150 bases (C & D); 200 bases (E & F); 350 bases (G & H).

**Fig 3. Proportion of gene types recovered in divergent simulations by size-class of gene.**
(A) Histogram of gene lengths for the *Danio rerio* transcriptome used for simulating RNA-seq reads. (B) Number of genes from A, by size-class in subsequent windows. 100 base reads are plotted with 0% divergence (C) and 30% divergence (D); 350 base reads with 0% divergence (E) and 30% divergence (F). Gene types are described in the legend.

**Fig 4. Efficiency and performance of *de novo* and transcriptome-guided assembly based on blastn by read length and divergence level.**
(A) Number of identified genes; (B) completeness; (C) contiguity. Divergence between target and reference transcriptomes is described in the figure legend.
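The completeness and contiguity metrics are not defined on this page. As a rough sketch, the code below assumes the commonly used per-gene definitions: completeness is the fraction of a reference gene covered by all assembled contigs together, and contiguity is the fraction covered by the single longest contig alignment. The function names and the half-open interval convention are hypothetical choices for illustration.

```python
def merged_length(intervals):
    """Total length covered by the union of half-open intervals [start, end)."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def coverage_scores(gene_length, contig_intervals):
    """Return (completeness, contiguity) for one reference gene.

    contig_intervals holds (start, end) alignment coordinates of contigs
    on the reference gene, 0-based and half-open (an assumed convention).
    """
    if gene_length <= 0:
        raise ValueError("gene_length must be positive")
    # Clip alignments to the gene and drop empty ones.
    clipped = [(max(0, s), min(gene_length, e))
               for s, e in contig_intervals
               if min(gene_length, e) > max(0, s)]
    if not clipped:
        return 0.0, 0.0
    completeness = merged_length(clipped) / gene_length
    contiguity = max(e - s for s, e in clipped) / gene_length
    return completeness, contiguity
```

Under these assumed definitions, contiguity can never exceed completeness, and a gene counts as perfectly recovered only when one contig spans its full length, giving a score of 1.0 on both.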

**Fig 5. Nonparametric estimation of gene density as a function of contiguity (x-axis) and completeness score (y-axis) obtained from *de novo* and transcriptome-guided assembly pipelines for 0% divergence.**
Colors increasing from yellow to dark red denote increasing gene densities. For clarity of visualization, only the non-perfect fraction of genes is displayed. 100 base reads assembled *de novo* (A) and with guided assembly (B); 200 base reads assembled *de novo* (C) and with guided assembly (D). The proportion of non-perfect genes is indicated at the top of each panel.
