Optimization of de novo transcriptome assembly from next-generation sequencing data
- PMID: 20693479
- PMCID: PMC2945192
- DOI: 10.1101/gr.103846.109
Abstract
Transcriptome analysis has important applications in many biological fields. However, assembling a transcriptome without a known reference remains a challenging task requiring algorithmic improvements. We present two methods for substantially improving de novo transcriptome assembly. The first method relies on the observation that the use of a single k-mer length by current de novo assemblers is suboptimal for assembling transcriptomes in which the sequence coverage of transcripts is highly heterogeneous. We present the Multiple-k method, in which various k-mer lengths are used for de novo transcriptome assembly. We demonstrate its good performance by assembling de novo a published next-generation transcriptome sequence data set of Aedes aegypti, using the existing genome to check the accuracy of our method. The second method relies on the use of a reference proteome to improve the de novo assembly. We developed the Scaffolding using Translation Mapping (STM) method, which maps contigs against the closest available reference proteome and scaffolds those that map onto the same protein. In a controlled experiment using simulated data, we show that the STM method considerably improves the assembly, with few errors. We applied these two methods to assemble the transcriptome of the non-model catfish Loricaria gr. cataphracta. Using the Multiple-k and STM methods, the assembly increases in contiguity and in gene identification, showing that our methods clearly improve quality and can be widely used. The new methods successfully assembled the transcripts of the core set of genes regulating tooth development in vertebrates, whereas classic de novo assembly failed.
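The grouping step behind STM can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hit` record, the `scaffold_by_protein` function, the contig and protein names, and the coordinates are all hypothetical, and the sketch assumes that each contig has already been translated and mapped to at most one reference protein by some external aligner. It only shows the idea stated in the abstract, that contigs mapping onto the same protein are candidates for one scaffold, ordered here by their position along that protein.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One translated-contig-to-protein mapping (hypothetical record layout)."""
    contig: str   # contig identifier
    protein: str  # reference protein the translated contig maps onto
    start: int    # first matched residue on the protein (1-based)
    end: int      # last matched residue on the protein


def scaffold_by_protein(hits):
    """Group contigs by reference protein and order them along the protein.

    Assumption: each contig appears in at most one hit. Proteins hit by a
    single contig are skipped because there is nothing to scaffold.
    """
    by_protein = defaultdict(list)
    for hit in hits:
        by_protein[hit.protein].append(hit)

    scaffolds = {}
    for protein, group in by_protein.items():
        if len(group) < 2:
            continue
        # Order contigs by where they map on the protein (N- to C-terminus).
        ordered = sorted(group, key=lambda h: (h.start, h.end))
        scaffolds[protein] = [h.contig for h in ordered]
    return scaffolds


# Made-up example data: three contigs map onto protA, one onto protB.
hits = [
    Hit("contig_7", "protA", 210, 340),
    Hit("contig_2", "protA", 1, 120),
    Hit("contig_9", "protB", 15, 80),
    Hit("contig_4", "protA", 125, 205),
]
scaffolds = scaffold_by_protein(hits)
print(scaffolds)  # {'protA': ['contig_2', 'contig_4', 'contig_7']}
```

A real pipeline would also need to handle strand and reading frame, overlapping or conflicting hits, and paralogous proteins. The sketch leaves these out.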