A consensus-based ensemble approach to improve transcriptome assembly

Adam Voshall¹ ² ³, Sairam Behera² ⁴, Xiangjun Li⁵ ⁶, Xiao-Hong Yu⁷, Kushagra Kapil², Jitender S Deogun², John Shanklin⁸, Edgar B Cahoon⁵ ⁶, Etsuko N Moriyama⁹ ¹⁰

Affiliations

¹ School of Biological Sciences, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA.
² Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA.
³ Department of Pediatrics, Division of Genetics and Genomics, Boston Children's Hospital/Harvard Medical School, Boston, MA, 02115, USA.
⁴ Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA.
⁵ Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA.
⁶ Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA.
⁷ Department of Biochemistry and Cell Biology, Stony Brook University, Stony Brook, NY, 11794, USA.
⁸ Biology Department, Brookhaven National Laboratory, Upton, NY, 11973, USA.
⁹ School of Biological Sciences, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA. emoriyama2@unl.edu.
¹⁰ Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA. emoriyama2@unl.edu.

PMID: 34674629
PMCID: PMC8532302
DOI: 10.1186/s12859-021-04434-8

Adam Voshall et al. BMC Bioinformatics. 2021 Oct 21;22(1):513. doi: 10.1186/s12859-021-04434-8.

Abstract

Background: Systems-level analyses, such as differential gene expression analysis, co-expression analysis, and metabolic pathway reconstruction, depend on the accuracy of the transcriptome. Multiple tools exist to perform transcriptome assembly from RNAseq data. However, assembling high-quality transcriptomes is still not a trivial problem, especially for non-model organisms, where adequate reference genomes are often not available. Different methods produce different transcriptome models, and there is no easy way to determine which are more accurate. Furthermore, alternative splicing events exacerbate these assembly difficulties. While benchmarking transcriptome assemblies is critical, it is also not trivial due to the general lack of true reference transcriptomes.

Results: In this study, we first provide a pipeline to generate simulated benchmark transcriptomes and the corresponding RNAseq data. Using the simulated benchmarking datasets, we compared the performance of various transcriptome assembly approaches, including both de novo and genome-guided methods. The results showed that assembly performance deteriorates significantly when alternative transcripts (isoforms) exist or, for genome-guided methods, when a reference from the same genome is not available. To improve transcriptome assembly performance by leveraging the overlapping predictions between different assemblies, we present a new consensus-based ensemble transcriptome assembly approach, ConSemble.

Conclusions: Without using a reference genome, ConSemble using four de novo assemblers achieved an accuracy up to twice as high as any of the de novo assemblers we compared. When a reference genome is available, ConSemble using four genome-guided assemblies removed many incorrectly assembled contigs with minimal impact on correctly assembled contigs, achieving higher precision and accuracy than individual genome-guided methods. Furthermore, ConSemble using de novo assemblers matched or exceeded the best-performing genome-guided assemblers even when the transcriptomes included isoforms. We thus demonstrated that the ConSemble consensus strategy can improve transcriptome assembly for both de novo and genome-guided assemblers. The RNAseq simulation pipeline, the benchmark transcriptome datasets, and the script to perform the ConSemble assembly are all freely available from: http://bioinfolab.unl.edu/emlab/consemble/ .

Keywords: Benchmarking; De novo assembly; Ensemble assembly; Genome-guided assembly; Illumina; RNAseq; Simulation; Transcriptome assembly.
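
The consensus idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' ConSemble script: it assumes each assembly has already been reduced to a set of protein sequences (the figure captions state that contigs were compared at the protein level), that sequences are matched by exact string identity, and that a contig is kept when at least three of the four assemblers report it (a threshold inferred from the label "ConSemble3+d"). The function names `consensus_contigs` and `precision`, the `min_support` parameter, and the toy sequences are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def consensus_contigs(assemblies: Iterable[Set[str]], min_support: int = 3) -> Set[str]:
    """Return protein sequences reported by at least `min_support` assemblies.

    Assumption: sequences are compared by exact string match.
    """
    votes = Counter()
    for proteins in assemblies:
        # Each assembler contributes at most one vote per distinct sequence.
        votes.update(set(proteins))
    return {seq for seq, n in votes.items() if n >= min_support}


def precision(predicted: Set[str], benchmark: Set[str]) -> float:
    """Precision = TP / (TP + FP) against a benchmark transcriptome."""
    if not predicted:
        return 0.0
    tp = len(predicted & benchmark)   # correctly assembled contigs
    fp = len(predicted - benchmark)   # incorrectly assembled contigs
    return tp / (tp + fp)


# Toy data: four hypothetical assemblies and a hypothetical benchmark.
assemblies = [
    {"MKT", "MAL", "MXX"},
    {"MKT", "MAL"},
    {"MKT", "MAL", "MYY"},
    {"MKT", "MZZ"},
]
benchmark = {"MKT", "MAL", "MQQ"}

consensus = consensus_contigs(assemblies, min_support=3)
pooled = set().union(*assemblies)

print(sorted(consensus))             # sequences supported by 3+ assemblers
print(precision(consensus, benchmark))
print(precision(pooled, benchmark))  # pooling everything keeps the singletons
```

In this toy case the sequences reported by only one assembler are all absent from the benchmark, so requiring support from three assemblers removes them and raises precision relative to pooling every contig. Real assemblies will not separate this cleanly, and the threshold trades missed transcripts against false positives.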

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Numbers of assembled contigs shared between the four de novo assemblers. The numbers of correctly (black) and incorrectly (red) assembled contigs are shown. The three benchmark datasets (No0-NoAlt, Col0-Alt, and Human HG38) were assembled by the four de novo methods. Results based on individual methods using the default settings are shown under "Individual". Results based on comparing the "pooled unique" contig sets assembled using the four methods with multiple kmer lengths are shown under "Pooled". All contigs were compared at the protein level. The outlined region represents where the shared correct and incorrect contigs were counted for the ConSemble3+d assembly (shown as TP and FP in Additional file 2: Table S6)

**Fig. 2**
Comparison of de novo assembler performance on the three benchmark datasets. For the individual de novo assemblers, results shown were obtained with their default settings. See Additional file 2: Tables S3 and S6 for details

**Fig. 3**
Numbers of assembled contigs shared between the four genome-guided assemblers. Rows and columns are based on the simulated RNAseq dataset and the reference genome used for the transcriptome assembly, respectively. The numbers of correctly (black) and incorrectly (red) assembled contigs are shown. All contigs were compared at the protein level. The outlined region represents where the shared correct and incorrect contigs were counted for the ConSemble3+g assembly using the same reference genomes (shown as TP and FP in Additional file 2: Table S9)

**Fig. 4**
Numbers of assembled contigs shared between de novo and genome-guided assemblies. The "Merged" assemblies in Additional file 2: Table S5 were used for the de novo assembly datasets. The genome-guided assembly is the union set of the assemblies generated by the four genome-guided methods using the same reference genomes (Additional file 2: Tests 4, 6, and 8 in Table S2). The numbers of correctly (black) and incorrectly (red) assembled contigs are shown. All contigs were compared at the protein level

**Fig. 5**
Comparison of genome-guided assembler performance on the three benchmark datasets. See Additional file 2: Tables S7 and S9 for details
