PeerJ. 2017 Feb 16;5:e2988.
doi: 10.7717/peerj.2988. eCollection 2017.

Compacting and correcting Trinity and Oases RNA-Seq de novo assemblies

Cédric Cabau et al. PeerJ.

Abstract

Background: De novo transcriptome assembly of short reads is now a common step in expression analysis of organisms lacking a reference genome sequence, and several software packages are available to perform this task. Trinity and Oases are two commonly used de novo transcriptome assemblers. The contig sets they produce are of good quality, but their compaction (the number of contigs needed to represent the transcriptome) and their quality (chimera and nucleotide error rates) can still be improved, for example through redundancy reduction or error correction.

Results: We built a de novo RNA-Seq Assembly Pipeline (DRAP) which wraps these two assemblers (Trinity and Oases) in order to improve their results on the above-mentioned criteria. Depending on the read set and the assembler used, DRAP reduces the number of resulting contigs by 1.3- to 15-fold. This article presents seven assembly comparisons showing, in some cases, drastic improvements when using DRAP. DRAP does not significantly impair assembly quality metrics such as read realignment rate or protein reconstruction counts.
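
As a minimal sketch of how the compaction figures above can be read, the fold reduction is the contig count of the raw assembly divided by the contig count of the DRAP assembly. The function name and the contig counts below are hypothetical placeholders chosen for illustration; they are not values or code from the article.

```python
def fold_reduction(raw_contigs: int, drap_contigs: int) -> float:
    """Return how many times fewer contigs the compacted assembly contains."""
    if raw_contigs <= 0 or drap_contigs <= 0:
        raise ValueError("contig counts must be positive")
    return raw_contigs / drap_contigs


# Hypothetical counts, picked only to illustrate the 1.3- and 15-fold extremes.
examples = {
    "mild compaction": (130_000, 100_000),
    "strong compaction": (600_000, 40_000),
}

for label, (raw, drap) in examples.items():
    print(f"{label}: {fold_reduction(raw, drap):.1f}-fold fewer contigs")
```

A fold value close to 1 means the assembly was barely compacted, while larger values mean that many fewer contigs are needed to represent the same transcriptome.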

Conclusion: Transcriptome assembly is a challenging computational task. Even though good solutions are already available to end-users, they can still be improved while conserving the overall representation and quality of the assembly. The de novo RNA-Seq Assembly Pipeline (DRAP) is an easy-to-use software package that produces compact and corrected transcript sets. DRAP is free, open-source and available under the GPL V3 license at http://www.sigenae.org/drap.

Keywords: Compaction; Correction; De novo assembly; Quality assessment; RNA-Seq.

Conflict of interest statement

The authors declare there are no competing interests.

Figures

Figure 1. Steps in runDRAP workflow.
This workflow is used to produce an assembly from one sample/tissue/development stage. It takes as input R1 from single-end sequencing, or R1 and R2 from paired-end sequencing, and optionally a reference protein set from the closest species with known proteins.
Figure 2. Steps in runMeta workflow.
This workflow is used to produce a merged assembly from the runDRAP outputs for several samples/tissues/development stages. Inputs are runDRAP output folders and optionally a reference protein set.
Figure 3. Steps in runAssessment workflow.
This workflow is used to evaluate the quality of one assembly or to compare several assemblies produced from the same dataset. Inputs are the assembly or assemblies, R1 and optionally R2, and a reference protein set.
Figure 4. Number of contigs.
The figure shows for the different assemblers (Oases, DRAP Oases, Trinity, DRAP Trinity) the number of contigs produced for each dataset.
Figure 5. Consensus error rates.
(A) presents the ratio of the global error rates between raw and DRAP assemblies for each dataset (data from Table 2, column 12). (B), (C) and (D) present the ratio of the error rates for substitutions, insertions and deletions, respectively, between raw and DRAP assemblies for each dataset (data from Table S2).
Figure 6. Reads re-alignment rates.
(A) and (B) show the alignment rates for reads and read pairs, respectively, for the four assemblies of each dataset.
Figure 7. Proteins realignment rates.
The figure shows the number of proteins which have been aligned on the contig sets with more than 80% identity and 80% coverage for each assembler and dataset.
Figure 8. TransRate scores.
The figure presents the TransRate scores of the four assemblers for each dataset.
Figure 9. Gene reconstruction versus expression depth using simulated reads.
The figure presents the proportion of correctly built transcripts (method presented in Data S1 section “Contig validation using exon re-alignment and order checking”) versus the read count per transcript.
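
The protein count described in the Figure 7 caption (proteins aligned on the contig sets with more than 80% identity and 80% coverage) can be sketched as a simple filter. This is an illustrative assumption rather than the authors' implementation: the `ProteinHit` record, its field names and the sample hits are hypothetical, and coverage is assumed here to mean the percentage of the protein length covered by the alignment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinHit:
    """Hypothetical record for one protein-to-contig alignment."""
    protein_id: str
    contig_id: str
    identity_pct: float  # percent identity of the alignment
    coverage_pct: float  # percent of the protein length covered (assumed)


def count_aligned_proteins(hits, min_identity=80.0, min_coverage=80.0):
    """Count distinct proteins with at least one hit above both thresholds."""
    passing = {
        hit.protein_id
        for hit in hits
        if hit.identity_pct > min_identity and hit.coverage_pct > min_coverage
    }
    return len(passing)


# Made-up hits for illustration only.
hits = [
    ProteinHit("P1", "contig_1", 95.0, 90.0),   # passes both thresholds
    ProteinHit("P1", "contig_2", 85.0, 99.0),   # same protein, counted once
    ProteinHit("P2", "contig_3", 92.0, 60.0),   # coverage too low
    ProteinHit("P3", "contig_4", 80.0, 100.0),  # identity not above 80%
]

print(count_aligned_proteins(hits))
```

Counting distinct protein identifiers rather than hits keeps redundant contigs from inflating the number, which matters when comparing a raw assembly with a compacted one.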
