Ecol Evol. 2019 Aug 17;9(18):10513-10521.
doi: 10.1002/ece3.5571. eCollection 2019 Sep.

The Bellerophon pipeline, improving de novo transcriptomes and removing chimeras

Jesse Kerkvliet et al. Ecol Evol. 2019.

Abstract

Transcriptome quality control is an important step in RNA-Seq experiments. However, the quality of de novo assembled transcriptomes is difficult to assess because there is no reference genome to compare the assembly to. We developed a method to assess and improve the quality of de novo assembled transcriptomes by focusing on the removal of chimeric sequences. These chimeric sequences can result from incorrectly assembled contigs that merge two transcripts into one. The method is incorporated into a pipeline, which we named Bellerophon, that is broadly applicable and easy to use. Bellerophon first uses the quality assessment tool TransRate to assess assembly quality, after which it uses a transcripts per million (TPM) filter to remove lowly expressed contigs and CD-HIT-EST to remove highly identical contigs. To validate this method, we performed three benchmark experiments: (1) a computational creation of chimeras, (2) identification of chimeric contigs in a transcriptome assembly, and (3) a simulated RNA-Seq experiment using a known reference transcriptome. Overall, the Bellerophon pipeline removed between 40% and 91.9% of the chimeras in transcriptome assemblies and removed more chimeric than nonchimeric contigs. Thus, the Bellerophon sequence of filtration steps is a broadly applicable solution to improve transcriptome assemblies.

Keywords: chimera; transcriptome filtering; transcriptome quality assessment.
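The TPM filtering step mentioned in the abstract can be illustrated with a minimal sketch. This is not the Bellerophon implementation: the function names, the input dictionaries of mapped read counts and contig lengths, and the cutoff of 1 TPM are all assumptions chosen for illustration. The abstract does not state the threshold the pipeline uses, and in practice the expression estimates would come from a quantification tool rather than raw counts.

```python
from typing import Dict, Set


def compute_tpm(counts: Dict[str, float], lengths: Dict[str, int]) -> Dict[str, float]:
    """Convert per-contig read counts to transcripts per million (TPM).

    Each count is first normalised by contig length in kilobases, then the
    resulting rates are rescaled so that they sum to one million.
    """
    rates = {cid: counts[cid] / (lengths[cid] / 1000.0) for cid in counts}
    total = sum(rates.values())
    if total == 0:
        # No reads mapped anywhere: every contig has zero expression.
        return {cid: 0.0 for cid in rates}
    return {cid: rate / total * 1_000_000 for cid, rate in rates.items()}


def filter_by_tpm(tpm: Dict[str, float], min_tpm: float = 1.0) -> Set[str]:
    """Return the IDs of contigs whose TPM is at least min_tpm.

    The default cutoff of 1.0 is a hypothetical value, not one taken
    from the paper.
    """
    return {cid for cid, value in tpm.items() if value >= min_tpm}


# Hypothetical toy data: three contigs of equal length.
counts = {"contig_a": 900, "contig_b": 100, "contig_c": 0}
lengths = {"contig_a": 1000, "contig_b": 1000, "contig_c": 1000}

tpm = compute_tpm(counts, lengths)
kept = filter_by_tpm(tpm, min_tpm=1.0)
```

In this toy case the contig with no mapped reads falls below the cutoff and is dropped, while the two expressed contigs are retained. According to the abstract, the contigs that survive this step would then go to CD-HIT-EST for removal of highly identical sequences.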

Conflict of interest statement

None declared.

Figures

Figure 1
Bellerophon pipeline default filtration order. Violet paths use TransRate‐Q and BUSCO to assess assembly quality. Orange paths display the sequential filtering steps. The output of each filtering step is used as the input of the following filtering step
Figure 2
TransRate statistics for every filtering step to find the optimal filtration method. Every letter corresponds to a filtering experiment. The side‐table indicates the color and letter coding of the filters applied and their order. Ground bars are made by running TransRate‐Q on the unfiltered transcriptome. (a) TransRate assembly scores for each filtering experiment. Higher scores indicate higher quality. (b) Number of transcripts for each filtering experiment. (c) Number of transcripts with an average per‐base coverage of less than one for each filtering experiment. (d) Number of segmented (chimeric) contigs, that is, contigs with nonuniform expression patterns, for each filtering experiment
Figure 3
Flowchart of chimeras in the pipeline after testing with 500 intentionally created chimeras. Orange numbers are the mean numbers of chimeras in the assembly before and after each filtration step. Bar charts represent the mean percentage of chimeras and other sequences removed by each step ± SEM
Figure 4
Flowchart of the number of contigs (chimeric or not) during the Drosophila melanogaster simulated RNA‐Seq experiment. Numbers in black boxes are the numbers of chimeras (orange) and other sequences (violet) left in the assembly at the different stages. The numbers in black arrows refer to the number of chimeras or other sequences removed by each filtering step. The bar charts display the percentage of chimeras (orange) or other sequences (violet) removed by each filtering step
Figure 5
Number of contigs (nonchimeric, chimeric, or unidentified) and genes during filtering of the Drosophila melanogaster de novo transcriptome. (a) Flowchart of the number of chimeric, correct, and unidentified sequences during filtering by Bellerophon. Numbers in black boxes are the numbers of sequences present in the assembly at the different stages. Numbers are displayed in violet for nonchimeric sequences, in orange for chimeric sequences, and in black for unidentified sequences (sequences that could not be attributed to a gene through a blast). The numbers in black arrows refer to the number of nonchimeric, chimeric, or unidentified sequences removed by each filtering step. The bar charts display the percentage of nonchimeric (violet), chimeric (orange), or unidentified sequences (violet) removed by each filtering step. (b) Number of nonchimeric, chimeric, and unidentified contigs along the different steps of the filtering of the de novo D. melanogaster transcriptome with the Bellerophon pipeline. (c) Number of genes represented by at least one contig in the de novo D. melanogaster transcriptome with the Bellerophon pipeline. Numbers of contigs per gene are displayed in violet (1 contig only), orange (between 2 and 9 contigs), and black (10 contigs or more)
