BMC Genomics. 2017 Jan 3;18(1):7.
doi: 10.1186/s12864-016-3404-9.

ChimPipe: accurate detection of fusion genes and transcription-induced chimeras from RNA-seq data

Bernardo Rodríguez-Martín et al. BMC Genomics. 2017.

Abstract

Background: Chimeric transcripts are commonly defined as transcripts linking two or more different genes in the genome, and can be explained by various biological mechanisms such as genomic rearrangement, read-through or trans-splicing, but also by technical or biological artefacts. Several studies have shown their importance in cancer, cell pluripotency and motility. Many programs have recently been developed to identify chimeras from Illumina RNA-seq data (mostly fusion genes in cancer). However, the outputs of different programs on the same dataset can be widely inconsistent and tend to include many false positives. Other issues relate to simulated datasets restricted to fusion genes, real datasets with limited numbers of validated cases, inconsistent results between simulated and real datasets, and assessment at the gene level rather than the junction level.

Results: Here we present ChimPipe, a modular and easy-to-use method to reliably identify fusion genes and transcription-induced chimeras from paired-end Illumina RNA-seq data. We have also produced realistic simulated datasets for three different read lengths, and enhanced two gold-standard cancer datasets by associating exact junction points with validated gene fusions. Benchmarking ChimPipe together with four other state-of-the-art tools on these data showed ChimPipe to be the top program at identifying exact junction coordinates for both kinds of datasets, and the one showing the best trade-off between sensitivity and precision. Applied to 106 ENCODE human RNA-seq datasets, ChimPipe identified 137 high-confidence chimeras connecting the protein coding sequence of their parent genes. In subsequent experiments, three out of four predicted chimeras, two of which were recurrently expressed in a large majority of the samples, could be validated. Cloning and sequencing of the three cases revealed several new chimeric transcript structures, three of which have the potential to encode a chimeric protein for which we hypothesized a new role. Applying ChimPipe to human and mouse ENCODE RNA-seq data led to the identification of 131 recurrent chimeras common to both species, and therefore potentially conserved.

Conclusions: ChimPipe combines discordant paired-end reads and split-reads to detect any kind of chimera, including those originating from polymerase read-through, and shows an excellent trade-off between sensitivity and precision. The chimeras found by ChimPipe can be validated in vitro with high accuracy.

Keywords: Benchmark; Cancer; Chimera; Fusion gene; Isoform; RNA-seq; Simulation; Splice junction; Transcript.

Figures

Fig. 1
Two types of RNA-seq reads for chimera detection. This picture shows a chimeric transcript (bottom) made from exons of two genes, A and B, depicted in blue and red respectively (top). This chimeric transcript is supported by two types of reads, a split-read and a discordant paired-end read, which we depict aligned both on the genome (middle-top) and on the transcriptome (middle-bottom). The chimeric junction position on the transcriptome is highlighted by a yellow star both in the split-read and in the chimeric transcript.
Fig. 2
The ChimPipe method. (a) RNA-seq reads are first mapped to the genome and transcriptome using the GEMtools RNA-seq pipeline, and the reads that do not map this way are passed to the GEM RNA-mapper to get reads that split-map to different chromosomes or strands. (b) The split-reads from these two mapping steps are then gathered and passed on to the ChimSplice module, which derives consensus junctions and associates each with its expression, calculated as the number of staggered split-reads supporting it. The ChimPE module can then associate each chimeric junction found by ChimSplice with its discordant PE reads, splitting them into those consistent and those inconsistent with the junction. (c) The ChimFilter module then applies a series of filters to the chimeric junctions obtained up to this point in order to discard false positives, leading to (d) a set of reliable chimeric junctions with which it associates several pieces of information, such as a category (readthrough, intrachromosomal, inverted, interstrand, or interchromosomal) and the supporting evidence in terms of number of staggered split-reads and number of consistent PE reads, among others.
Fig. 3
Benchmark results for 5 chimera detection programs on simulated (left) and on real (right) data. The sets of barplots at the top (a, b) indicate the programs’ performances at the gene pair level, while the sets of barplots at the bottom (c, d) indicate the programs’ performances at the junction level. For simulated data the provided measures are sensitivity (in red), precision (in blue), and F1 score (in green), while for the two real datasets (Berger in red and Edgren in blue), the only provided measures are sensitivity (bars) and the total number of predictions (at the top of each bar). Here we show the results on PE76 simulated data, for the 250 simulated chimeric junctions (i.e. including read-through events). For the benchmark on real data, read-through events, i.e. junctions with a length smaller than 100 kb when on the same chromosome, same strand and expected genomic order, were removed from the output of each program before the evaluation.
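The read-through rule and the three measures named in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark's actual code: the `Junction` record and its field names are hypothetical, and "expected genomic order" is read here as the donor site lying upstream of the acceptor site in the direction of transcription.

```python
from typing import NamedTuple, Set, Tuple


class Junction(NamedTuple):
    # Hypothetical record; field names are illustrative, not ChimPipe's output format.
    donor_chrom: str
    donor_pos: int
    donor_strand: str
    acceptor_chrom: str
    acceptor_pos: int
    acceptor_strand: str


def is_readthrough(j: Junction, max_len: int = 100_000) -> bool:
    """Same chromosome, same strand, expected order, and length below 100 kb."""
    if j.donor_chrom != j.acceptor_chrom or j.donor_strand != j.acceptor_strand:
        return False
    # Assumption: on the minus strand, "upstream" means a higher coordinate.
    if j.donor_strand == "+":
        length = j.acceptor_pos - j.donor_pos
    else:
        length = j.donor_pos - j.acceptor_pos
    return 0 < length < max_len


def scores(predicted: Set, truth: Set) -> Tuple[float, float, float]:
    """Return (sensitivity, precision, F1) for two sets of junctions or gene pairs."""
    tp = len(predicted & truth)  # true positives: predictions found in the reference
    sensitivity = tp / len(truth) if truth else 0.0
    precision = tp / len(predicted) if predicted else 0.0
    denom = sensitivity + precision
    f1 = 2 * sensitivity * precision / denom if denom else 0.0
    return sensitivity, precision, f1
```

Under this sketch, read-through removal on real data would amount to keeping only the predictions for which `is_readthrough` returns `False` before calling `scores`. The same `scores` function applies at the gene pair level or the junction level, depending on what the set elements are.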
Fig. 4
Distance between predicted and true junction. For the PE76 simulated set (a), the Berger cancer dataset (b) and the Edgren cancer dataset (c), and for each chimera detection program, the distance between the reference/true junction and the junction predicted by the program is plotted in log scale and using a pseudocount of 1 to avoid zero values. The distance between two junctions is defined as the sum of the distance between their donor/upstream/5’ splice sites and the distance between their acceptor/downstream/3’ splice sites.
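The distance defined in this caption is simple enough to write down directly. The following Python sketch assumes that each junction is given as a (donor position, acceptor position) pair and that both junctions lie on the same chromosomes. The caption does not state the base of the logarithm, so base 10 is an assumption here.

```python
import math
from typing import Tuple

Site = Tuple[int, int]  # (donor position, acceptor position); an illustrative encoding


def junction_distance(predicted: Site, reference: Site) -> int:
    """Sum of the donor-site distance and the acceptor-site distance."""
    donor_dist = abs(predicted[0] - reference[0])
    acceptor_dist = abs(predicted[1] - reference[1])
    return donor_dist + acceptor_dist


def log_distance(distance: int) -> float:
    """Log-scale value with a pseudocount of 1, so an exact match maps to 0."""
    return math.log10(distance + 1)  # base 10 is assumed, not stated in the caption
```

A prediction that matches the reference exactly has distance 0 and therefore a plotted value of 0, which is why the pseudocount is needed.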
Fig. 5
Chimeric gene pairs predicted by the 5 programs on the two real datasets. Intersections between chimeric gene pairs predicted by the 5 programs on the Berger set (a) and on the Edgren set (b) are represented as Venn diagrams. In general, gene pairs predicted by all 5 programs are few compared to the gene pairs predicted by a single program, and we expect that the higher the number of programs predicting a gene pair, the more reliable the gene pair. Chimerascan and TopHatFusion are the programs that predict the most gene pairs predicted by no other program, while PRADA, ChimPipe and FusionMap are the programs with the fewest such gene pairs. CP: ChimPipe, FM: FusionMap, PR: PRADA, CS: Chimerascan, THF: TopHatFusion.
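The quantities read off such Venn diagrams, namely how many programs support each gene pair and how many pairs are unique to one program, can be computed as in the Python sketch below. The function names and the input layout (a mapping from program name to its predicted gene pairs) are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

GenePair = Tuple[str, str]


def support_counts(predictions: Dict[str, Iterable[GenePair]]) -> Counter:
    """Count, for each gene pair, how many programs predicted it."""
    counts: Counter = Counter()
    for pairs in predictions.values():
        counts.update(set(pairs))  # set(): a program counts once per gene pair
    return counts


def unique_per_program(predictions: Dict[str, Iterable[GenePair]]) -> Dict[str, int]:
    """Number of gene pairs each program predicts that no other program predicts."""
    counts = support_counts(predictions)
    return {
        program: sum(1 for pair in set(pairs) if counts[pair] == 1)
        for program, pairs in predictions.items()
    }
```

Ranking gene pairs by their value in `support_counts` follows the caption's expectation that a pair predicted by more programs is more reliable.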
Fig. 6
UBA2-WTIP chimeric transcript isoforms. (a) Experimentally validated UBA2-WTIP chimeric transcript isoforms. (Top) UBA2 and WTIP parent transcripts according to RefSeq version 74. Coding and UTR exonic sequences are displayed as thick and thin boxes, respectively, and introns as lines. The genomic strand of the transcripts is represented as an arrow on the 5’ end. (Bottom) Chimeric RNAs with chimeric splice junctions are depicted as yellow dashed lines. On the left is the list of cancer cell lines in which each isoform was validated. (b) UBA2-WTIP chimeric splice junction validation. (Left) Primer design for validating the chimeric junction through RT-PCR plus Sanger sequencing. (Right) Chimeric junction validation in 4 different cell lines. The 72 bp amplicons proving the expression of the chimeric RNAs are highlighted in red. (c) UBA2-WTIP Q1 isoform protein coding potential. (Top) UBA2 and WTIP annotated start and stop codons represented over the transcript sequence. (Bottom) ORFs in the six possible frames. The selected ORF from the UBA2 annotated start codon to the WTIP annotated stop codon is highlighted in dark yellow. (d) Putative chimeric protein encoded by the UBA2-WTIP Q1 isoform. (Top) UBA2 and WTIP wild-type proteins. The exact position of the two protein breakpoints is indicated by yellow stars. Protein domains are depicted as boxes and triangles over the protein sequences. Thin boxes on the WTIP protein sequence correspond to low complexity regions. The x axis shows the amino acid position along the protein sequence. (Bottom) Putative UBA2-WTIP chimeric protein. Full-length domains are represented over the protein sequence. (e) The predicted 3D structure of the UBA2-WTIP chimeric protein as modelled by Phyre2 [51]. The chimeric protein part derived from UBA2 is depicted in blue and the one derived from WTIP in red.

References

    1. Gingeras TR. Implications of chimaeric non-co-linear transcripts. Nature. 2009;461:206–11. doi: 10.1038/nature08452.
    2. Mitelman F, Johansson B, Mertens F. The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer. 2007;7:233–45. doi: 10.1038/nrc2091.
    3. Akiva P, Toporik A, Edelheit S, Peretz Y, Diber A, Shemesh R, et al. Transcription-mediated gene fusion in the human genome. Genome Res. 2006;16:30–6. doi: 10.1101/gr.4137606.
    4. Parra G, Reymond A, Dabbouseh N, Dermitzakis ET, Castelo R, Thomson TM, et al. Tandem chimerism as a means to increase protein complexity in the human genome. Genome Res. 2006;16:37–44. doi: 10.1101/gr.4145906.
    5. Unneberg P, Claverie JM. Tentative mapping of transcription-induced interchromosomal interaction using chimeric EST and mRNA data. PLoS ONE. 2007;2:e254. doi: 10.1371/journal.pone.0000254.
