Genome Biol. 2010;11(10):R104.
doi: 10.1186/gb-2010-11-10-r104. Epub 2010 Oct 21.

FusionSeq: a modular framework for finding gene fusions by analyzing paired-end RNA-sequencing data

Andrea Sboner et al. Genome Biol. 2010.

Abstract

We have developed FusionSeq to identify fusion transcripts from paired-end RNA-sequencing. FusionSeq includes filters to remove spurious candidate fusions with artifacts, such as misalignment or random pairing of transcript fragments, and it ranks candidates according to several statistics. It also has a module to identify exact sequences at breakpoint junctions. FusionSeq detected known and novel fusions in a specially sequenced calibration data set, including eight cancers with and without known rearrangements.

Figures

Figure 1
Schematic of FusionSeq. (a) The PE reads are processed to identify potential fusion candidates. Poor-quality reads are discarded first, and the remaining PE reads are aligned to the reference human genome (hg18). The reads are compared to the annotation set (UCSC Known Genes) in order to classify them as belonging to the same gene or to different genes. Those aligned to two different genes are then selected as potential fusion candidates. All good-quality single-end reads are also stored for the identification of the sequence of the junction. (b) The filtration cascade module analyzes the candidates and removes those that have high sequence homology between the two genes or a higher insert size compared to the transcriptome norm. Additional filters are employed to remove candidates due to random pairing and misalignment as well as PCR artifacts and annotation inconsistencies. The high-confidence list of candidates is then scored and processed to find the sequence of the junction. (c) The junction-sequence identifier detects the actual sequence at the breakpoints by constructing a fusion junction library. It first covers the regions of the potential breakpoint of each gene with 'tiles' 1 nt apart, and then creates all possible combinations, considering both orientations of the fusion, namely gene A upstream of gene B and vice versa. All single-end reads are then aligned to the fusion junction library and the junction with the highest support is identified as the sequence of the fusion transcript junction. DASPER, difference between the observed and analytically calculated expected SPER; RESPER, ratio of empirically computed SPERs; SPER, supportive PE reads.
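The tiling step in panel (c) can be illustrated with a minimal sketch. This is not the FusionSeq implementation: the function names, the tile length, and the toy sequences are assumptions made for illustration only. It cuts each breakpoint region into tiles that start 1 nt apart and joins every tile pair in both orientations.

```python
from itertools import product


def tiles(seq, tile_len):
    """Return (offset, tile) pairs for every window of tile_len nt, 1 nt apart."""
    return [(i, seq[i:i + tile_len]) for i in range(len(seq) - tile_len + 1)]


def junction_library(region_a, region_b, tile_len):
    """Build candidate junction sequences from two breakpoint regions.

    Keys are (orientation, offset_in_a, offset_in_b); values are the
    concatenated tiles. Both orientations are generated: gene A upstream
    of gene B ("A>B") and gene B upstream of gene A ("B>A").
    """
    library = {}
    for (i, a), (j, b) in product(tiles(region_a, tile_len), tiles(region_b, tile_len)):
        library[("A>B", i, j)] = a + b
        library[("B>A", i, j)] = b + a
    return library


# Toy regions (hypothetical sequences, not taken from the paper).
lib = junction_library("ACGTA", "TTGCA", tile_len=3)
print(len(lib), lib[("A>B", 0, 0)], lib[("B>A", 0, 0)])
```

In a real setting, single-end reads would then be aligned against these sequences with a read aligner, and the junction with the most supporting reads would be reported; that alignment step is outside the scope of this sketch.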
Figure 2
Abnormal insert-size principle applied to transcriptome data. The composite model of a gene is created via the union of the exonic nucleotides from all its isoforms. By using the composite model, we can exploit the abnormal insert-size principle. A minimal fusion transcript fragment is created by connecting the regions of the two genes joined by PE reads. Subsequently, the insert-size of these chimeric PE reads is computed and compared to the insert-size distribution of PE reads in the normal transcriptome. An insert-size higher than the transcriptome norm suggests an artifact, since it may result from the random joining of fragments during library generation.
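The composite model described above amounts to an interval union. The following sketch assumes half-open genomic intervals `(start, end)` and uses hypothetical function names; it merges the exons of all isoforms and then measures distances in composite (exon-only) coordinates, which is the coordinate system in which an insert size can be compared with the transcriptome norm. How FusionSeq itself builds the minimal fusion fragment is not reproduced here.

```python
def composite_model(isoforms):
    """Merge the exons of all isoforms into sorted, non-overlapping intervals."""
    exons = sorted(exon for isoform in isoforms for exon in isoform)
    merged = []
    for start, end in exons:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching exon: extend the current interval.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def exonic_offset(model, pos):
    """Count composite-model nucleotides upstream of genomic position pos."""
    offset = 0
    for start, end in model:
        if start <= pos < end:
            return offset + (pos - start)
        offset += end - start
    raise ValueError(f"position {pos} is not exonic in the composite model")


# Two hypothetical isoforms of one gene (coordinates are made up).
isoforms = [[(100, 200), (300, 400)], [(150, 250), (300, 350)]]
model = composite_model(isoforms)
# Distance between two read positions, ignoring intronic sequence.
insert = exonic_offset(model, 310) - exonic_offset(model, 120)
print(model, insert)
```

Measured on the genome, the two positions are 190 nt apart; measured on the composite model, they are 140 nt apart because the intron between the merged exons is skipped.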
Figure 3
Filtration cascade module. (a) The average percentage of candidates identified by the fusion detection module that are removed by each filter is reported. The labels also depict the order in which the filters were applied in this case (counter-clockwise starting from the RepeatMasker filter), but it is worth noting that the order of application does not affect the final list of candidates. (b) RESPER (ratio of empirically computed SPERs) versus depth of sequencing. The plot shows the RESPER values at different sequencing depths for SLC45A3-ERG, a real fusion transcript, and P4HB-KLK3, an artifact likely created by random pairing due to the high expression of KLK3.
Figure 4
Results of FusionSeq. (a) A subset of the PE reads connecting TMPRSS2 and ERG is shown for four samples (106_T, NCI-H660, 1700_D, 580_B). (b) PE reads connecting ERG and SLC45A3 for sample 2621_D. The outer circle reports all chromosomes, whereas the inset shows only the region of ERG and SLC45A3. The gray lines depict the intra-transcript PE reads, whereas the red ones represent the inter-transcript PE reads. Note that for illustration purposes, only the inter-transcript reads are shown for SLC45A3. The inset also depicts the composite model (blue line) and its exons (green boxes). (c) Results of the junction-sequence identifier. The locations of the breakpoints for the four samples with the TMPRSS2-ERG fusion are reported as bars (not to scale). Moreover, the junction sequences and a subset of the aligned reads are reported for two samples (106_T, 580_B). (d) The locations of the PCR primers used for the validation are depicted as red arrows. The isoforms consist of TMPRSS2 and ERG exons fused to form different exon combinations as depicted schematically. For both samples NCI-H660 and 1700_D, isoform III is detected, whereas, for samples 106_T and 580_B, isoforms I and VI are determined, respectively (Table S7 in Additional file 1) [46,56]. The transcript isoforms were validated by a PCR assay for each sample separately (gel images). A 50-nt length standard (lane 1) is shown here for the determination of the approximate fragment size. The identity of the PCR products was validated by Sanger sequencing.
Figure 5
Expression values of the exons of TMPRSS2 and ERG. The RPKM values computed on each exon of ERG (isoform NM_004449.4) and TMPRSS2 (isoform NM_005656.3) are shown as stacked bars for the four samples with TMPRSS2-ERG fusion. For illustration purposes, the exons included in the most common fusion isoforms are labeled as 'FUSED'.
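For readers unfamiliar with the unit, RPKM (reads per kilobase of exon per million mapped reads) normalizes a read count by feature length and sequencing depth. The sketch below applies the standard definition to a single exon; the function name and the example numbers are illustrative assumptions, not values from the paper.

```python
def rpkm(read_count, exon_length_nt, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads.

    RPKM = 1e9 * C / (N * L), where C is the number of reads mapped to the
    exon, N the total number of mapped reads, and L the exon length in nt.
    """
    if exon_length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    return 1e9 * read_count / (total_mapped_reads * exon_length_nt)


# Hypothetical exon: 500 reads on a 1,000-nt exon in a 10-million-read library.
print(rpkm(500, 1_000, 10_000_000))
```

Computing the value per exon, as in the figure, makes a drop or jump in expression at the fusion boundary visible, which a gene-level RPKM would average out.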

References

    1. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008;320:1344–1349. doi: 10.1126/science.1158441.
    2. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63. doi: 10.1038/nrg2484.
    3. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621–628. doi: 10.1038/nmeth.1226.
    4. Hillier LW, Reinke V, Green P, Hirst M, Marra MA, Waterston RH. Massively parallel sequencing of the polyadenylated transcriptome of C. elegans. Genome Res. 2009;19:657–666. doi: 10.1101/gr.088112.108.
    5. Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O'Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo M. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008;321:956–960. doi: 10.1126/science.1160342.
