FusionSeq: a modular framework for finding gene fusions by analyzing paired-end RNA-sequencing data

Andrea Sboner¹, Lukas Habegger, Dorothee Pflueger, Stephane Terry, David Z Chen, Joel S Rozowsky, Ashutosh K Tewari, Naoki Kitabayashi, Benjamin J Moss, Mark S Chee, Francesca Demichelis, Mark A Rubin, Mark B Gerstein

PMID: 20964841
PMCID: PMC3218660
DOI: 10.1186/gb-2010-11-10-r104

Andrea Sboner et al. Genome Biol. 2010;11(10):R104. doi: 10.1186/gb-2010-11-10-r104. Epub 2010 Oct 21.

Affiliation

¹ Program in Computational Biology and Bioinformatics, Yale University, 300 George Street, New Haven, CT 06511, USA. andrea.sboner@yale.edu

Abstract

We have developed FusionSeq to identify fusion transcripts from paired-end RNA-sequencing. FusionSeq includes filters to remove spurious candidate fusions with artifacts, such as misalignment or random pairing of transcript fragments, and it ranks candidates according to several statistics. It also has a module to identify exact sequences at breakpoint junctions. FusionSeq detected known and novel fusions in a specially sequenced calibration data set, including eight cancers with and without known rearrangements.

Figures

**Figure 1**
**Schematic of FusionSeq**. **(a)** The PE reads are processed to identify potential fusion candidates. Poor-quality reads are discarded first, and the remaining PE reads are aligned to the reference human genome (hg18). The reads are compared to the annotation set (UCSC Known Genes) in order to classify them as belonging to the same gene or to different genes. Those aligned to two different genes are then selected as potential fusion candidates. All good-quality single-end reads are also stored for the identification of the sequence of the junction. **(b)** The filtration cascade module analyzes the candidates and removes those that have high sequence homology between the two genes or a higher insert size compared to the transcriptome norm. Additional filters are employed to remove candidates due to random pairing and misalignment as well as PCR artifacts and annotation inconsistencies. The high-confidence list of candidates is then scored and processed to find the sequence of the junction. **(c)** The junction-sequence identifier detects the actual sequence at the breakpoints by constructing a fusion junction library. It first covers the regions of the potential breakpoint of each gene with 'tiles' 1 nt apart, and then creates all possible combinations, considering both orientations of the fusion, namely gene A upstream of gene B and *vice versa*. All single-end reads are then aligned to the fusion junction library and the junction with the highest support is identified as the sequence of the fusion transcript junction. *DASPER*, difference between the observed and analytically calculated expected SPER; *RESPER*, ratio of empirically computed SPERs; *SPER*, supportive PE reads.
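
The tiling step in panel (c) can be illustrated with a minimal Python sketch. It is not FusionSeq's implementation: the function names, the toy sequences, the tile length, and the use of exact substring matching in place of a real read aligner are all assumptions made for illustration.

```python
from itertools import product


def tiles(region: str, tile_len: int) -> list[str]:
    """Return every substring of length tile_len, with starts 1 nt apart."""
    return [region[i:i + tile_len] for i in range(len(region) - tile_len + 1)]


def junction_library(region_a: str, region_b: str, tile_len: int) -> dict[str, str]:
    """Pair every tile of gene A with every tile of gene B, in both orientations."""
    library = {}
    pairs = product(enumerate(tiles(region_a, tile_len)),
                    enumerate(tiles(region_b, tile_len)))
    for (i, tile_a), (j, tile_b) in pairs:
        library[f"A:{i}|B:{j}"] = tile_a + tile_b  # gene A upstream of gene B
        library[f"B:{j}|A:{i}"] = tile_b + tile_a  # gene B upstream of gene A
    return library


def best_junction(library: dict[str, str], reads: list[str]) -> str:
    """Pick the junction supported by the most reads.

    Assumption: exact substring matching stands in for read alignment.
    """
    support = {label: sum(read in seq for read in reads)
               for label, seq in library.items()}
    return max(support, key=support.get)


# Toy breakpoint regions and single-end reads (hypothetical data).
lib = junction_library("ACGTAC", "TTGCA", tile_len=4)
winner = best_junction(lib, ["GTTTG", "CGTTTGC"])
print(len(lib), winner, lib[winner])
```

With these toy inputs, both reads span only the junction that joins the first tile of region A to the first tile of region B, so that junction receives the highest support.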

**Figure 2**
**Abnormal insert-size principle applied to transcriptome data**. The composite model of a gene is created via the union of the exonic nucleotides from all its isoforms. By using the composite model, we can exploit the abnormal insert-size principle. A minimal fusion transcript fragment is created by connecting the regions of the two genes joined by PE reads. Subsequently, the insert size of these chimeric PE reads is computed and compared to the insert-size distribution of PE reads in the normal transcriptome. An insert size higher than the transcriptome norm suggests an artifact, since it may result from the random joining of fragments during library generation.
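
As a rough illustration of the two ideas in this caption, the sketch below builds a composite model by merging exon intervals across isoforms and flags an insert size that exceeds a cutoff taken from the normal insert-size distribution. The half-open interval convention, the function names, and the 0.99 quantile cutoff are assumptions; the caption does not state the threshold FusionSeq uses.

```python
def composite_model(isoforms):
    """Union of exon intervals (half-open [start, end)) over all isoforms."""
    exons = sorted(exon for isoform in isoforms for exon in isoform)
    merged = []
    for start, end in exons:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching exon: extend the previous interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def exonic_length(model):
    """Total number of exonic nucleotides in a composite model."""
    return sum(end - start for start, end in model)


def is_abnormal(insert_size, normal_sizes, quantile=0.99):
    """Flag an insert size above an assumed quantile of the normal distribution."""
    ranked = sorted(normal_sizes)
    cutoff = ranked[min(len(ranked) - 1, int(quantile * len(ranked)))]
    return insert_size > cutoff


# Two hypothetical isoforms of one gene, as lists of exon intervals.
isoforms = [[(100, 200), (300, 400)],
            [(150, 250), (300, 350), (500, 600)]]
model = composite_model(isoforms)
print(model, exonic_length(model))

# Hypothetical insert sizes observed for normal, same-gene PE reads.
normal = list(range(100, 300))
print(is_abnormal(900, normal), is_abnormal(200, normal))
```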

**Figure 3**
**Filtration cascade module**. **(a)** The average percentage of candidates identified by the fusion detection module that are removed by each filter is reported. The labels also show the order in which the filters were applied in this case (counter-clockwise starting from the RepeatMasker filter), but it is worth noting that the order of application of the filters does not affect the final list of candidates. **(b)** *RESPER* (ratio of empirically computed SPERs) versus depth of sequencing. The plot shows the *RESPER* values at different sequencing depths for *SLC45A3-ERG*, a real fusion transcript, and *P4HB-KLK3*, an artifact likely created by random pairing due to the high expression of *KLK3*.

**Figure 4**
**Results of FusionSeq**. **(a)** A subset of the PE reads connecting *TMPRSS2* and *ERG* is shown for four samples (106_T, NCI-H660, 1700_D, 580_B). **(b)** PE reads connecting *ERG* and *SLC45A3* for sample 2621_D. The outer circle reports all chromosomes, whereas the inset shows only the region of *ERG* and *SLC45A3*. The gray lines depict the intra-transcript PE reads, whereas the red ones represent the inter-transcript PE reads. Note that for illustration purposes, only the inter-transcript reads are shown for *SLC45A3*. The inset also depicts the composite model (blue line) and its exons (green boxes). **(c)** Results of the junction-sequence identifier. The locations of the breakpoints for the four samples with the *TMPRSS2-ERG* fusion are reported as bars (not to scale). Moreover, the sequences of the junctions as well as a subset of the aligned reads are reported for two samples (106_T, 580_B). **(d)** The locations of the PCR primers used for the validation are depicted as red arrows. The isoforms consist of *TMPRSS2* and *ERG* exons fused to form different exon combinations as depicted schematically. For both samples NCI-H660 and 1700_D, isoform III is detected, whereas, for samples 106_T and 580_B, isoforms I and VI are determined, respectively (Table S7 in Additional file 1) [46,56]. The transcript isoforms were validated by a PCR assay for each sample separately (gel images). A 50-nt length standard (lane 1) is shown here for the determination of the approximate fragment size. The identity of the PCR products was validated by Sanger sequencing.

**Figure 5**
Expression values of the exons of *TMPRSS2* and *ERG*. The RPKM values computed on each exon of *ERG* (isoform NM_004449.4) and *TMPRSS2* (isoform NM_005656.3) are shown as stacked bars for the four samples with *TMPRSS2-ERG* fusion. For illustration purposes, the exons included in the most common fusion isoforms are labeled as 'FUSED'.
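
For readers unfamiliar with the unit, RPKM is reads per kilobase of exon per million mapped reads. The sketch below computes it for a single exon using the standard definition; the function name and the example numbers are illustrative and are not values from the paper.

```python
def rpkm(read_count: int, exon_length_nt: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon per million mapped reads.

    RPKM = 1e9 * reads_on_exon / (exon_length_in_nt * total_mapped_reads)
    """
    return 1e9 * read_count / (exon_length_nt * total_mapped_reads)


# Hypothetical exon: 500 reads on a 1,000-nt exon, 10 million mapped reads.
value = rpkm(500, 1_000, 10_000_000)
print(value)
```

Because the value is normalized by both exon length and sequencing depth, exons of different sizes can be compared within a sample, and the same exon can be compared across samples.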

References

1. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008;320:1344–1349. doi: 10.1126/science.1158441.
2. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63. doi: 10.1038/nrg2484.
3. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621–628. doi: 10.1038/nmeth.1226.
4. Hillier LW, Reinke V, Green P, Hirst M, Marra MA, Waterston RH. Massively parallel sequencing of the polyadenylated transcriptome of C. elegans. Genome Res. 2009;19:657–666. doi: 10.1101/gr.088112.108.
5. Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O'Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo M. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008;321:956–960. doi: 10.1126/science.1160342.
