Cell Syst. 2021 Aug 18;12(8):827-838.e5.
doi: 10.1016/j.cels.2021.05.021. Epub 2021 Jun 18.

A community challenge to evaluate RNA-seq, fusion detection, and isoform quantification methods for cancer discovery

Allison Creason et al. Cell Syst. 2021.

Abstract

The accurate identification and quantitation of RNA isoforms present in the cancer transcriptome is key for analyses ranging from the inference of the impacts of somatic variants to pathway analysis to biomarker development and subtype discovery. The ICGC-TCGA DREAM Somatic Mutation Calling in RNA (SMC-RNA) challenge was a crowd-sourced effort to benchmark methods for RNA isoform quantification and fusion detection from bulk cancer RNA sequencing (RNA-seq) data. It concluded in 2018 with a comparison of 77 fusion detection entries and 65 isoform quantification entries on 51 synthetic tumors and 32 cell lines with spiked-in fusion constructs. We report the entries used to build this benchmark, the leaderboard results, and the experimental features associated with the accurate prediction of RNA species. This challenge required submissions to be in the form of containerized workflows, meaning each of the entries described is easily reusable through CWL and Docker containers at https://github.com/SMC-RNA-challenge. A record of this paper's transparent peer review process is included in the supplemental information.

Keywords: Cancer; Cloud compute; DREAM Challenge; RNA fusion; RNA-seq; benchmark; crowd-sourced; evaluation; isoform quantification.

Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

Figures

Figure 1. Overview of the challenge
(A–C) The challenge generated simulated (or in silico) and spike-in datasets represented as RNA-seq reads (FastQ files) and ground truth. Challenge participants could submit entries (i.e., CWL workflows and Docker images) as individuals or teams using Synapse. Submitted entries were run on the FastQ files using cloud-based compute resources to generate predictions. The resulting predictions were evaluated based on statistical performance measurements. Evaluation of the Fusion Detection sub-challenge (B) used four types of input datasets to calculate sensitivity and either precision or the total number of fusion calls. Datasets where the fusion genes are known are represented as red (5’ donor) and blue (3’ acceptor), and datasets where unknown fusion genes may exist are represented as light and dark gray. The confusion matrix displays the known (green), unknown (red), and irrelevant (gray) parameters used to calculate the subsequent statistical metrics. Evaluation of the isoform quantification sub-challenge (C) used two metrics for evaluating the correlation of predictions to the truth. The transcriptome-wise evaluation compared predictions and truth in a single sample across all transcripts using a Spearman correlation. The sample-wise evaluation compared predictions and truth for a single transcript across multiple sample replicates using Kendall’s tau-β.
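The metrics named in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the challenge's scoring code: the function names, the representation of a fusion as a (donor, acceptor) gene pair, and the toy inputs are all assumptions made for the example. Precision is only meaningful when every true fusion in the dataset is known, which is why the caption mentions the total number of calls as the alternative for datasets that may contain unknown fusions.

```python
from statistics import mean


def sensitivity_precision(predicted, truth):
    """Score fusion calls given as sets of (donor, acceptor) gene pairs.

    Assumes the truth set is complete; otherwise precision is not reliable.
    """
    tp = len(predicted & truth)   # known fusions that were called
    fn = len(truth - predicted)   # known fusions that were missed
    fp = len(predicted - truth)   # calls with no matching known fusion
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision


def average_ranks(values):
    """Rank values starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one vector is constant
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical gene pairs, used only to exercise the functions.
    truth = {("GENE_A", "GENE_B"), ("GENE_C", "GENE_D"), ("GENE_E", "GENE_F")}
    calls = {("GENE_A", "GENE_B"), ("GENE_C", "GENE_D"), ("GENE_G", "GENE_H")}
    print(sensitivity_precision(calls, truth))
    # Transcriptome-wise comparison: one sample, all transcripts.
    print(spearman([5.0, 0.0, 12.5, 3.1], [4.2, 0.3, 10.0, 2.9]))
```

In the transcriptome-wise setting, `x` and `y` would hold the predicted and true abundances of every transcript in one sample.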
Figure 2. Boruta feature importance analysis across fusion submissions
(A–D) Heatmaps showing results from performing the Boruta algorithm on each submission’s false-positive fusion events (A) and false-negative fusion events (B). Each cell in the heatmap represents the Z score of the mean decrease in accuracy (MDA). Higher Z scores are in red and represent more important features. Rows are the fusion submission names and columns are the features. Only features that had a mean value greater than Boruta’s shadow maximum value are shown. (C and D) Boxplots showing results from performing the Boruta algorithm on all Fusion Detection sub-challenge submissions. (C) shows the importance analysis against false positives and (D) against false negatives. The y axis represents the Z score of the MDA and the features are on the x axis. The red plots are the Z scores of the actual features and the blue plots are Boruta’s shadow features, which serve as the randomized background features. Only features that performed better (p < 0.05) than the random features are shown in these plots.
Figure 3. Isoform abundance Kendall Tau-β correlation coefficient bootstrap
(A–C) Ranking of methods based on their performance in predicting isoform levels as measured by 1,000 bootstrap replicates of the Kendall Tau-β score (KTBS) (see STAR methods). The x axis represents the submissions and the y axis the KTBS. Each boxplot summarizes the 1,000 bootstrap mean Tau-β scores for one submission. Results of the Student’s t test for closely ranked submissions are shown between boxplots. p values greater than 0.05 were considered ties between submissions. (B and C) Kendall’s tau-β correlation by transcript and submission method. Plots show Kendall’s tau-β correlation coefficient for each transcript with Submission ID across the x axis (B) or transcript across the x axis (C). The color corresponds to the feature in the legend.
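As a rough illustration of the score behind this figure, the sketch below computes a tie-corrected Kendall rank correlation (conventionally called tau-b, written tau-β in the caption) and bootstraps the mean of a set of per-transcript scores. It is a sketch under stated assumptions, not the procedure from the STAR methods: the function names are invented for the example, and resampling per-transcript scores with replacement is an assumed reading of "1,000 bootstrap replicates".

```python
import random
from statistics import mean


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences, corrected for ties."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied in both: excluded from both terms
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_x = concordant + discordant + ties_y_only  # pairs not tied in x
    untied_y = concordant + discordant + ties_x_only  # pairs not tied in y
    denom = (untied_x * untied_y) ** 0.5
    return (concordant - discordant) / denom if denom else 0.0


def bootstrap_means(scores, n_boot=1000, seed=0):
    """Resample scores with replacement and return the mean of each resample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [mean(rng.choices(scores, k=len(scores))) for _ in range(n_boot)]


if __name__ == "__main__":
    # Sample-wise comparison: one transcript across replicate samples.
    truth = [1.0, 2.0, 2.0, 3.0]       # hypothetical true abundances
    predicted = [0.9, 2.2, 2.5, 3.1]   # hypothetical predictions
    print(kendall_tau_b(truth, predicted))

    # Hypothetical per-transcript scores for one submission.
    per_transcript = [0.91, 0.75, 0.88, 0.62, 0.95]
    means = bootstrap_means(per_transcript)
    print(len(means), min(means), max(means))
```

The spread of the bootstrap means is what allows closely ranked submissions to be compared rather than ordered on a single point estimate.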

