SimBA: A methodology and tools for evaluating the performance of RNA-Seq bioinformatic pipelines

Jérôme Audoux¹,², Mikaël Salson³, Christophe F Grosset⁴, Sacha Beaumeunier¹,², Jean-Marc Holder¹,², Thérèse Commes¹,², Nicolas Philippe⁵,⁶

Affiliations

¹ SeqOne, IRMB, CHRU de Montpellier -Hopital St Eloi, 80 avenue Augustin Fliche, Montpellier, 34295, France.
² Institute of Computational Biology, Montpellier, 860, Rue Saint-Priest, Montpellier Cedex 5, 34095, France.
³ University Lille, CNRS, Centrale Lille, Inria, UMR 9189 - CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille, Lille, F-59000, France.
⁴ University Bordeaux, Inserm, BMGIC, U1035, Bordeaux, 33076, France.
⁵ SeqOne, IRMB, CHRU de Montpellier -Hopital St Eloi, 80 avenue Augustin Fliche, Montpellier, 34295, France. nphilippe.research@gmail.com.
⁶ Institute of Computational Biology, Montpellier, 860, Rue Saint-Priest, Montpellier Cedex 5, 34095, France. nphilippe.research@gmail.com.

PMID: 28969586
PMCID: PMC5623974
DOI: 10.1186/s12859-017-1831-5

Jérôme Audoux et al. BMC Bioinformatics. 2017 Sep 29;18(1):428. doi: 10.1186/s12859-017-1831-5.

Abstract

Background: The evolution of next-generation sequencing (NGS) technologies has led to an increased focus on RNA-Seq. Many bioinformatic tools have been developed for RNA-Seq analysis, each with its own performance characteristics and configuration parameters. Users face an increasingly complex task in understanding which tools best suit their specific needs and how they should be configured. To help answer these questions, we investigate the performance of leading bioinformatic tools designed for RNA-Seq analysis and propose a methodology for the systematic evaluation and comparison of performance, so that users can make well-informed choices.

Results: To evaluate RNA-Seq pipelines, we developed a suite of two benchmarking tools. SimCT generates simulated datasets that get as close as possible to specific real biological conditions, accompanied by the list of genomic incidents and mutations that have been inserted. BenchCT then compares the output of any bioinformatics pipeline that has been run on a SimCT dataset with the simulated genomic and transcriptional variations it contains, giving an accurate evaluation of the pipeline's performance in addressing a specific biological question. We used these tools to simulate a real-world genomic medicine question involving the comparison of healthy and cancerous cells. Results revealed that performance in addressing a particular biological context varied significantly depending on the choice of tools and settings used. We also found that by combining the output of certain pipelines, substantial performance improvements could be achieved.

Conclusion: Our research emphasizes the importance of selecting and configuring bioinformatic tools for the specific biological question being investigated to obtain optimal results. Pipeline designers, developers and users should include benchmarking in the context of their biological question as part of their design and quality control process. Our SimBA suite of benchmarking tools provides a reliable basis for comparing the performance of RNA-Seq bioinformatics pipelines in addressing a specific biological question. We would like to see the creation of a reference corpus of data-sets that would allow accurate comparison between benchmarks performed by different groups, and the publication of more benchmarks based on this public corpus. The SimBA software and data-set are available at http://cractools.gforge.inria.fr/softwares/simba/ .

Keywords: Benchmark; Pipeline optimization; RNA-Seq; Transcriptomics.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
Overview of the SimBA benchmarking procedure. A benchmarking pipeline implemented with *SimBA* is composed of three components: i/ simulation of synthetic data using *SimCT*, ii/ processing of the synthetic data using a pipeline manager (i.e. Snakemake [20]), iii/ qualitative evaluation of the results using *BenchCT*

**Fig. 2**
SimCT method. SimCT uses a reference FASTA and GTF annotations as input. A first process introduces biological variations into this reference to create a mutated reference. This new reference is then transferred to FluxSimulator in order to generate an RNA-Seq experiment. Finally, the FluxSimulator output is post-processed to transfer the coordinates from the mutated genome to the original reference

**Fig. 3**
BenchCT evaluation procedures. Each event is evaluated with BenchCT using a specific procedure that allows approximate matching. For alignment, only the overlap between the prediction and the truth is evaluated. For splice junctions and fusions, we expect an overlap between the prediction and a candidate in the truth database, with an agreement distance limited by the threshold. For mutations (SNVs and indels), a similar procedure is used, together with a verification of the mutation itself: for SNVs we evaluate the mutated sequence, and for insertions and deletions the length of the mutation
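
The matching rules in this caption can be illustrated with a minimal Python sketch. This is not BenchCT's code or interface: the class names, field names, coordinate convention (0-based, end-exclusive) and the `max_dist` parameter are all assumptions made for illustration, and fusions spanning two chromosomes are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical genomic interval: 0-based start, exclusive end (assumed)."""
    chrom: str
    start: int
    end: int


@dataclass(frozen=True)
class Mutation:
    """Hypothetical mutation record; kind is 'snv', 'ins' or 'del'."""
    chrom: str
    pos: int
    kind: str
    alt: str = ""    # mutated base, only meaningful for SNVs
    length: int = 0  # event length, only meaningful for indels


def overlaps(pred: Interval, truth: Interval) -> bool:
    """Alignment rule: the prediction only needs to overlap the truth."""
    return (pred.chrom == truth.chrom
            and pred.start < truth.end
            and truth.start < pred.end)


def junction_matches(pred: Interval, truth: Interval, max_dist: int) -> bool:
    """Splice-junction rule: overlap, with both ends within max_dist."""
    return (overlaps(pred, truth)
            and abs(pred.start - truth.start) <= max_dist
            and abs(pred.end - truth.end) <= max_dist)


def mutation_matches(pred: Mutation, truth: Mutation, max_dist: int) -> bool:
    """Mutation rule: close enough in position, then verify the event."""
    if pred.chrom != truth.chrom or pred.kind != truth.kind:
        return False
    if abs(pred.pos - truth.pos) > max_dist:
        return False
    if pred.kind == "snv":
        return pred.alt == truth.alt      # compare the mutated sequence
    return pred.length == truth.length    # compare the indel length
```

Under these assumptions, a predicted junction `chr1:1000-2000` matches a true junction `chr1:1002-1999` with `max_dist=5` but not with `max_dist=1`.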

**Fig. 4**
Precision and recall of SNV calling. a SNV precision/recall in *GRCh38-150bp-normal* data-set. b SNV detection in *GRCh38-150bp-somatic* data-set
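
For readers unfamiliar with the two metrics plotted here, the following short sketch computes them from event counts using the standard definitions. The function name and the example counts are hypothetical and are not taken from the paper's results.

```python
def precision_recall(true_positives: int,
                     false_positives: int,
                     false_negatives: int) -> tuple[float, float]:
    """Return (precision, recall) from counts of evaluated events.

    precision = TP / (TP + FP): share of reported events that are real.
    recall    = TP / (TP + FN): share of real events that were reported.
    Both fall back to 0.0 when their denominator is zero (a choice made
    for this sketch, not a convention stated in the source).
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: 80 simulated SNVs found, 20 spurious calls,
# 120 simulated SNVs missed.
precision, recall = precision_recall(80, 20, 120)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these made-up counts the caller reports few false SNVs (precision 0.80) but misses most of the simulated ones (recall 0.40), which is the kind of trade-off a precision/recall plot makes visible.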

**Fig. 5**
Precision and recall of indel calling. a Insertion precision/recall in *GRCh38-150bp-somatic*. b Intersections of true-positive insertions found by calling pipelines in the *GRCh38-150bp-somatic* data-set

**Fig. 6**
Precision and recall of gene fusion detection. Evaluation of gene fusion detection pipelines on the *GRCh38-101bp-160-somatic* dataset. Fusions were split into two categories, each evaluated individually. a Colinear fusions, where the fusion involves two genomic locations on the same strand of the same chromosome at a distance greater than 300 kb. b Non-colinear fusions, which do not satisfy the *colinear* criteria
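
The colinear criteria stated in this caption can be written as a small predicate. The sketch below is only an illustration: the `Breakpoint` type, its fields and the function name are hypothetical, and only the three criteria themselves (same chromosome, same strand, distance above 300 kb) come from the caption.

```python
from dataclasses import dataclass

# Distance threshold from the caption: 300 kb.
COLINEAR_MIN_DISTANCE = 300_000


@dataclass(frozen=True)
class Breakpoint:
    """Hypothetical representation of one side of a fusion."""
    chrom: str
    pos: int
    strand: str  # "+" or "-"


def is_colinear(a: Breakpoint, b: Breakpoint,
                min_distance: int = COLINEAR_MIN_DISTANCE) -> bool:
    """True when both sides share chromosome and strand and lie
    more than min_distance apart; anything else is non-colinear."""
    return (a.chrom == b.chrom
            and a.strand == b.strand
            and abs(a.pos - b.pos) > min_distance)


def fusion_category(a: Breakpoint, b: Breakpoint) -> str:
    """Label a fusion with one of the two categories used in the figure."""
    return "colinear" if is_colinear(a, b) else "non-colinear"
```

For example, two breakpoints on the `+` strand of `chr1` that lie 500 kb apart are labelled `colinear`, while the same pair on opposite strands, or a pair on two different chromosomes, is labelled `non-colinear`.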


Cited by

Challenges and best practices in omics benchmarking.
Brooks TG, Lahens NF, Mrčela A, Grant GR. Nat Rev Genet. 2024 May;25(5):326-339. doi: 10.1038/s41576-023-00679-6. Epub 2024 Jan 12. PMID: 38216661. Review.

Fusion InPipe, an integrative pipeline for gene fusion detection from RNA-seq data in acute pediatric leukemia.
Vicente-Garcés C, Maynou J, Fernández G, Esperanza-Cebollada E, Torrebadell M, Català A, Rives S, Camós M, Vega-García N. Front Mol Biosci. 2023 Jun 9;10:1141310. doi: 10.3389/fmolb.2023.1141310. eCollection 2023. PMID: 37363396. Free PMC article.

Mutation-Simulator: fine-grained simulation of random mutations in any genome.
Kühl MA, Stich B, Ries DC. Bioinformatics. 2021 May 1;37(4):568-569. doi: 10.1093/bioinformatics/btaa716. PMID: 32780803. Free PMC article.

BEERS2: RNA-Seq simulation through high fidelity in silico modeling.
Brooks TG, Lahens NF, Mrčela A, Sarantopoulou D, Nayak S, Naik A, Sengupta S, Choi PS, Grant GR. Brief Bioinform. 2024 Mar 27;25(3):bbae164. doi: 10.1093/bib/bbae164. PMID: 38605641. Free PMC article.

References

1. Byron SA, Van Keuren-Jensen KR, Engelthaler DM, Carpten JD, Craig DW. Translating RNA sequencing into clinical diagnostics: Opportunities and challenges. Nat Rev Genet. 2016;17(5):257–71. doi: 10.1038/nrg.2016.10.
2. SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol. 2014;32(9):903–14. doi: 10.1038/nbt.2957.
3. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, Szcześniak MW, Gaffney DJ, Elo LL, Zhang X, Mortazavi A. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016;17:13. doi: 10.1186/s13059-016-0881-8.
4. Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011;8(6):469–77. doi: 10.1038/nmeth.1613.
5. Seo JS, Ju YS, Lee WC, Shin JY, Lee JK, Bleazard T, Lee J, Jung YJ, Kim JO, Shin JY, Yu SB, Kim J, Lee ER, Kang CH, Park IK, Rhee H, Lee SH, Kim JI, Kang JH, Kim YT. The transcriptional landscape and mutational profile of lung adenocarcinoma. Genome Res. 2012. doi: 10.1101/gr.145144.112.
