NCBench: providing an open, reproducible, transparent, adaptable, and continuous benchmark approach for DNA-sequencing-based variant calling

Friederike Hanssen¹, Gisela Gabernet¹, Famke Bäuerle^{1,2,3,4}, Bianca Stöcker⁵, Felix Wiegand⁵, Nicholas H Smith⁶, Christian Mertes^{6,7,8}, Avirup Guha Neogi⁹, Leon Brandhoff^{9,10}, Anna Ossowski⁹, Janine Altmueller^{9,11,12}, Kerstin Becker⁹, Andreas Petzold¹³, Marc Sturm¹⁴, Tyll Stöcker¹⁵, Sugirthan Sivalingam¹⁶, Fabian Brand¹⁷, Axel Schmidt¹⁸, Andreas Buness¹⁹, Alexander J Probst²⁰, Susanne Motameny^{9,10}, Johannes Köster^{5,21}

Affiliations

¹ Quantitative Biology Center, Eberhard Karls University Tübingen, Tübingen, Germany.
² M3 Research Center, University Hospital, Tübingen, Germany.
³ Institute for Translational Bioinformatics, University Medical Center, Tübingen, Germany.
⁴ Institute for Bioinformatics and Medical Informatics (IBMI), Eberhard-Karls University of Tübingen, Tübingen, Germany.
⁵ Bioinformatics and Computational Oncology, Institute for Artificial Intelligence in Medicine (IKIM), University Medicine Essen, University of Duisburg-Essen, Essen, Germany.
⁶ TUM School of Computation, Information and Technology, Technical University of Munich, Munich, Germany.
⁷ Munich Data Science Institute, Technical University of Munich, Munich, Germany.
⁸ Institute of Human Genetics, Klinikum rechts der Isar, School of Medicine, Technical University of Munich, Munich, Germany.
⁹ Cologne Center for Genomics, University of Cologne, Cologne, Germany.
¹⁰ West German Genome Center - Cologne, University of Cologne, Cologne, Germany.
¹¹ Core Facility Genomics, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany.
¹² Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Berlin, Germany.
¹³ DRESDEN-concept Genome Center, TUD Dresden University of Technology, Dresden, Germany.
¹⁴ Institute of Medical Genetics and Applied Genomics, University Hospital Tuebingen, Tübingen, Germany.
¹⁵ Institute of Crop Science and Resource Conservation, University of Bonn, Bonn, Germany.
¹⁶ Institute of Human Genetics, Medical Faculty and University Hospital Düsseldorf, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany.
¹⁷ Institute for Genomic Statistics and Bioinformatics, Medical Faculty, University of Bonn, Bonn, Germany.
¹⁸ Institute of Human Genetics, University Hospital of Bonn, Bonn, Germany.
¹⁹ Core Unit for Bioinformatics Analysis, University Hospital Bonn, Bonn, Germany.
²⁰ Environmental Metagenomics, Research Center One Health Ruhr, University Alliance Ruhr, Faculty of Chemistry, University of Duisburg-Essen, Essen, Germany.
²¹ German Cancer Consortium, Essen, Germany.

PMID: 39345270
PMCID: PMC11428021
DOI: 10.12688/f1000research.140344.2

Friederike Hanssen et al. F1000Res. 2024 Sep 12;12:1125. doi: 10.12688/f1000research.140344.2. eCollection 2023.

Abstract

We present the results of the human genomic small variant calling benchmarking initiative of the German Research Foundation (DFG) funded Next Generation Sequencing Competence Network (NGS-CN) and the German Human Genome-Phenome Archive (GHGA). In this effort, we developed NCBench, a continuous benchmarking platform for evaluating small genomic variant callsets in terms of recall, precision, and false positive/negative error patterns. NCBench is implemented as a continuously re-evaluated open-source repository. We show that it is possible to rely entirely on free public infrastructure (GitHub, GitHub Actions, Zenodo) in combination with established open-source tools. NCBench is agnostic to the dataset used and can evaluate an arbitrary number of callsets, reporting the results visually and interactively. We used NCBench to evaluate over 40 callsets generated by the variant calling pipelines available in the participating groups, run on three exome datasets from different enrichment kits and at different coverages. While all pipelines achieve high overall quality, subtle systematic differences between callers and datasets exist and are made apparent by NCBench. These insights are useful for improving existing pipelines and developing new workflows. NCBench is open to the contribution of any callset. Most importantly, authors will no longer need to re-implement paper-specific variant calling benchmarks when publishing new tools or pipelines, while readers will be able to (continuously) observe the performance of tools and pipelines at the time of reading instead of at the time of writing.

Keywords: NGS; benchmarking; continuous; variant calling.
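
The recall and precision metrics named in the abstract can be illustrated with a minimal sketch. It assumes that variants are compared by exact match on chromosome, position, reference allele, and alternative allele; this is a simplification for illustration and not NCBench's actual comparison method. The `Variant` type, the `precision_recall` function, and the toy callsets are hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant key (not NCBench's data model)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def precision_recall(calls: set[Variant], truth: set[Variant]) -> tuple[float, float]:
    """Compare a callset against a truth set by exact variant identity."""
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # present in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data, invented for illustration only.
truth = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 200, "C", "T"),
    Variant("chr2", 50, "G", "GA"),
    Variant("chr2", 75, "T", "C"),
}
calls = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 200, "C", "T"),
    Variant("chr2", 50, "G", "GA"),
    Variant("chr3", 10, "A", "T"),
}

precision, recall = precision_recall(calls, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Real benchmarks typically also have to reconcile different representations of the same variant and account for genotype, which exact set matching ignores.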

Conflict of interest statement

No competing interests were disclosed.

Figures

**Figure 1. Continuous evaluation and reporting workflow.**
Upon pull requests or pushes, a GitHub Actions workflow is triggered. This downloads data, runs the Snakemake-based evaluation pipeline, creates the Snakemake report, and uploads it as an artifact. If the workflow is triggered on the main branch, its completion triggers a second GitHub Actions workflow that builds and deploys the homepage at https://ncbench.github.io.

**Figure 2. Exemplary screenshot of interactive tabular precision recall display.**
Each group of three rows displays precision and recall, together with the underlying numbers and wrongly predicted genotypes, stratified by read depth/coverage category. As provided via Datavzrd, every column can be sorted, hidden, or searched (via the buttons next to the column names). In the interactive report, callset/pipeline names appear on the far left. Here, they have been removed since results can be expected to change over time. For actual results, please see the always up-to-date interactive report at https://ncbench.github.io.
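
The stratification by coverage category described in this caption can be sketched as follows. This is a hedged illustration only: the coverage thresholds, the category names, the `(status, depth)` record format, and the function names are assumptions made for the example and do not come from NCBench.

```python
from collections import Counter

CATEGORIES = ("low", "medium", "high")


def coverage_category(depth: int) -> str:
    """Assign a read depth to a bin (thresholds are hypothetical)."""
    if depth < 10:
        return "low"
    if depth < 30:
        return "medium"
    return "high"


def stratified_rows(records):
    """Build one table row per coverage category.

    `records` is an iterable of (status, depth) pairs, where status is
    "tp", "fp", or "fn" for a single evaluated variant.
    """
    counts = {cat: Counter() for cat in CATEGORIES}
    for status, depth in records:
        counts[coverage_category(depth)][status] += 1

    rows = []
    for cat in CATEGORIES:
        tp, fp, fn = counts[cat]["tp"], counts[cat]["fp"], counts[cat]["fn"]
        rows.append({
            "coverage": cat,
            "tp": tp,
            "fp": fp,
            "fn": fn,
            # None marks a category in which the metric is undefined.
            "precision": tp / (tp + fp) if tp + fp else None,
            "recall": tp / (tp + fn) if tp + fn else None,
        })
    return rows


# Toy records, invented for illustration only.
records = [
    ("tp", 5), ("fn", 8),
    ("tp", 15), ("tp", 22), ("fp", 12),
    ("tp", 40), ("tp", 55), ("fn", 35),
]
rows = stratified_rows(records)
for row in rows:
    print(row)
```

Keeping the raw counts next to the ratios, as the figure does with its "underlying numbers", makes it easier to judge whether a low value in a sparsely populated category is meaningful.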
