Genome Biol. 2018 Nov 6;19(1):188.

doi: 10.1186/s13059-018-1539-5.

Combining accurate tumor genome simulation with crowdsourcing to benchmark somatic structural variant detection

Anna Y Lee¹, Adam D Ewing²,³, Kyle Ellrott²,⁴, Yin Hu⁵, Kathleen E Houlahan¹, J Christopher Bare⁵, Shadrielle Melijah G Espiritu¹, Vincent Huang¹, Kristen Dang⁵, Zechen Chong⁶,⁷,⁸, Cristian Caloian¹, Takafumi N Yamaguchi¹; ICGC-TCGA DREAM Somatic Mutation Calling Challenge Participants; Michael R Kellen⁵, Ken Chen⁶, Thea C Norman⁵, Stephen H Friend⁵, Justin Guinney⁵, Gustavo Stolovitzky⁹, David Haussler², Adam A Margolin¹⁰,¹¹, Joshua M Stuart¹², Paul C Boutros¹³,¹⁴,¹⁵

Collaborators

ICGC-TCGA DREAM Somatic Mutation Calling Challenge Participants:
Bret D Barnes, Inanc Birol, Xiaoyu Chen, Readman Chiu, Anthony J Cox, Li Ding, Markus H-Y Fritz, Andrey Grigoriev, Faraz Hach, Joseph K Kawash, Jan O Korbel, Semyon Kruglyak, Yang Liao, Andrew McPherson, Ka Ming Nip, Tobias Rausch, S Cenk Sahinalp, Iman Sarrafi, Christopher T Saunders, Ole Schulz-Trieglaff, Richard Shaw, Wei Shi, Sean D Smith, Lei Song, Difei Wang, Kai Ye

Affiliations

¹ Ontario Institute for Cancer Research, Toronto, Ontario, Canada.
² Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA.
³ Mater Research Institute, University of Queensland, Woolloongabba, QLD, Australia.
⁴ Computational Biology Program, Oregon Health & Science University, Portland, OR, USA.
⁵ Sage Bionetworks, Seattle, WA, USA.
⁶ Department of Bioinformatics and Computational Biology, University of Texas MD Anderson Cancer Center, Houston, TX, USA.
⁷ Department of Genetics, University of Alabama at Birmingham, Birmingham, AL, USA.
⁸ Informatics Institute, University of Alabama at Birmingham, Birmingham, AL, USA.
⁹ IBM Computational Biology Center, T.J. Watson Research Center, Yorktown Heights, NY, USA.
¹⁰ Computational Biology Program, Oregon Health & Science University, Portland, OR, USA. adam.margolin@mssm.edu.
¹¹ Sage Bionetworks, Seattle, WA, USA. adam.margolin@mssm.edu.
¹² Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA. jstuart@ucsc.edu.
¹³ Ontario Institute for Cancer Research, Toronto, Ontario, Canada. paul.boutros@oicr.on.ca.
¹⁴ Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada. paul.boutros@oicr.on.ca.
¹⁵ Department of Pharmacology and Toxicology, University of Toronto, Toronto, Ontario, Canada. paul.boutros@oicr.on.ca.

PMID: 30400818
PMCID: PMC6219177
DOI: 10.1186/s13059-018-1539-5

Anna Y Lee et al. Genome Biol. 2018.

Abstract

Background: The phenotypes of cancer cells are driven in part by somatic structural variants. Structural variants can initiate tumors, enhance their aggressiveness, and provide unique therapeutic opportunities. Whole-genome sequencing of tumors can allow exhaustive identification of the specific structural variants present in an individual cancer, facilitating both clinical diagnostics and the discovery of novel mutagenic mechanisms. A plethora of somatic structural variant detection algorithms have been created to enable these discoveries; however, there are no systematic benchmarks of them. Rigorous performance evaluation of somatic structural variant detection methods has been challenged by the lack of gold standards, extensive resource requirements, and difficulties arising from the need to share personal genomic information.

Results: To facilitate structural variant detection algorithm evaluations, we create a robust simulation framework for somatic structural variants by extending the BAMSurgeon algorithm. We then organize and enable a crowdsourced benchmarking within the ICGC-TCGA DREAM Somatic Mutation Calling Challenge (SMC-DNA). We report here the results of structural variant benchmarking on three different tumors, comprising 204 submissions from 15 teams. In addition to ranking methods, we identify characteristic error profiles of individual algorithms and general trends across them. Surprisingly, we find that ensembles of analysis pipelines do not always outperform the best individual method, indicating a need for new ways to aggregate somatic structural variant detection approaches.

Conclusions: The synthetic tumors and somatic structural variant detection leaderboards remain available as a community benchmarking resource, and BAMSurgeon is available at https://github.com/adamewing/bamsurgeon.

Keywords: Benchmarking; Cancer genomics; Crowdsourcing; Simulation; Somatic mutations; Structural variants; Whole-genome sequencing.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
BAMSurgeon simulates SVs in genome sequences. Method for adding SVs to existing BAM alignments. a Overview of SV (e.g., deletion) spike-in: Starting with an original BAM (i), a region (ii) is selected where a deletion is desired. (iii) Contigs are assembled from reads in the selected region, and the contig is rearranged by deleting the middle. The amount of contig deleted is a user-definable parameter. Read coverage is generated over the contig using wgsim to match the number of reads per base in the original BAM. Since the deletion contig is shorter than the original, fewer reads will be required to achieve the equivalent coverage. (iv) Generated read pairs include discordant pairs (i.e., paired reads that do not align to the reference genome with the expected relative orientation and inner distance) spanning the deletion and clipped reads (i.e., reads that are only partially aligned to the reference). Reads mapping to the deleted region of the contig are not included in the final BAM. b, c To test the robustness of BAMSurgeon with respect to changes in (b) aligner and (c) cell line, we compared the ranks of CREST, Delly, Manta, and novoBreak on two new tumor-normal datasets: one with an alternative aligner, NovoAlign, and the other on an alternative cell line, HCC1954 BL. Callers were scored with f = 100 bp (Additional file 1: Figure S2b); Manta retained the top position, independent of aligner and cell line. d Summary of the three in silico (IS) tumors described here. Abbreviations: DEL, deletion; DUP, duplication; INV, inversion; INS, insertion
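
The coverage bookkeeping in panel a (a shorter deletion contig needs fewer reads for the same depth) can be illustrated with a minimal sketch. It assumes uniform coverage and fixed-length reads; the function names and parameters are hypothetical and are not part of BAMSurgeon or wgsim.

```python
import math


def reads_for_coverage(contig_len: int, read_len: int, coverage: float) -> int:
    """Reads needed to cover `contig_len` bases at a mean depth of `coverage`.

    Assumes every read contributes `read_len` bases (hypothetical simplification).
    """
    if contig_len <= 0 or read_len <= 0 or coverage < 0:
        raise ValueError("lengths must be positive and coverage non-negative")
    return math.ceil(coverage * contig_len / read_len)


def reads_after_deletion(contig_len: int, deleted_len: int,
                         read_len: int, coverage: float) -> int:
    """Reads needed once `deleted_len` bases are removed from the contig."""
    if not 0 <= deleted_len < contig_len:
        raise ValueError("deleted_len must be in [0, contig_len)")
    return reads_for_coverage(contig_len - deleted_len, read_len, coverage)


if __name__ == "__main__":
    # Illustrative numbers only: a 1,000 bp contig, 100 bp reads, 30x depth.
    print(reads_for_coverage(1000, 100, 30))
    # The same contig with its middle 400 bp deleted needs fewer reads.
    print(reads_after_deletion(1000, 400, 100, 30))
```

With these illustrative inputs the rearranged contig needs fewer reads than the original one, which is the effect the caption describes.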

**Fig. 2**
Overview of the SV Calling Challenge submissions. a Precision-recall plot of IS1 submissions. Each point represents a submission, each color represents a team and the best submission from each team (top F-score) is circled. The “Standard” point corresponds to the reference point submission provided by Challenge organizers. b The F-scores of submissions on the training and testing sets are highly correlated for IS1 (Spearman’s ρ = 0.98), falling near the plotted y = x line
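
For readers unfamiliar with the metrics plotted here, the sketch below computes precision, recall, and F-score from counts of true positives, false positives, and false negatives. It assumes the F-score is the balanced harmonic mean of precision and recall (F1); the function name is hypothetical, and how the Challenge matched predicted SVs to true SVs is not covered.

```python
def precision_recall_f(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-score) from call counts.

    F-score is taken to be the harmonic mean of precision and recall
    (an assumption; the caption does not define it).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Illustrative counts only, not taken from any Challenge submission.
    p, r, f = precision_recall_f(tp=80, fp=20, fn=20)
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```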

**Fig. 3**
Performance optimization by parameterization and ensembles. a Recall, precision, and F-score of all IS1 submissions plotted by team, then submission order. Teams were ranked by the F-score of their best submission, color coding (top bar) as in Fig. 2. The “Standard” lines correspond to the reference point submission provided by Challenge organizers. b For each tumor, the improvement in F-score from the initial (“naive”) to the best (“optimized”) submissions of each team. Darker shades of blue indicate greater improvement. c For each tumor, team rankings based on their naive or optimized submissions. Larger dot sizes indicate better ranks by F-score. b, c An “X” indicates that the team did not make a submission for the specific tumor (or changed team name). d Recall, precision, and F-score of ensembles versus individual submissions for IS1. At the kth rank, the triangles indicate performance of the ensemble of the top k submissions, and the circles indicate performance of the kth ranked submission. The ensemble analysis focused on the best submission from each team
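
One way to picture an ensemble of the top k submissions (panel d) is a simple majority vote over their call sets. The caption does not state how calls were aggregated, so the voting rule, the function name, and the representation of a call as a hashable label are all assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Iterable, Sequence


def majority_vote_ensemble(ranked_call_sets: Sequence[Iterable[Hashable]],
                           k: int) -> set:
    """Keep calls reported by more than half of the top-k ranked submissions.

    `ranked_call_sets` is ordered best-first; each element is one
    submission's collection of calls. Majority voting is an assumed
    aggregation rule, not necessarily the one used in the paper.
    """
    top = [set(calls) for calls in ranked_call_sets[:k]]
    if not top:
        return set()
    votes = Counter(call for calls in top for call in calls)
    return {call for call, n in votes.items() if n > len(top) / 2}


if __name__ == "__main__":
    # Toy call sets with made-up labels, ordered by rank.
    submissions = [{"A", "B"}, {"A", "C"}, {"A", "B", "D"}]
    for k in (1, 2, 3):
        print(k, sorted(majority_vote_ensemble(submissions, k)))
```

Under a rule like this, adding lower-ranked submissions can remove correct calls made only by the best method, which is one way an ensemble could fail to outperform the best individual submission.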

**Fig. 4**
Characteristics of prediction errors. Random forests assess the importance of 16 sequence-based variables for each caller’s FN (a, c, e, g, i) and FP (b, d, f, h, j) breakpoints. Each panel shows variable importance on the left, where each row represents the best performing set of predictions by the given team/caller (on the given in silico tumor), and each column represents the indicated variable. Dot size reflects variable importance, i.e., the mean change in accuracy caused by removing the variable from the model (generated to predict erroneous breakpoints). Color reflects the directional effect of each variable (red and blue for greater and lower variable values, respectively, associated with erroneous breakpoints; black for categorical variables or insignificant directional associations, two-sided Mann-Whitney P > 0.01). Background shading indicates the accuracy of the model (see the color bar). Variable importance for FN and FP breakpoints in each of the three tumors is shown for the following SV callers: CREST (a, b), Delly (c, d), and Manta (e, f). Manta only called two FPs in IS1; thus, variable importance for FP breakpoints could not be computed (indicated by Xs in the plot). Variable importance for FN and FP breakpoints in IS2 (g, h) and IS3 (i, j) is shown for each team. In the right plot (g–j), the first four columns indicate usage of the indicated algorithmic approaches by each team, and the last column indicates the aligner used. Gray indicates that algorithmic approaches and aligner are unknown for the given team. Abbreviations: Algm, algorithm; SNP, single-nucleotide polymorphism; INDEL, short insertion or deletion

Grants and funding

U24 CA143858/CA/NCI NIH HHS/United States
