Nat Biotechnol. 2021 Sep;39(9):1151-1160.

doi: 10.1038/s41587-021-00993-6. Epub 2021 Sep 9.

Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing

Li Tai Fang^#¹, Bin Zhu^#², Yongmei Zhao^#³, Wanqiu Chen⁴, Zhaowei Yang^{4,5}, Liz Kerrigan⁶, Kurt Langenbach⁶, Maryellen de Mars⁶, Charles Lu⁷, Kenneth Idler⁷, Howard Jacob⁷, Yuanting Zheng⁸, Luyao Ren⁸, Ying Yu⁸, Erich Jaeger⁹, Gary P Schroth⁹, Ogan D Abaan⁹, Keyur Talsania³, Justin Lack³, Tsai-Wei Shen³, Zhong Chen⁴, Seta Stanbouly⁴, Bao Tran¹⁰, Jyoti Shetty¹⁰, Yuliya Kriga¹⁰, Daoud Meerzaman¹¹, Cu Nguyen¹¹, Virginie Petitjean¹², Marc Sultan¹², Margaret Cam¹³, Monika Mehta¹⁰, Tiffany Hung¹⁴, Eric Peters¹⁴, Rasika Kalamegham¹⁴, Sayed Mohammad Ebrahim Sahraeian¹, Marghoob Mohiyuddin¹, Yunfei Guo¹, Lijing Yao¹, Lei Song², Hugo Y K Lam¹, Jiri Drabek^{15,16}, Petr Vojta^{15,16}, Roberta Maestro^{16,17}, Daniela Gasparotto^{16,17}, Sulev Kõks^{16,18,19}, Ene Reimann^{16,19}, Andreas Scherer^{16,20}, Jessica Nordlund^{16,21}, Ulrika Liljedahl^{16,21}, Roderick V Jensen²², Mehdi Pirooznia²³, Zhipan Li²⁴, Chunlin Xiao²⁵, Stephen T Sherry²⁵, Rebecca Kusko²⁶, Malcolm Moos²⁷, Eric Donaldson²⁸, Zivana Tezak²⁹, Baitang Ning³⁰, Weida Tong³⁰, Jing Li⁵, Penelope Duerken-Hughes³¹, Claudia Catalanotti³², Shamoni Maheshwari³², Joe Shuga³², Winnie S Liang³³, Jonathan Keats³³, Jonathan Adkins³³, Erica Tassone³³, Victoria Zismann³³, Timothy McDaniel³³, Jeffrey Trent³³, Jonathan Foox³⁴, Daniel Butler³⁴, Christopher E Mason³⁴, Huixiao Hong³⁵, Leming Shi³⁶, Charles Wang^{37,38}, Wenming Xiao³⁹; Somatic Mutation Working Group of Sequencing Quality Control Phase II Consortium

Collaborators

Somatic Mutation Working Group of Sequencing Quality Control Phase II Consortium:
Ogan D Abaan, Meredith Ashby, Ozan Aygun, Xiaopeng Bian, Thomas M Blomquist, Pierre Bushel, Margaret Cam, Fabien Campagne, Qingrong Chen, Tao Chen, Xin Chen, Yun-Ching Chen, Han-Yu Chuang, Maryellen de Mars, Youping Deng, Eric Donaldson, Jiri Drabek, Ben Ernest, Jonathan Foox, Don Freed, Paul Giresi, Ping Gong, Ana Granat, Meijian Guan, Yan Guo, Christos Hatzis, Susan Hester, Jennifer A Hipp, Huixiao Hong, Tiffany Hung, Kenneth Idler, Howard Jacob, Erich Jaeger, Parthav Jailwala, Roderick V Jensen, Wendell Jones, Rasika Kalamegham, Bindu Kanakamedala, Jonathan Keats, Liz Kerrigan, Sulev Kõks, Yuliya Kriga, Rebecca Kusko, Samir Lababidi, Kurt Langenbach, Eunice Lee, Jian-Liang Li, You Li, Zhipan Li, Sharon Xueying Liang, Xuelu Liu, Charles Lu, Roberta Maestro, Christopher E Mason, Tim McDaniel, Timothy Mercer, Daoud Meerzaman, Urvashi Mehra, Corey Miles, Chris Miller, Malcolm Moos, Ali Moshrefi, Aparna Natarajan, Baitang Ning, Jessica Nordlund, Cu Nguyen, Jai Pandey, Brian N Papas, Anand Pathak, Eric Peters, Virginie Petitjean, Mehdi Pirooznia, Maurizio Polano, Arati Raziuddin, Wolfgang Resch, Luyao Ren, Andreas Scherer, Gary P Schroth, Fayaz Seifuddin, Steve T Sherry, Jyoti Shetty, Leming Shi, Tieliu Shi, Louis M Staudt, Marc Sultan, Zivana Tezak, Weida Tong, Bao Tran, Jeff Trent, Tiffany Truong, Petr Vojta, Cristobal Juan Vera, Ashley Walton, Charles Wang, Jing Wang, Jingya Wang, Mingyi Wang, James C Willey, Leihong Wu, Chunlin Xiao, Wenming Xiao, Xiaojian Xu, Chunhua Yan, Gokhan Yavas, Ying Yu, Chaoyang Zhang, Yuanting Zheng

Affiliations

¹ Bioinformatics Research & Early Development, Roche Sequencing Solutions Inc., Belmont, CA, USA.
² Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA.
³ Advanced Biomedical and Computational Sciences, Biomedical Informatics and Data Science Directorate, Frederick National Laboratory for Cancer Research, Frederick, MD, USA.
⁴ Center for Genomics, Loma Linda University School of Medicine, Loma Linda, CA, USA.
⁵ Department of Allergy and Clinical Immunology, State Key Laboratory of Respiratory Disease, National Clinical Research Center for Respiratory Disease, Guangzhou Institute of Respiratory Health, First Affiliated Hospital of Guangzhou Medical University, Guangzhou, China.
⁶ ATCC (American Type Culture Collection), Manassas, VA, USA.
⁷ Computational Genomics, Genomics Research Center (GRC), AbbVie, North Chicago, IL, USA.
⁸ State Key Laboratory of Genetic Engineering, Human Phenome Institute, School of Life Sciences and Shanghai Cancer Center, Fudan University, Shanghai, China.
⁹ Illumina Inc., Foster City, CA, USA.
¹⁰ Sequencing Facility, Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, Frederick, MD, USA.
¹¹ Computational Genomics and Bioinformatics Branch, Center for Biomedical Informatics and Information Technology (CBIIT), National Cancer Institute, Rockville, MD, USA.
¹² Biomarker Development, Novartis Institutes for Biomedical Research, Basel, Switzerland.
¹³ CCR Collaborative Bioinformatics Resource (CCBR), Office of Science and Technology Resources, Center for Cancer Research, Bethesda, MD, USA.
¹⁴ Genentech, a member of the Roche group, South San Francisco, CA, USA.
¹⁵ IMTM, Faculty of Medicine and Dentistry, Palacky University, Olomouc, Czech Republic.
¹⁶ European Infrastructure for Translational Medicine, Amsterdam, the Netherlands.
¹⁷ Centro di Riferimento Oncologico di Aviano (CRO) IRCCS, National Cancer Institute, Unit of Oncogenetics and Functional Oncogenomics, Aviano, Italy.
¹⁸ Perron Institute for Neurological and Translational Science, Nedlands, Western Australia, Australia.
¹⁹ Estonian Genome Centre, Institute of Genomics, University of Tartu, Tartu, Estonia.
²⁰ Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland.
²¹ Department of Medical Sciences, Molecular Medicine and Science for Life Laboratory, Uppsala University, Uppsala, Sweden.
²² Department of Biological Sciences, Virginia Tech, Blacksburg, VA, USA.
²³ Bioinformatics and Computational Biology Core, National Heart Lung and Blood Institute, National Institutes of Health, Bethesda, MD, USA.
²⁴ Sentieon Inc., Mountain View, CA, USA.
²⁵ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
²⁶ Immuneering Corporation, Boston, MA, USA.
²⁷ Center for Biologics Evaluation and Research, FDA, Silver Spring, MD, USA.
²⁸ Center for Drug Evaluation and Research, FDA, Silver Spring, MD, USA.
²⁹ Center for Devices and Radiological Health, FDA, Silver Spring, MD, USA.
³⁰ National Center for Toxicological Research, FDA, Jefferson, AR, USA.
³¹ Department of Basic Science, Loma Linda University School of Medicine, Loma Linda, CA, USA.
³² 10x Genomics, Pleasanton, CA, USA.
³³ Translational Genomics Research Institute, Phoenix, AZ, USA.
³⁴ Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA.
³⁵ National Center for Toxicological Research, FDA, Jefferson, AR, USA. huixiao.hong@fda.hhs.gov.
³⁶ State Key Laboratory of Genetic Engineering, Human Phenome Institute, School of Life Sciences and Shanghai Cancer Center, Fudan University, Shanghai, China. lemingshi@fudan.edu.cn.
³⁷ Center for Genomics, Loma Linda University School of Medicine, Loma Linda, CA, USA. oxwang@gmail.com.
³⁸ Department of Basic Science, Loma Linda University School of Medicine, Loma Linda, CA, USA. oxwang@gmail.com.
³⁹ Center for Devices and Radiological Health, FDA, Silver Spring, MD, USA. wenming.xiao@fda.hhs.gov.

^# Contributed equally.

PMID: 34504347
PMCID: PMC8532138
DOI: 10.1038/s41587-021-00993-6

Li Tai Fang et al. Nat Biotechnol. 2021 Sep.

Abstract

The lack of samples for generating standardized DNA datasets for setting up a sequencing pipeline or benchmarking the performance of different algorithms limits the implementation and uptake of cancer genomics. Here, we describe reference call sets obtained from paired tumor-normal genomic DNA (gDNA) samples derived from a breast cancer cell line (which is highly heterogeneous, with an aneuploid genome, and enriched in somatic alterations) and a matched lymphoblastoid cell line. We partially validated both somatic mutations and germline variants in these call sets via whole-exome sequencing (WES) with different sequencing platforms and targeted sequencing with >2,000-fold coverage, spanning 82% of genomic regions with high confidence. Although the gDNA reference samples are not representative of primary cancer cells from a clinical sample, when setting up a sequencing pipeline, they not only minimize potential biases from technologies, assays and informatics but also provide a unique resource for benchmarking 'tumor-only' or 'matched tumor-normal' analyses.

Conflict of interest statement

Competing interests

L.T.F., S.M.E.S., M. Mohiyuddin, Y.G., L.Y. and H.L. are employees of Roche Sequencing Solutions Inc. L.K., K.L. and M. Mars are employees of ATCC, which provides cell lines and derivative materials. E.J., G.P.S. and O.D.A. are employees of Illumina Inc. V.P. and M.S. are employees of Novartis Institutes for Biomedical Research. T.H., E.P. and R. Kalamegham are employees of Genentech (a member of the Roche group). Z.L. is an employee of Sentieon Inc. R. Kusko is an employee of Immuneering Corp. C.C., S.M. and J.S. are employees of 10x Genomics. All other authors declare no competing interests.

Figures

**Extended Data Fig. 1 |. 3D scatter plot shows the consistency of SomaticSeq and NeuSomatic classification of somatic variant calls.**
3D scatter plot for number of PASS classifications by SomaticSeq, NeuSomatic-E, and VAF for **(a)** SNV (R = 0.997) and **(b)** indel calls (R = 0.925). **(c)** The subset of SNV calls that were re-sequenced by AmpliSeq. Solid markers are deemed ‘validated.’ Open markers are deemed ‘not validated.’ Stars/crosses are deemed uninterpretable. HighConf calls generally have many PASS calls and a full range of VAF. MedConf have fewer PASS calls and tend to have lower VAF. Unclassified calls have a full range of VAF, which means their somatic signals were poor-quality.

**Extended Data Fig. 2 |. Genome coverage and high-confidence regions.**
**(a)** Genome coverage by reads from three technologies. Inner track: PacBio. Middle track: 10X Genomics. Outer track: Illumina HiSeq. **(b)** Genome region coverage by short reads in comparison to NA12878. Outer black track: gene density plot. Middle orange track: NA12878. Inner blue track: the callable regions in HCC1395.

**Extended Data Fig. 3 |. Validation of somatic indels.**
**(a)** Validation of indels by AmpliSeq. R = 0.989 for HighConf calls. **(b)** Validation of indels by WES with Ion Torrent. R = 0.767 for HighConf calls. **(c)** Validation of indels by WES with HiSeq. R = 0.990 for HighConf calls. **(d)** Histogram of the lengths of the somatic indels in the reference call set. The dashed lines on the diagonal for (a), (b), and (c) are the 95% binomial confidence interval of the observed VAF given the actual VAF, calculated based on depths of 2000X for AmpliSeq, 34X for Ion Torrent, and 100X for WES, respectively.

**Extended Data Fig. 4 |. Validation of germline indels.**
Germline indel scatter plots comparing VAF super set to confirmed VAF. **(a)** VAF scatter plot of germline indels by WGS super set and AmpliSeq. **(b)** VAF scatter plot of germline indels by truth set and Ion Torrent WES.

**Extended Data Fig. 5 |. Karyotyping of HCC1395 and HCC1395BL.**
**(a)** Karyotype of HCC1395. Cytogenetic analysis was performed on ten G-Banded metaphase cells from HCC1395. Analysis pointed to a hypertetraploid line with chromosome counts ranging from 64–79 and gain of 38–63 unidentifiable marker chromosomes. **(b)** Karyotype of HCC1395BL. Cytogenetic analysis was performed on ten G-banded metaphase cells from HCC1395BL. All ten cells showed loss of a chrX and an unbalanced whole arm translocation between the long-arm of chr6 at band q10 and the short-arm of chr16 at band p10. This resulted in a net loss of one copy of the short-arm of chr6 and loss of one copy of the long-arm of chr16. The abnormal chromosome could be placed in either a chr6 or chr16 locus as we were unable to determine if the centromere belongs to chr6 or chr16 (inset figure).

**Extended Data Fig. 6 |. Cytogenetic analysis with Affymetrix Cytoscan HD microarray.**
**(a)** Cytogenetic view of HCC1395. **(b)** Cytogenetic view of HCC1395BL. The losses of chr6p, chr16q, and chrX were confirmed.

**Extended Data Fig. 7 |. Variant allele frequencies across the genome.**
**(a)** VAF of truth set germline SNVs in HCC1395BL. The copy numbers of HCC1395BL were predicted by Affymetrix Cytoscan HD microarray. **(b)** VAF of the truth set germline SNV positions (discovered in HCC1395BL) in HCC1395. **(c)** VAF of the truth set somatic SNVs in HCC1395. The copy numbers of HCC1395 were predicted by ascatNgs.

**Extended Data Fig. 8 |. Variant allele frequencies of somatic mutations.**
**(a)** VAFs of somatic SNVs and indels in the reference call sets. **(b)** VAFs of reference SNVs in different copy number states as predicted by ascatNgs.

**Extended Data Fig. 9 |. Tumor sample HCC1395 CNV and Clonality Analysis.**
**(a)** Clonality analysis from WES data using SuperFreq for tumor cell line HCC1395. The clonality of each somatic SNV was calculated based on the VAF, accounting for local copy number. The SNVs and CNAs were evaluated with hierarchical clustering based on the clonality and uncertainty across replicates for HCC1395. The river plot shows the relative distribution of multiple subclones in HCC1395. The main cancer clone (blue) and the two subclones (red and green) appeared early in clonal evolution, while a further subclone (orange) and its descendant (peak) appeared late in clonal evolution. **(b)** The main- and sub-clonal somatic copy number profiles using subHMM³⁸ from the Illumina WGS data set. Main-clonal genotype: upper panel; sub-clonal genotype: middle panel; sub-clonal proportion: bottom bar plot. Each colored block represents the genotype of somatic copy number alterations (SCNAs) in the corresponding position of the chromosome. The chromosomes are separated by vertical dashed lines. Genotype of SCNAs: deletion (DEL), homozygous deletion (HOMD), hemizygous deletion loss of heterozygosity (DLOH), copy neutral loss of heterozygosity (NLOH), diploid heterozygous (HET), gain of one allele (GAIN), amplified loss of heterozygosity (ALOH), allele-specific copy number amplification (ASCNA), balanced copy number amplification (BCNA), and unbalanced copy number amplification (UBCNA).
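The statement in panel (a) that clonality is derived from the VAF while accounting for local copy number can be illustrated with a commonly used textbook relation between VAF and cancer cell fraction. This is only a sketch under stated assumptions: it is not SuperFreq's actual algorithm, and the function name `cancer_cell_fraction`, its parameters, and the default purity of 1.0 (a pure cell line) are hypothetical choices made for illustration.

```python
def cancer_cell_fraction(vaf, total_copy_number, multiplicity=1,
                         purity=1.0, normal_copy_number=2):
    """Estimate the fraction of tumor cells carrying a mutation.

    Assumed relation (not taken from the paper):
        VAF = purity * multiplicity * CCF / average copies per cell
    where the average is taken over tumor and normal cells.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    # Average number of copies of the locus per cell in the sample.
    average_copies = (purity * total_copy_number
                      + (1.0 - purity) * normal_copy_number)
    ccf = vaf * average_copies / (purity * multiplicity)
    # Sampling noise can push the estimate above 1; cap it there.
    return min(ccf, 1.0)


if __name__ == "__main__":
    # A single mutated copy in a four-copy region of a pure sample.
    print(cancer_cell_fraction(vaf=0.25, total_copy_number=4))
    print(cancer_cell_fraction(vaf=0.125, total_copy_number=4))
```

Under this assumed relation, the same VAF implies very different clonality in regions of different copy number, which is why a VAF-only view of a hypertetraploid line such as HCC1395 would be misleading.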

**Extended Data Fig. 10 |. Number of somatic mutations detected in HCC1395 and 560 triple negative and non-triple negative breast cancers from previous literature.**

**Fig. 1 |. Schematic of the bioinformatics pipelines used to define the confidence levels of the somatic mutation call set (see Methods for details).**
Twenty-one sequencing replicates for the tumor (HCC1395) and normal (HCC1395BL) gDNA samples were performed at six sequencing centers. They were grouped into five groups shown as colored squares on the far left. The sequencing platforms used were HiSeq and NovaSeq, and the sequencing experiments were performed at Illumina (IL), Fudan University (FD), Novartis (NV), European Infrastructure for Translational Medicine (EA), National Cancer Institute (NC) and Loma Linda University (LL). Each of the 21 sequencing datasets was aligned with three aligners to create a total of 63 pairs of tumor–normal BAM files. For each of the 63 tumor–normal BAM files, we ran six somatic mutation callers (MuTect2, SomaticSniper, VarDict, MuSE, Strelka2 and TNscope) followed by SomaticSeq and NeuSomatic machine learning classifiers using models built specifically for these datasets. An initial tier was assigned to each variant call based on how consistently the variant was classified as a somatic mutation across different sequencing centers and aligners. Then, to rescue low-VAF variants into the reference call set, we ran the same variant-calling pipeline on two higher-depth datasets: an independent 350× replicate sequenced on a HiSeq at Genentech (GT) and a 400× replicate created by combining nine NovaSeq replicates at Illumina (IL). High-confidence calls from those two datasets were used to promote less-reproducible low-VAF variants from the 21 sequencing replicates into the reference call set. Then, we combined all of our short-read sequencing data into a 1,500× tumor–normal pair and used it to rescue additional low-VAF variants into the reference call set. Finally, we cross-referenced our Illumina short-read-based high-confidence calls with PacBio long-read sequencing data and removed a small number of calls that were inconsistent between the two. The high-confidence calls (labeled PASS) were considered ‘true’ somatic mutations, and genomic regions with low-confidence calls were removed from the high-confidence regions. Chr, chromosome; Pos, position.
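As a rough illustration of the bookkeeping in Fig. 1, the sketch below enumerates the 63 tumor–normal BAM pairs (21 replicates × 3 aligners) and assigns a tier from the fraction of pairs in which a variant was classified PASS. The aligner names are placeholders, and the `assign_tier` helper with its cutoffs is hypothetical; the actual tiering rules are given in the paper's Methods and are not reproduced here.

```python
from itertools import product

N_REPLICATES = 21  # sequencing replicates, as stated in the caption
ALIGNERS = ("aligner_1", "aligner_2", "aligner_3")  # placeholder names
CALLERS = ("MuTect2", "SomaticSniper", "VarDict",
           "MuSE", "Strelka2", "TNscope")

# One tumor-normal BAM pair per (replicate, aligner) combination.
bam_pairs = list(product(range(1, N_REPLICATES + 1), ALIGNERS))

# Each BAM pair is processed by every caller.
caller_runs = len(bam_pairs) * len(CALLERS)


def assign_tier(pass_flags, high=0.9, medium=0.5, low=0.1):
    """Map per-BAM-pair PASS flags to a confidence label.

    The cutoffs are illustrative assumptions, not the study's values.
    """
    if not pass_flags:
        raise ValueError("need at least one classification")
    fraction = sum(bool(flag) for flag in pass_flags) / len(pass_flags)
    if fraction >= high:
        return "HighConf"
    if fraction >= medium:
        return "MedConf"
    if fraction >= low:
        return "LowConf"
    return "Unclassified"


if __name__ == "__main__":
    print(len(bam_pairs), "BAM pairs;", caller_runs, "caller runs")
    # A variant classified PASS in 40 of the 63 BAM pairs.
    print(assign_tier([True] * 40 + [False] * 23))
```

A single consistency fraction is the simplest possible stand-in; the caption also describes rescue steps using higher-depth data and a PacBio cross-check, which this sketch does not model.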

**Fig. 2 |. Definition and validation of the somatic mutation reference call set.**
a, Breakdown of the somatic variant calls within the consensus callable regions based on the four labels HighConf, MedConf, LowConf and Unclassified. Variant calls labeled HighConf and MedConf are grouped into the reference call set; genomic positions with LowConf and Unclassified calls are removed from the high-confidence regions. b, Histogram of VAFs of the somatic variant calls. c, Average tumor purity fitting scores with 95% confidence intervals for the VAF of each SNV across the four different confidence levels versus the observed VAF in the tumor–normal titration series. The formula for fitting scores is described in equation 1 (see Methods for details). d, Scatter plot of VAFs observed in 21 WGS datasets versus an AmpliSeq targeted sequencing dataset. Solid shapes represent variants that were validated. Open shapes represent variants that were not validated. Sticks represent uninterpretable validation data. The diagonal dashed lines represent the 95% binomial confidence interval of the observed VAF given the actual VAF calculated based on 2,000× depth for AmpliSeq. The figure shows a very high correlation between VAFs estimated from the WGS data and AmpliSeq data for HighConf calls (Pearson’s R = 0.982). Many Unclassified data points lie at the bottom, implying that these calls were not real mutations, despite the large number of apparent variant-supporting reads in the all-inclusive set data; x axis, VAFs calculated from the all-inclusive set; y axis, VAFs calculated from the AmpliSeq data. e, Scatter plot of VAFs observed in WGS datasets versus Ion Torrent WES. The 95% binomial confidence intervals were calculated based on 34× depth for Ion Torrent. Pearson’s R = 0.930 for HighConf calls. f, Scatter plot of VAFs observed in WGS datasets versus 12 repeats of WES on the HiSeq platform; y axis, median VAFs calculated based on 12 HiSeq WES replicates. The 95% binomial confidence intervals were calculated based on 150× depth for HiSeq WES. Pearson’s R = 0.997 for HighConf calls. In d–f, the colors indicate the confidence level of the variant calls, whereas the shapes indicate their validation status.
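The 95% binomial confidence bands mentioned for panels d–f can be sketched as follows: given an assumed true VAF and a sequencing depth, the observed variant read count is modeled as a binomial draw, and the band is the central 95% of that distribution. The function names are hypothetical, and the equal-tailed exact interval used here is an assumption, since the caption does not say which binomial interval was computed.

```python
import math


def binomial_pmf(k, n, p):
    """Binomial probability mass, computed in log space so that
    depths in the thousands do not overflow or underflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
               - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def vaf_interval(true_vaf, depth, level=0.95):
    """Equal-tailed interval for the observed VAF at a given depth."""
    if not 0.0 < true_vaf < 1.0:
        raise ValueError("true_vaf must be strictly between 0 and 1")
    alpha = 1.0 - level
    lower_k, upper_k = 0, depth
    lower_found = False
    cumulative = 0.0
    for k in range(depth + 1):
        cumulative += binomial_pmf(k, depth, true_vaf)
        if not lower_found and cumulative >= alpha / 2:
            lower_k, lower_found = k, True
        if cumulative >= 1.0 - alpha / 2:
            upper_k = k
            break
    return lower_k / depth, upper_k / depth


if __name__ == "__main__":
    # Depths quoted in the caption: Ion Torrent, HiSeq WES, AmpliSeq.
    for depth in (34, 150, 2000):
        low, high = vaf_interval(0.5, depth)
        print(f"depth {depth}: observed VAF expected in [{low:.3f}, {high:.3f}]")
```

The band narrows sharply as depth grows, which is consistent with the caption's use of a much tighter band for the 2,000× AmpliSeq data than for the 34× Ion Torrent data.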

**Fig. 3 |. Initial definition and validation of germline variants.**
a, Histogram of SCP for germline variants identified by four callers from 63 BAM files. b, VAF scatter plot of germline SNVs by the call set and AmpliSeq data. Pearson’s R = 0.986 for SCP = 1 calls. c, VAF scatter plot of germline SNVs by the call set and Ion Torrent WES. Pearson’s R = 0.758 for SCP = 1 calls. In b and c, colors indicate the calling probability of the germline variants, whereas shapes indicate their validation status.

**Fig. 4 |. Clonality analysis of the HCC1395 cell line using bulk DNA and DNA from single cells.**
a, The inferred tumor phylogenetic tree. The subclone S1 represents the most recent common ancestor (MRCA) of all tumor cells, and S2 to S10 represent the subclones with various cancer cell fractions (for example, S2: 60%). The edges represent the evolutionary relationships between subclones. Subclones S3 and S6 are not shown given that their cancer cell fractions were less than 10%. Most point mutations are in driver genes (labeled beside the edges) present in the MRCA. Using the 10x Genomics Single Cell CNV Solution, integer-scaled CNA profiles were obtained across the genomes of 638 HCC1395BL cells (b) and 1,270 HCC1395 cells (c). Noisy cells and cells in the S phase of the cell cycle were removed. The complete linkage method was used for hierarchical clustering. Each row represents a cell being sequenced; similar cells were clustered together based on CNVs. Chromosome-scale gains are in orange and losses are in blue in the heat maps.
