Sci Data. 2021 Nov 9;8(1):296.

doi: 10.1038/s41597-021-01077-5.

Whole genome and exome sequencing reference datasets from a multi-center and cross-platform benchmark study

Yongmei Zhao¹, Li Tai Fang², Tsai-Wei Shen³, Sulbha Choudhari³, Keyur Talsania³, Xiongfong Chen³, Jyoti Shetty⁴, Yuliya Kriga⁴, Bao Tran⁴, Bin Zhu⁵, Zhong Chen⁶, Wanqiu Chen⁶, Charles Wang⁶, Erich Jaeger⁷, Daoud Meerzaman⁸, Charles Lu⁹, Kenneth Idler⁹, Luyao Ren¹⁰, Yuanting Zheng¹⁰, Leming Shi¹⁰, Virginie Petitjean¹¹, Marc Sultan¹¹, Tiffany Hung¹², Eric Peters¹², Jiri Drabek¹³ ¹⁴, Petr Vojta¹³ ¹⁴, Roberta Maestro¹⁴ ¹⁵, Daniela Gasparotto¹⁴ ¹⁵, Sulev Kõks¹⁴ ¹⁶ ¹⁷, Ene Reimann¹⁴ ¹⁸, Andreas Scherer¹⁴ ¹⁹, Jessica Nordlund¹⁴ ²⁰, Ulrika Liljedahl¹⁴ ²⁰, Jonathan Foox²¹, Christopher E Mason²¹, Chunlin Xiao²², Huixiao Hong²³, Wenming Xiao²⁴

Affiliations

¹ Advanced Biomedical and Computational Sciences, Biomedical Informatics and Data Science Directorate, Frederick National Laboratory for Cancer Research, Frederick, MD, USA. Yongmei.Zhao@nih.gov.
² Bioinformatics Research & Early Development, Roche Sequencing Solutions Inc., Belmont, CA, USA.
³ Advanced Biomedical and Computational Sciences, Biomedical Informatics and Data Science Directorate, Frederick National Laboratory for Cancer Research, Frederick, MD, USA.
⁴ Sequencing Facility, Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, Frederick, MD, USA.
⁵ Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA.
⁶ Center for Genomics, School of Medicine, Loma Linda University, Loma Linda, CA, USA.
⁷ Core Applications Group, Product Development, Illumina Inc, Foster City, CA, USA.
⁸ Computational Genomics and Bioinformatics Branch, Center for Biomedical Informatics and Information Technology, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA.
⁹ AbbVie Genomics Research Center, North Chicago, IL, USA.
¹⁰ State Key Laboratory of Genetic Engineering, School of Life Sciences and Shanghai Cancer Center, Fudan University, Shanghai, China.
¹¹ Biomarker Development, Novartis Institutes for Biomedical Research, Basel, Switzerland.
¹² Companion Diagnostics Development, Oncology Biomarker Development, Genentech, South San Francisco, CA, USA.
¹³ IMTM, Faculty of Medicine and Dentistry, Palacky University, Olomouc, Czech Republic.
¹⁴ Member of EATRIS ERIC - European Infrastructure for Translational Medicine, Amsterdam, The Netherlands.
¹⁵ Centro di Riferimento Oncologico di Aviano (CRO) IRCCS, National Cancer Institute, Unit of Oncogenetics and Functional Oncogenomics, Aviano, Italy.
¹⁶ Perron Institute for Neurological and Translational Science, Nedlands, Australia.
¹⁷ Centre for Molecular Medicine and Innovative Therapeutics, Murdoch University, Murdoch, Australia.
¹⁸ Estonian Genome Centre, Institute of Genomics, University of Tartu, Tartu, Estonia.
¹⁹ Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland.
²⁰ Department of Medical Sciences, Molecular Precision Medicine and Science for Life Laboratory, Uppsala University, Uppsala, Sweden.
²¹ Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, USA.
²² National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
²³ National Center for Toxicological Research, U.S. Food and Drug Administration, FDA, Jefferson, AR, USA.
²⁴ The Center for Drug Evaluation and Research, U.S. Food and Drug Administration, FDA, Silver Spring, MD, USA. Wenming.Xiao@fda.hhs.gov.

PMID: 34753956
PMCID: PMC8578599
DOI: 10.1038/s41597-021-01077-5

Yongmei Zhao et al. Sci Data. 2021.

Abstract

With the rapid advancement of sequencing technologies, next generation sequencing (NGS) analysis has been widely applied in cancer genomics research. More recently, NGS has been adopted in clinical oncology to advance personalized medicine. Clinical applications of precision oncology require accurate tests that can distinguish tumor-specific mutations from artifacts introduced during NGS processes or data analysis. Therefore, there is an urgent need to develop best practices in cancer mutation detection using NGS, and for standard reference data sets that allow accuracy and reproducibility to be measured systematically across platforms and methods. Within the SEQC2 consortium context, we established paired tumor-normal reference samples and generated whole-genome sequencing (WGS) and whole-exome sequencing (WES) data using sixteen library protocols and seven sequencing platforms at six different centers. We systematically interrogated somatic mutations in the reference samples to identify factors affecting detection reproducibility and accuracy in cancer genomes. These large cross-platform, cross-site WGS and WES datasets, generated from well-characterized reference samples, will be a powerful resource for benchmarking NGS technologies and bioinformatics pipelines, and for cancer genomics studies.

Conflict of interest statement

Li Tai Fang is an employee of Roche Sequencing Solutions Inc. Erich Jaeger is an employee of Illumina Inc. Virginie Petitjean and Marc Sultan are employees of Novartis Institutes for Biomedical Research. Tiffany Hung and Eric Peters are employees of Genentech (a member of the Roche group). All other authors claim no conflicts of interest. This is a research study, not intended to guide clinical applications. The views presented in this article do not necessarily reflect current or future opinion or policy of the US Food and Drug Administration. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services. Any mention of commercial products is for clarification and not intended as endorsement.

Figures

**Fig. 1**
Study design for the experiment. DNA was extracted from either fresh cells or FFPE-processed cells. Both fresh DNA and FFPE DNA were profiled on WGS and WES platforms for intra-center, inter-center, and cross-platform reproducibility benchmarking. For fresh DNA, six centers performed WGS and WES in parallel following manufacturer-recommended protocols with limited deviation. Three library preparation protocols (TruSeq-Nano, Nextera Flex, and TruSeq PCR-free) were used with four different quantities of DNA input (1, 10, 100, and 250 ng). DNA from HCC1395 and HCC1395BL was pooled at various ratios to create mixtures of 75%, 50%, 20%, 10%, and 5%. For FFPE samples, each fixation time point (1 h, 2 h, 6 h, 24 h) had six blocks that were sequenced at two different centers. All libraries from these experiments were sequenced on the HiSeq series. In addition, nine libraries using the TruSeq PCR-free preparation were run on a NovaSeq for WGS analysis.

**Fig. 2**
Overall data quality for WGS and WES data sets from the Illumina platform. **(a)** Percentage of total reads mapped to the reference genome (hg38) for WGS (green) and WES (red) across 6 sequencing sites. **(b)** Mean coverage depth for WGS libraries across 6 sequencing sites. **(c)** Mean coverage depth in target capture regions for WES libraries across 6 sequencing sites. **(d)** Percentage of non-duplicated reads mapped to the reference genome across 6 sequencing sites for WGS (green) and WES (red). **(e)** Percent GC content from different library prep protocols for WGS (green) and WES (red). **(f)** Mean insert size distribution from different library prep protocols for WGS (green) and WES (red).

**Fig. 3**
Genome coverage from WGS data from three technologies: Illumina, PacBio, and 10X Genomics. Outer rainbow color track: chromosomes; red track: HCC1395; green track: HCC1395BL. **(a)** Genome coverage from WGS data by reads from the Illumina platform. **(b)** Genome coverage from WGS data by reads from 10X Chromium linked-read technology. **(c)** Genome coverage from WGS data by reads from the PacBio platform. **(d)** Genome coverage plots generated using Indexcov software for the whole genome sequencing cross-site comparison libraries. The estimated coverages along chromosome 6 for HCC1395BL (top) and HCC1395 (bottom) are shown. The top plot shows the net loss of one copy of the short arm of chr6 in HCC1395BL. The bottom plot shows the many copy number gains and losses on chromosome 6 in the tumor cell line HCC1395.

**Fig. 4**
Evaluation of DNA damage for WGS and WES libraries using GIV scores, which capture DNA damage due to artifacts introduced during genomic library preparation. The damage estimate is a global estimate based on an imbalance between R1 and R2 variant frequency. A GIV score above 1.5 is defined as damaged. Undamaged DNA samples have a GIV score of 1. **(a)** DNA damage estimated for DNA prepared from fresh cells for WGS Illumina libraries across different sites. **(b)** DNA damage estimated for FFPE WGS Illumina libraries. **(c)** DNA damage estimated for DNA prepared from fresh cells for WES Illumina libraries across different sites. **(d)** DNA damage estimated for FFPE WES Illumina libraries.
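
The damage threshold in the Fig. 4 caption can be illustrated with a minimal sketch. It assumes that the GIV score for one substitution type is simply the ratio of the R1 variant frequency to the R2 variant frequency. That assumption is consistent with the caption (a balanced sample scores 1) but is not stated in it, and it is not necessarily how the study's software computes the score. The `ReadCounts` type, the function names, and the example counts are all hypothetical.

```python
from dataclasses import dataclass

# From the caption: a GIV score above 1.5 is defined as damaged.
DAMAGE_THRESHOLD = 1.5


@dataclass
class ReadCounts:
    """Hypothetical per-substitution-type counts, split by read of the pair."""
    r1_variant: int  # R1 bases showing the substitution
    r1_total: int    # R1 bases examined
    r2_variant: int  # R2 bases showing the substitution
    r2_total: int    # R2 bases examined


def giv_score(c: ReadCounts) -> float:
    """Assumed definition: (R1 variant frequency) / (R2 variant frequency)."""
    if c.r1_total <= 0 or c.r2_total <= 0 or c.r2_variant <= 0:
        raise ValueError("counts are too low to form a frequency ratio")
    # Cross-multiplied form of (r1_variant / r1_total) / (r2_variant / r2_total).
    return (c.r1_variant * c.r2_total) / (c.r1_total * c.r2_variant)


def is_damaged(score: float) -> bool:
    """Apply the caption's threshold (strictly above 1.5)."""
    return score > DAMAGE_THRESHOLD


balanced = ReadCounts(r1_variant=10, r1_total=1000, r2_variant=10, r2_total=1000)
skewed = ReadCounts(r1_variant=30, r1_total=1000, r2_variant=10, r2_total=1000)

for name, counts in (("balanced", balanced), ("skewed", skewed)):
    score = giv_score(counts)
    print(name, score, "damaged" if is_damaged(score) else "undamaged")
```

Under this assumed definition, equal variant frequencies in R1 and R2 give a score of 1, matching the caption's description of undamaged DNA, while a threefold excess in R1 exceeds the 1.5 threshold.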

**Fig. 5**
Reproducibility of somatic mutation calling from WES and WGS. Reproducibility UpSet plots for 12 repeated WES runs **(a)** and WGS runs **(b)**. The number in each plot represents the reproducibility across the different replicates. **(c)** SNV/indel calling concordance between WES and WGS from twelve repeated runs. For direct comparison, SNVs/indels from WGS runs were limited to genomic regions defined by an exome capture kit (SureSelect V6 + UTR). WES is shown on the left in the Venn diagram and WGS on the right. The coverage depths shown for WES and WGS are effective mean sequence coverage on the exome region, i.e. coverage by the total number of mapped reads after trimming. **(d)** Correlation of MAF in overlapping WGS and WES SNVs/indels from repeated runs.

Dataset use reported in

doi: 10.1038/s41587-021-00993-6
doi: 10.1038/s41587-021-00994-5

Grants and funding

HHSN261201800001C/CA/NCI NIH HHS/United States
