Sci Data. 2021 Nov 9;8(1):296.
doi: 10.1038/s41597-021-01077-5.

Whole genome and exome sequencing reference datasets from a multi-center and cross-platform benchmark study

Yongmei Zhao et al. Sci Data.

Abstract

With the rapid advancement of sequencing technologies, next-generation sequencing (NGS) analysis has been widely applied in cancer genomics research. More recently, NGS has been adopted in clinical oncology to advance personalized medicine. Clinical applications of precision oncology require accurate tests that can distinguish tumor-specific mutations from artifacts introduced during NGS processes or data analysis. There is therefore an urgent need for best practices in cancer mutation detection using NGS and for standard reference data sets that allow accuracy and reproducibility to be measured systematically across platforms and methods. Within the SEQC2 consortium, we established paired tumor-normal reference samples and generated whole-genome sequencing (WGS) and whole-exome sequencing (WES) data using sixteen library protocols and seven sequencing platforms at six different centers. We systematically interrogated somatic mutations in the reference samples to identify factors affecting detection reproducibility and accuracy in cancer genomes. These large cross-platform, cross-site WGS and WES datasets, generated from well-characterized reference samples, are a powerful resource for benchmarking NGS technologies and bioinformatics pipelines and for cancer genomics studies.

Conflict of interest statement

Li Tai Fang is an employee of Roche Sequencing Solutions Inc. Erich Jaeger is an employee of Illumina Inc. Virginie Petitjean and Marc Sultan are employees of Novartis Institutes for Biomedical Research. Tiffany Hung and Eric Peters are employees of Genentech (a member of the Roche group). All other authors declare no conflicts of interest. This is a research study and is not intended to guide clinical applications. The views presented in this article do not necessarily reflect current or future opinion or policy of the US Food and Drug Administration. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services. Any mention of commercial products is for clarification and is not intended as endorsement.

Figures

Fig. 1
Study design for the experiment. DNA was extracted from either fresh cells or FFPE-processed cells. Both fresh DNA and FFPE DNA were profiled on WGS and WES platforms for intra-center, inter-center, and cross-platform reproducibility benchmarking. For fresh DNA, six centers performed WGS and WES in parallel following manufacturer-recommended protocols with limited deviation. Three library preparation protocols (TruSeq-Nano, Nextera Flex, and TruSeq PCR-free) were used with four different quantities of DNA input (1, 10, 100, and 250 ng). DNA from HCC1395 and HCC1395BL was pooled at various ratios to create mixtures of 75%, 50%, 20%, 10%, and 5%. For FFPE samples, each fixation time point (1 h, 2 h, 6 h, 24 h) had six blocks that were sequenced at two different centers. All libraries from these experiments were sequenced on the HiSeq series. In addition, nine libraries using the TruSeq PCR-free preparation were run on a NovaSeq for WGS analysis.
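To illustrate what the pooled mixtures imply for mutation detection, the sketch below computes the variant allele frequency (VAF) one might expect for a somatic variant after dilution. It rests on several assumptions that the caption does not state: that the listed percentages are the fraction of tumor (HCC1395) DNA, that the variant is absent from HCC1395BL, and that both genomes contribute reads at the locus in proportion to their DNA fraction (copy number differences are ignored). The function names and the 40% starting VAF are hypothetical.

```python
# Simplified linear dilution model; an illustrative assumption,
# not the method used in the study.

# Mixture levels listed in the Fig. 1 caption, read here as tumor DNA fractions
# (that reading is an assumption).
MIXTURES = [0.75, 0.50, 0.20, 0.10, 0.05]


def expected_vaf(tumor_fraction: float, pure_tumor_vaf: float) -> float:
    """Expected VAF of a tumor-only variant after mixing with normal DNA."""
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be between 0 and 1")
    if not 0.0 <= pure_tumor_vaf <= 1.0:
        raise ValueError("pure_tumor_vaf must be between 0 and 1")
    return tumor_fraction * pure_tumor_vaf


def dilution_table(pure_tumor_vaf: float) -> dict:
    """Map each mixture level to the expected VAF of the variant."""
    return {f: expected_vaf(f, pure_tumor_vaf) for f in MIXTURES}


if __name__ == "__main__":
    # Hypothetical variant seen at 40% VAF in the undiluted tumor sample.
    for fraction, vaf in dilution_table(0.40).items():
        print(f"{fraction:.0%} tumor DNA -> expected VAF {vaf:.3f}")
```

Under this model a variant at 40% VAF in the pure tumor would sit near 2% in the 5% mixture, which is why the low-fraction mixtures are useful for probing detection limits.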
Fig. 2
Overall data quality for WGS and WES data sets from the Illumina platform. (a) Percentage of total reads mapped to the reference genome (hg38) for WGS (green) and WES (red) across six sequencing sites. (b) Mean coverage depth for WGS libraries across six sequencing sites. (c) Mean coverage depth in target capture regions for WES libraries across six sequencing sites. (d) Percentage of non-duplicated reads mapped to the reference genome across six sequencing sites, for WGS (green) and WES (red). (e) Percent GC content from different library prep protocols, for WGS (green) and WES (red). (f) Mean insert size distribution from different library prep protocols, for WGS (green) and WES (red).
Fig. 3
Genome coverage from WGS data from three technologies: Illumina, PacBio, and 10X Genomics. Outer rainbow color track: chromosomes; red track: HCC1395; green track: HCC1395BL. (a) Genome coverage from WGS data by reads from the Illumina platform. (b) Genome coverage from WGS data by reads from 10X Chromium linked-read technology. (c) Genome coverage from WGS data by reads from the PacBio platform. (d) Genome coverage plots generated using Indexcov software for the whole-genome sequencing cross-site comparison libraries. The estimated coverages along chromosome 6 for HCC1395BL (top) and HCC1395 (bottom) are shown. The top panel shows the net loss of one copy of the short arm of chr6 in HCC1395BL. The bottom panel shows the many copy number gains and losses along chromosome 6 in the HCC1395 tumor cell line.
Fig. 4
Evaluation of DNA damage for WGS and WES libraries using GIV scores, which capture DNA damage due to artifacts introduced during genomic library preparation. The damage estimate is global and is based on an imbalance between R1 and R2 variant frequencies. A GIV score above 1.5 is defined as damaged; undamaged DNA samples have a GIV score of 1. (a) DNA damage estimated for DNA prepared from fresh cells for WGS Illumina libraries across different sites. (b) DNA damage estimated for FFPE WGS Illumina libraries. (c) DNA damage estimated for DNA prepared from fresh cells for WES Illumina libraries across different sites. (d) DNA damage estimated for FFPE WES Illumina libraries.
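The thresholding described in the Fig. 4 caption can be sketched as follows. The caption says only that the score reflects an imbalance between R1 and R2 variant frequencies, so the plain R1/R2 ratio used here is an assumption that is consistent with an undamaged score of 1; it is not necessarily the exact formula used by the study's software. The function names and input frequencies are hypothetical.

```python
# Threshold taken from the caption: scores above 1.5 are defined as damaged.
DAMAGE_THRESHOLD = 1.5


def giv_score(r1_variant_freq: float, r2_variant_freq: float) -> float:
    """Ratio of read 1 to read 2 variant frequency for one substitution type.

    Assumed simplification: a plain ratio, so balanced reads give a score of 1.
    """
    if r1_variant_freq < 0:
        raise ValueError("r1_variant_freq must be non-negative")
    if r2_variant_freq <= 0:
        raise ValueError("r2_variant_freq must be positive")
    return r1_variant_freq / r2_variant_freq


def is_damaged(score: float) -> bool:
    """Apply the caption's cutoff: strictly above 1.5 counts as damaged."""
    return score > DAMAGE_THRESHOLD


if __name__ == "__main__":
    # Hypothetical frequencies for one substitution type in two libraries.
    balanced = giv_score(0.0010, 0.0010)
    skewed = giv_score(0.0020, 0.0010)
    print(f"balanced library: GIV={balanced:.2f}, damaged={is_damaged(balanced)}")
    print(f"skewed library:   GIV={skewed:.2f}, damaged={is_damaged(skewed)}")
```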
Fig. 5
Reproducibility of somatic mutation calling from WES and WGS. Reproducibility UpSet plots for 12 repeated WES runs (a) and WGS runs (b). The number in each plot represents the reproducibility across the different replicates. (c) SNV/indel calling concordance between WES and WGS from twelve repeated runs. For direct comparison, SNVs/indels from WGS runs were limited to genomic regions defined by an exome capture kit (SureSelect V6 + UTR). WES is shown on the left in the Venn diagram and WGS on the right. The coverage depths shown for WES and WGS are the effective mean sequence coverage over the exome region, i.e., coverage by the total number of mapped reads after trimming. (d) Correlation of MAF in overlapping WGS and WES SNVs/indels from repeated runs.
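The kind of tally behind a replicate reproducibility plot and a WES/WGS Venn diagram can be sketched with plain set operations. This is a minimal illustration and not the study's pipeline: variants are represented as hypothetical (chromosome, position, ref, alt) tuples, the function names are invented here, and an UpSet plot reports every intersection pattern, whereas this sketch only counts how many replicates support each call.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def replicate_support(call_sets: Iterable[Set[Variant]]) -> Counter:
    """Count, for each variant, the number of replicates that called it."""
    counts: Counter = Counter()
    for calls in call_sets:
        counts.update(set(calls))  # each replicate contributes at most once
    return counts


def support_histogram(call_sets: Iterable[Set[Variant]]) -> Dict[int, int]:
    """Number of variants called in exactly k replicates, keyed by k."""
    return dict(Counter(replicate_support(call_sets).values()))


def concordance(wes: Set[Variant], wgs: Set[Variant]) -> Dict[str, int]:
    """Venn-style counts for two call sets restricted to the same regions."""
    return {
        "wes_only": len(wes - wgs),
        "shared": len(wes & wgs),
        "wgs_only": len(wgs - wes),
    }


if __name__ == "__main__":
    # Toy calls, invented for illustration only.
    a = ("chr1", 100, "A", "T")
    b = ("chr1", 200, "G", "C")
    c = ("chr2", 50, "C", "T")
    runs = [{a, b}, {a}, {a, c}]
    print(support_histogram(runs))
    print(concordance({a, b}, {a, c}))
```

Restricting both call sets to the same capture regions before calling `concordance` mirrors the caption's note that WGS calls were limited to the exome kit's regions for direct comparison.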

Dataset use reported in

  • doi: 10.1038/s41587-021-00993-6
  • doi: 10.1038/s41587-021-00994-5
