Comparative Study
Nat Commun. 2020 Sep 21;11(1):4748.
doi: 10.1038/s41467-020-18151-y.

Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples

Matthew H Bailey et al. Nat Commun.

Erratum in

Abstract

The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) curated consensus somatic mutation calls using whole exome sequencing (WES) and whole genome sequencing (WGS), respectively. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2,658 cancers across 38 tumour types, we compare WES and WGS side-by-side from 746 TCGA samples, finding that ~80% of mutations overlap in covered exonic regions. We estimate that low variant allele fraction (VAF < 15%) and clonal heterogeneity contribute up to 68% of private WGS mutations and 71% of private WES mutations. We observe that ~30% of private WGS mutations trace to mutations identified by a single variant caller in WES consensus efforts. WGS captures both ~50% more variation in exonic regions and un-observed mutations in loci with variable GC-content. Together, our analysis highlights technological divergences between two reproducible somatic variant detection efforts.
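The overlap and low-VAF figures quoted in the abstract can be illustrated with a minimal Python sketch. It assumes each call set is a mapping from a hypothetical `(chrom, pos, ref, alt)` key to a VAF, and it assumes overlap is measured as matched calls over the union of both sets; the abstract does not state the exact denominator, and the sketch covers only the low-VAF part of the estimate, not clonal heterogeneity.

```python
from typing import Dict, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def compare_calls(wes: Dict[Variant, float],
                  wgs: Dict[Variant, float],
                  low_vaf: float = 0.15) -> Dict[str, float]:
    """Summarise agreement between two call sets that map variant -> VAF."""
    matched = wes.keys() & wgs.keys()
    private_wes = wes.keys() - matched
    private_wgs = wgs.keys() - matched
    union = wes.keys() | wgs.keys()

    def low_vaf_share(calls: Dict[Variant, float], keys) -> float:
        # Fraction of the given private calls whose VAF is below the cut-off
        if not keys:
            return 0.0
        return sum(1 for k in keys if calls[k] < low_vaf) / len(keys)

    return {
        # Assumption: overlap = matched / union of both call sets
        "overlap_fraction": len(matched) / len(union) if union else 0.0,
        "private_wes_low_vaf": low_vaf_share(wes, private_wes),
        "private_wgs_low_vaf": low_vaf_share(wgs, private_wgs),
    }


# Toy data, not taken from the study
wes_calls = {("1", 100, "A", "T"): 0.40, ("1", 200, "C", "G"): 0.10,
             ("2", 300, "G", "A"): 0.30}
wgs_calls = {("1", 100, "A", "T"): 0.38, ("2", 300, "G", "A"): 0.33,
             ("3", 50, "T", "C"): 0.05, ("3", 60, "T", "C"): 0.50}
summary = compare_calls(wes_calls, wgs_calls)
```

The 15% threshold is the one named in the abstract; everything else, including the data layout, is illustrative.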

Conflict of interest statement

The authors declare the following competing interests: G.G. receives research funds from IBM and Pharmacyclics. G.G. is an inventor on patent applications related to MuTect, ABSOLUTE, and other tools. The remaining authors declare no competing interests.

Figures

Fig. 1. Workflow and sample inclusion statistics.
a A workflow diagram illustrates the number of mutations present at each step (gradient) of the filtering process for MC3 (left, blue) and PCAWG (right, red). A brief description of each step of the intersection process is shown in between. b TCGA barcodes and aliquot IDs were used to match somatic sequencing data. The exact match of these IDs is shown for various collection aliquots from tissue to plate. c A volcano plot highlights cancer subtype discrepancies between PCAWG and MC3, with −log10(p-value) on the y-axis and log2(odds ratio) on the x-axis (Fisher’s exact test). The horizontal red bar indicates the significance threshold after multiple testing correction. Positive values indicate over-representation of a cancer subtype in PCAWG, while negative values indicate under-representation of a cancer subtype in PCAWG compared to MC3; the two sides are separated by a vertical red bar. d Sample counts for each cancer type are shown in a bar chart. The colors correspond to those in panel c.
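The volcano plot coordinates in panel c can be sketched in plain Python. The sketch below assumes a 2x2 table of the form [[subtype in PCAWG, other in PCAWG], [subtype in MC3, other in MC3]], a Bonferroni correction for the threshold, and a 0.5 continuity correction when a cell is zero; the caption names only Fisher’s exact test and "multiple testing correction", so these choices and all function names are illustrative.

```python
from math import comb, log2, log10


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)


def volcano_point(a: int, b: int, c: int, d: int):
    """Return (log2 odds ratio, -log10 p-value) for one cancer subtype."""
    if 0 in (a, b, c, d):
        # Assumption: add 0.5 to every cell to avoid division by zero
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
        p = fisher_exact_two_sided(int(a), int(b), int(c), int(d))
    else:
        p = fisher_exact_two_sided(a, b, c, d)
    return log2((a * d) / (b * c)), -log10(p)


def bonferroni_line(alpha: float, n_tests: int) -> float:
    """Height of the significance bar on the -log10(p) axis (assumed Bonferroni)."""
    return -log10(alpha / n_tests)


# Toy counts, not taken from the study
x, y = volcano_point(3, 1, 1, 3)
```

A positive `x` corresponds to over-representation of the subtype in the first row of the table, which matches the sign convention described in the caption.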
Fig. 2. Landscape of mutation overlap by caller, sample and cancer type.
a UpSetR plot shows the variant calling set intersection by caller. The y-axis indicates set intersection size and the x-axis uses a connected dot plot to indicate which sets are considered. Only the largest 27 intersecting sets are shown. Two insets accompany the UpSetR plot: a classic Euler diagram (left) indicates the total number of overlapping mutations, and a set-size bar chart (right) illustrates the total number of mutations considered from each caller. The concordance set indicates the agreement between WES and WGS. Indel callers are indicated with an asterisk. b A scatter plot shows per-sample concordance, calculated as the number of matched variants divided by the total number of mutations called by MC3 exome sequencing (x-axis) and by PCAWG whole genome sequencing (y-axis). The total fraction of samples within each quadrant is also displayed. Each point is linked to tumor portion data collected from the TCGA barcode ID. c This box plot separates the data in panel b by cancer type (blue boxes consider all MC3 variants; red boxes, all PCAWG variants). Sample sizes are displayed for each cancer; points indicate samples that extend past 1.5 times the interquartile range; and the horizontal bar within each box indicates the median matched mutation fraction.
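The per-sample fractions plotted in panel b can be sketched as follows. The record layout `(sample, chrom, pos, ref, alt)`, the function names, and the 0.5 cut used to assign quadrants are assumptions for illustration; the caption does not say where the quadrant boundaries lie.

```python
from collections import defaultdict


def per_sample_concordance(mc3_calls, pcawg_calls):
    """Return {sample: (matched / MC3 total, matched / PCAWG total)}.

    Each input is an iterable of (sample, chrom, pos, ref, alt) tuples.
    """
    mc3_by_sample = defaultdict(set)
    pcawg_by_sample = defaultdict(set)
    for sample, *variant in mc3_calls:
        mc3_by_sample[sample].add(tuple(variant))
    for sample, *variant in pcawg_calls:
        pcawg_by_sample[sample].add(tuple(variant))

    result = {}
    for sample in mc3_by_sample.keys() | pcawg_by_sample.keys():
        mc3, pcawg = mc3_by_sample[sample], pcawg_by_sample[sample]
        matched = len(mc3 & pcawg)
        x = matched / len(mc3) if mc3 else 0.0      # x-axis: fraction of MC3 calls matched
        y = matched / len(pcawg) if pcawg else 0.0  # y-axis: fraction of PCAWG calls matched
        result[sample] = (x, y)
    return result


def quadrant(x: float, y: float, cut: float = 0.5) -> str:
    """Label a point by quadrant; the cut value is a hypothetical choice."""
    return ("high" if x >= cut else "low") + "_mc3/" + ("high" if y >= cut else "low") + "_pcawg"


# Toy data, not taken from the study
mc3 = [("S1", "1", 100, "A", "T"), ("S1", "1", 200, "C", "G"), ("S2", "2", 5, "G", "A")]
pcawg = [("S1", "1", 100, "A", "T"), ("S2", "2", 5, "G", "A"), ("S2", "2", 9, "G", "T"),
         ("S2", "2", 11, "C", "T"), ("S2", "2", 12, "C", "T")]
fractions = per_sample_concordance(mc3, pcawg)
```

Grouping the resulting fractions by cancer type would give the inputs for the box plots in panel c.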
Fig. 3. Recoverability simulation and effects of subclones on mutation concordance.
a Observed recovery rate of PCAWG variants in MC3 (red) and of MC3 variants in PCAWG (blue), alongside sequencing noise simulations calculated from random draws of a binomial model that incorporates the VAF and estimated read depth at each site (light red simulates PCAWG recoverability of MC3 variants, and light blue simulates MC3 recoverability of PCAWG variants). The y-axis is described in the legend. The x-axis displays the VAF of the comparative data set with respect to the y-axis. b A stacked bar chart displays the proportion of matched and unique variants (y-axis) for different VAF bins (x-axis). 180 variants did not provide read count information and were removed from this figure. c A stacked proportional histogram shows the fractions of PCAWG matched mutations (purple) and PCAWG-unique mutations (red). Mutations were restricted to SNVs, and subclonality predictions are indicated as either ‘Clonal’ or ‘Sub-clonal’. Columns 2–4 reflect the sub-clonal assignment provided by PCAWG (note: only a few samples reported five predicted subclones, and these were not included in this analysis). The number of variants represented for each clonal assignment is shown on the x-axis. d Similar to panel c, a stacked proportional histogram illustrates the proportion of matched and unique variants for MC3, providing estimates of the total number of matched or unique variants called by MC3.
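The sequencing noise simulation in panel a can be approximated with a short Python sketch. The caption specifies only random draws from a binomial model using the VAF and estimated read depth at each site; the recovery rule used here (at least `min_alt_reads` supporting reads), the number of draws, and all names are assumptions.

```python
import random


def simulate_recovery(sites, min_alt_reads=3, n_draws=1000, seed=0):
    """Estimate the fraction of variants expected to be recoverable.

    sites: list of (vaf, depth) pairs, where vaf is the allele fraction seen
    in one data set and depth is the estimated read depth in the other.
    A draw counts as recovered when the simulated number of alternate reads
    reaches min_alt_reads (a hypothetical detection rule).
    """
    rng = random.Random(seed)
    recovered = 0
    total = 0
    for vaf, depth in sites:
        for _draw in range(n_draws):
            # One Binomial(depth, vaf) draw, built from Bernoulli trials
            alt_reads = sum(1 for _read in range(depth) if rng.random() < vaf)
            recovered += alt_reads >= min_alt_reads
            total += 1
    return recovered / total if total else 0.0


# Toy sites, not taken from the study
high_vaf_rate = simulate_recovery([(0.30, 50)])
low_vaf_rate = simulate_recovery([(0.03, 50)])
```

Under these assumptions, sampling noise alone makes low-VAF variants much harder to recover at the same depth, which is the baseline against which the observed recovery rates are compared.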
Fig. 4. Screenshots of the online tool MAFit. Here we display screenshots from the MAFit online interface.
Currently there are three main components to the interface: a A side panel shows sliders and radio buttons to filter which data are included. In addition, a download button is available that will download the underlying data table. b MAFit rebuilds Fig. 2b in the first tab of the online interface. Each alteration to the radio buttons or VAF sliders will result in an updated figure. In addition, if one hovers over a point on the scatter plot, a pop-up window will automatically display, providing the user with the basic statistics used to calculate that point, i.e., the total number of mutations and the number of unique and matched mutations. c A table is also presented based on the selection criteria in panel a.
Fig. 5. WGS mutations in exonic regions not captured by WES.
a A sunburst diagram provides a breakdown of variants that are removed during the coverage step of the tool. The innermost circle represents the total number of variants identified upon filtering for exome beds used by MC3. Then, we restrict PCAWG variants to well-covered MC3 regions for each sample. The majority of exonic regions in the gencode.v19 annotation and the BROAD target bed file are sufficiently covered by PCAWG in flanking regions: 3’UTR, 5’UTR, and 5’Flanking. The outermost ring illustrates the number of mutations identified by PCAWG that were poorly covered by MC3. b A density plot illustrates the density of percent GC-content from a 100 bp window surrounding a variant. Four variant sets are displayed: matched, private to MC3, private to PCAWG, and an extended set of exonic variants not covered by WES but sufficiently covered in WGS (Covered by PCAWG only). c A scatter plot displays mean sequence depth (y-axis) by increasing GC-content bins (x-axis). Points are colored according to variant set (same as panel b). d–f Total annotated mutation counts are shown for 5’UTR, 3’UTR, and missense mutations, respectively. g Expression Z-scores for 3’UTR mutations using all TCGA-UCEC samples. Cis-RNAseq expression violin plots are displayed for 13 genes. On top of the gene-level distribution violin plot, box and whisker plots display sample expression based on mutation classification (boxes span the 25th to 75th percentiles, and whiskers extend to 1.5 times the interquartile range).
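The GC-content measure in panel b can be illustrated with a minimal Python sketch. It assumes the 100 bp window is centered on a 0-based variant position and is clipped at the ends of the reference sequence; the caption gives only the window size, so the centering, the clipping, and the function name are illustrative. Ambiguous bases such as N count toward the window length but not toward GC.

```python
def gc_percent(reference: str, pos: int, window: int = 100) -> float:
    """Percent G/C bases in a window of the given size around position pos."""
    half = window // 2
    start = max(0, pos - half)               # clip at the start of the sequence
    end = min(len(reference), pos + half)    # clip at the end of the sequence
    seq = reference[start:end].upper()
    if not seq:
        return 0.0
    gc_count = sum(1 for base in seq if base in "GC")
    return 100.0 * gc_count / len(seq)


# Toy reference: 100 bp of AT repeats followed by 100 bp of GC repeats
toy_reference = "AT" * 50 + "GC" * 50
boundary_gc = gc_percent(toy_reference, 100)
```

Binning these percentages and averaging read depth per bin would give the kind of depth-by-GC summary shown in panel c.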
Fig. 6. Significantly mutated gene analysis with the inclusion of UTR mutations.
OncoPrint plots were generated using the R package ComplexHeatmap for four cancer types: LUAD (a), LIHC (b), LUSC (c), and SKCM (d). We report all significantly mutated genes (SMGs) identified by Bailey et al. 2018, as well as the top significantly mutated gene hits from MuSiC that include non-coding mutations. While many non-coding mutations look promising, further investigation yielded little support for their status as drivers.
