2021 Oct 19;17(10):e1009542.
doi: 10.1371/journal.pcbi.1009542. eCollection 2021 Oct.

A data-driven approach for constructing mutation categories for mutational signature analysis

Gal Gilad et al. PLoS Comput Biol.

Abstract

Mutational processes shape the genomes of cancer patients, and understanding them has important applications in diagnosis and treatment. Current modeling of mutational processes identifies their characteristic signatures by viewing each base substitution in a limited context of a single flanking base on each side. This context definition gives rise to 96 categories of mutations that have become the standard in the field, even though wider contexts have been shown to be informative in specific cases. Here we propose a data-driven approach for constructing a mutation categorization for mutational signature analysis. Our approach is based on the assumption that tumor cells exposed to similar mutational processes show similar expression levels of the DNA damage repair genes involved in these processes. We attempt to find a categorization that maximizes the agreement between mutation and gene expression data, and show that it outperforms the standard categorization over multiple quality measures. Moreover, we show that the categorization we identify generalizes to unseen data from different cancer types, suggesting that mutation context patterns extend beyond the immediate flanking bases.
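The count of 96 standard categories can be checked by enumeration. The sketch below assumes the usual convention of reporting the mutated base on the pyrimidine strand (C or T), which leaves six substitution types; the function name and the `A[C>T]G` label format are illustrative choices, not taken from the paper.

```python
from itertools import product

BASES = "ACGT"
# Six substitution types once the reference base is restricted to C or T.
SUBSTITUTIONS = [("C", "A"), ("C", "G"), ("C", "T"),
                 ("T", "A"), ("T", "C"), ("T", "G")]

def enumerate_categories(flank):
    """List every category for `flank` context bases on each side.

    Labels look like A[C>T]G: left flank, [ref>alt], right flank.
    """
    categories = []
    for ref, alt in SUBSTITUTIONS:
        for left in product(BASES, repeat=flank):
            for right in product(BASES, repeat=flank):
                categories.append(
                    f"{''.join(left)}[{ref}>{alt}]{''.join(right)}")
    return categories

standard = enumerate_categories(flank=1)  # one flanking base per side
wide = enumerate_categories(flank=3)      # 7-mer context
print(len(standard), len(wide))           # prints: 96 24576
```

One flanking base per side gives 6 x 4 x 4 = 96 categories, while three per side gives 6 x 4^6 = 24576. That growth is one reason a wider context calls for grouping sequences into a smaller set of categories.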


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Overview of the computational pipeline.
(A) For each mutation categorization (a set of M mutation categories), a mutation count matrix of N samples by M mutation categories is built through the assignment of 7-mer mutation sequences to their best matching category. (B) For each mutation categorization, its corresponding normalized mutation count matrix is factorized to produce a mutational signature matrix H and an exposure matrix W using NMF. (C) For each categorization, the correlation between the exposure matrix W and the corresponding gene expression data is computed using canonical correlation analysis to determine its fitness. (D) Categorizations are selected from the population for crossover and mutation to produce offspring for the next generation of the genetic algorithm.
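Step (A) can be sketched as follows. The encoding is an assumption: each mutation is an 8-character string (three left-flank bases, reference base, alternate base, three right-flank bases) and `N` marks a wildcard position in a category. The caption does not spell out what "best matching" means, so this sketch assumes it is the matching category with the fewest wildcards, with ties broken by list order. All names and toy data are hypothetical.

```python
WILDCARD = "N"  # assumption: wildcard symbol used in category patterns

def matches(category, sequence):
    """True if every non-wildcard position of `category` equals `sequence`."""
    return len(category) == len(sequence) and all(
        c == WILDCARD or c == s for c, s in zip(category, sequence))

def best_category(categories, sequence):
    """Index of the matching category with the fewest wildcards, or None.

    Assumption: "best" means most specific; the source does not state the rule.
    """
    best, best_wildcards = None, None
    for idx, category in enumerate(categories):
        if matches(category, sequence):
            n_wild = category.count(WILDCARD)
            if best is None or n_wild < best_wildcards:
                best, best_wildcards = idx, n_wild
    return best

def count_matrix(samples, categories):
    """Build an N x M count matrix (N samples, M categories)."""
    matrix = []
    for sequences in samples:
        row = [0] * len(categories)
        for sequence in sequences:
            idx = best_category(categories, sequence)
            if idx is not None:  # unmatched sequences are skipped
                row[idx] += 1
        matrix.append(row)
    return matrix

# Hypothetical toy categorization and samples.
categories = ["NNACTGNN", "NNNCTNNN", "NNNCANNN"]
samples = [
    ["TTACTGAA", "GGACTGCC", "ACGCTTAC"],  # sample 0
    ["ACGCATAC", "TTTCTAAA"],              # sample 1
]
V = count_matrix(samples, categories)
print(V)  # prints: [[2, 1, 0], [0, 1, 1]]
```

Here `TTACTGAA` matches both of the first two categories and is assigned to the more specific one.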
Fig 2. Top categories and the percent of data set sequences that are mapped to each of them.
In each category, the bases at the fourth and fifth positions represent the mutation. Flat Xs represent wildcards. The top 20 categories with the highest average prevalence (i.e., the number of sequences that are mapped to a category divided by the number of data set sequences) over all 3 data sets are shown. Each bar is scaled from zero to the highest prevalence value in its column (data set).
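The prevalence and bar scaling described in this caption amount to a few lines of arithmetic. The sketch below is illustrative only: the function names, category labels and counts are hypothetical, and `N` stands in for the wildcard.

```python
def prevalence(counts):
    """Fraction of a data set's sequences mapped to each category."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def scale_to_column_max(values):
    """Scale one data set's values to [0, 1] relative to its largest value."""
    top = max(values)
    return [v / top for v in values] if top else [0.0] * len(values)

def top_categories(prevalence_by_dataset, names, n=20):
    """Rank categories by their average prevalence across data sets."""
    averages = [sum(col) / len(col) for col in zip(*prevalence_by_dataset)]
    order = sorted(range(len(names)), key=lambda i: averages[i], reverse=True)
    return [names[i] for i in order[:n]]

# Hypothetical counts for three categories in two data sets.
names = ["NNACTGNN", "NNNCTNNN", "NNNCANNN"]
counts_by_dataset = [[50, 30, 20], [10, 60, 30]]
prev = [prevalence(counts) for counts in counts_by_dataset]
ranked = top_categories(prev, names, n=2)
bars = [scale_to_column_max(p) for p in prev]
print(ranked)  # prints: ['NNNCTNNN', 'NNACTGNN']
```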
Fig 3. Comparative performance evaluation by reconstruction error (log-transformed).
For each number of components K in the range [2, max(k_cosmic, K*) + 2], we apply NMF to the WGS samples to learn the signature matrix H and then derive the exposure matrix W from the test samples using NNLS. The reported reconstruction error (Kullback–Leibler divergence) is the approximation error of this factorization with respect to the test samples of the (normalized) count matrix V.
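A minimal sketch of the error measure, assuming the generalized form of the Kullback–Leibler divergence, D(V || WH) = sum of v log(v / r) - v + r over all entries with r taken from WH. The caption does not say which variant was used. The matrices are hypothetical toy values, and `W_test` is simply given here, whereas in the evaluation it would come from NNLS with H held fixed.

```python
import math

def component_range(k_cosmic, k_star):
    """Values of K in the closed interval [2, max(k_cosmic, K*) + 2]."""
    return list(range(2, max(k_cosmic, k_star) + 3))

def matmul(W, H):
    """Multiply an N x K matrix by a K x M matrix (lists of lists)."""
    return [[sum(w * h for w, h in zip(row, col)) for col in zip(*H)]
            for row in W]

def kl_divergence(V, R, eps=1e-12):
    """Generalized KL divergence D(V || R), summed over all entries."""
    total = 0.0
    for v_row, r_row in zip(V, R):
        for v, r in zip(v_row, r_row):
            r = max(r, eps)          # guard against log of zero
            if v > 0:                # 0 * log(0) is taken as 0
                total += v * math.log(v / r)
            total += r - v
    return total

# Hypothetical toy factorization: 2 signatures over 3 categories.
H = [[0.7, 0.2, 0.1],
     [0.1, 0.3, 0.6]]
W_test = [[1.0, 0.0],
          [0.5, 0.5]]
V_test = [[0.7, 0.2, 0.1],
          [0.5, 0.2, 0.3]]

error = kl_divergence(V_test, matmul(W_test, H))
log_error = math.log(error)  # the figure reports a log-transformed error
```

The divergence is zero when the reconstruction equals V exactly and grows as the two diverge, so lower values indicate that signatures learned on one set of samples describe the held-out samples better.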
Fig 4. Comparative performance evaluation: Correlation to expression of DDR genes (A) and CGC genes (B).
For each number of components K in the range [2, max(k_cosmic, K*) + 2], we apply NMF to the WGS samples to learn the signature matrix H and then derive the exposure matrix W from the WES samples using NNLS. We learn the CCA coefficients using WES training samples and compute the resulting correlation on the test samples. The reported correlation is the average over 10-fold cross validation. Error bars represent the standard deviation of multiple evaluation runs.
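The train/test use of CCA can be sketched with NumPy. This is an illustrative implementation, not the authors' code: it keeps only the first pair of canonical directions, computes them by QR decomposition followed by an SVD, and assumes both matrices have full column rank with more samples than columns. The data are synthetic, with `X` standing in for exposures and `Y` for gene expression.

```python
import numpy as np

def fit_cca(X, Y):
    """First pair of canonical directions between X (n x p) and Y (n x q).

    Assumes full column rank and n larger than p and q.
    """
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Qx, Rx = np.linalg.qr(X - x_mean)
    Qy, Ry = np.linalg.qr(Y - y_mean)
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)
    a = np.linalg.solve(Rx, U[:, 0])   # direction for X
    b = np.linalg.solve(Ry, Vt[0, :])  # direction for Y
    return x_mean, y_mean, a, b

def held_out_correlation(X, Y, n_folds=10, seed=0):
    """Average test-fold correlation of the first canonical variates."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for test_idx in folds:
        train = np.ones(len(X), dtype=bool)
        train[test_idx] = False
        x_mean, y_mean, a, b = fit_cca(X[train], Y[train])
        # Project held-out samples with coefficients learned on training data.
        u = (X[test_idx] - x_mean) @ a
        v = (Y[test_idx] - y_mean) @ b
        scores.append(np.corrcoef(u, v)[0, 1])
    return float(np.mean(scores))

# Synthetic data sharing one latent factor z (purely illustrative).
rng = np.random.default_rng(0)
z = rng.normal(size=200)
X = np.outer(z, [1.0, 0.5, 0.0]) + 0.3 * rng.normal(size=(200, 3))
Y = np.outer(z, [1.0, -1.0, 0.5, 0.0]) + 0.3 * rng.normal(size=(200, 4))
score = held_out_correlation(X, Y)
```

Scoring on held-out folds matters because CCA can find high correlations in the training data by chance, especially when the expression matrix has many columns.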
Fig 5. An example DDC signature (#3 in Table 1).
The signature is depicted using both the standard categories (top) and the DDC ones (bottom, categories presented using the IUPAC code). This signature is similar to COSMIC Signature 3 (cosine similarity 0.89), and its exposure is correlated with the expression of DDR genes (Pearson correlation 0.37).
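The two similarity measures quoted in this caption, and the IUPAC notation used for the category labels, can be sketched as below. The IUPAC table is the standard nucleotide ambiguity code; the function names, the example pattern and the toy vectors are hypothetical.

```python
import math

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches_iupac(pattern, sequence):
    """True if each base of `sequence` is allowed by the IUPAC `pattern`."""
    return len(pattern) == len(sequence) and all(
        base in IUPAC[code] for code, base in zip(pattern, sequence))

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def pearson(a, b):
    """Pearson correlation, computed as the cosine of the centered vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a],
                             [y - mean_b for y in b])

# Hypothetical signature vectors over four categories.
signature = [0.1, 0.2, 0.3, 0.4]
reference = [0.2, 0.4, 0.6, 0.8]  # same shape, different scale
similarity = cosine_similarity(signature, reference)
is_match = matches_iupac("NNRCTGNN", "TTACTGAA")  # R allows A or G
```

Cosine similarity ignores overall scale, so two signatures with the same relative category weights score 1 regardless of normalization.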
