2017 Mar 31:6:e21778.
doi: 10.7554/eLife.21778.

Non-coding cancer driver candidates identified with a sample- and position-specific model of the somatic mutation rate

Malene Juul et al. Elife.

Abstract

Non-coding mutations may drive cancer development. Statistical detection of non-coding driver regions is challenged by a varying mutation rate and uncertainty of functional impact. Here, we develop a statistically founded non-coding driver-detection method, ncdDetect, which includes sample-specific mutational signatures, long-range mutation rate variation, and position-specific impact measures. Using ncdDetect, we screened non-coding regulatory regions of protein-coding genes across a pan-cancer set of whole genomes (n = 505), which top-ranked known drivers and identified new candidates. For individual candidates, presence of non-coding mutations associates with altered expression or decreased patient survival across an independent pan-cancer sample set (n = 5454). This includes an antigen-presenting gene (CD1A), where 5'UTR mutations correlate significantly with decreased survival in melanoma. Additionally, mutations in a base-excision-repair gene (SMUG1) correlate with a C-to-T mutational signature. Overall, we find that a rich model of mutational heterogeneity facilitates non-coding driver identification and integrative analysis points to candidates of potential clinical relevance.

Keywords: cancer; cancer biology; computational biology; driver detection; human; mutational processes; non-coding mutations; systems biology.

Conflict of interest statement

The authors declare that no competing interests exist.

Figures

Figure 1. Variation in mutation rate at different scales and various explanatory variables.
(A) The flowchart illustrates the input to the model fit that predicts the position- and sample-specific mutational probabilities. (B) The number of mutations observed per sample divided into the 14 different cancer types. (C) The set of genomic annotations used as explanatory variables is illustrated on a 300 kb region of chromosome 1 for the colorectal cancer sample CRC_TCGA-A6-6141-01A. For illustrative purposes, the nucleotide sequence is shown on a 30 bp section of chromosome 1 and trinucleotides likewise on a 5 bp section. DOI: http://dx.doi.org/10.7554/eLife.21778.003
Figure 1—figure supplement 1. The average number of mutations observed per sample per bp for each of the considered element types, as well as for intergenic regions.
DOI: http://dx.doi.org/10.7554/eLife.21778.005
Figure 2. Position- and sample-specific predicted mutation rates and scoring-schemes.
(A) A multinomial logistic regression model is used to predict the sample- and position-specific background mutation-probabilities. (B) The genomic annotations and the reference sequence (Figure 1) are used as explanatory variables in a regression fit of the somatic mutation rate. In effect, a logistic regression model is fitted for each of the four types of outcome (three types of mutation and no mutation) and combined into a multinomial logistic regression fit. Logistic regression ensures probability-predictions between zero and one. The mutation probabilities are of such small magnitude that we observe near linearity of the logistic regression curve. (C) Sample- and position-specific predicted mutation probabilities for each of the four outcomes in a 300 bp region of chromosome 1 (chr1:115,824,535–115,824,834) for the colorectal cancer sample CRC_TCGA-A6-6141-01A. (D) Observed sample-specific somatic mutations within the same region. For the sample in question, two mutations are observed; one of type TV{A→T, G→T} and one of type TV{A→C, G→C}. (E) Sample- and position-specific scores for each of the three considered scoring schemes. DOI: http://dx.doi.org/10.7554/eLife.21778.007
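The four-outcome prediction described in panels A and B can be illustrated with a minimal sketch. It assumes "no mutation" is the reference category with a linear predictor of zero; the outcome names and the numeric linear predictors are hypothetical placeholders, not fitted values from the paper.

```python
import math

def multinomial_probs(linear_predictors):
    """Turn per-outcome linear predictors into probabilities.

    'no_mutation' is treated as the reference category (linear predictor 0),
    so the four probabilities always sum to one.
    """
    exps = {name: math.exp(eta) for name, eta in linear_predictors.items()}
    denom = 1.0 + sum(exps.values())
    probs = {name: e / denom for name, e in exps.items()}
    probs["no_mutation"] = 1.0 / denom
    return probs

# Hypothetical linear predictors for one position in one sample.
example = multinomial_probs(
    {"mut_type_1": -13.0, "mut_type_2": -14.0, "mut_type_3": -14.5}
)
for outcome, p in example.items():
    print(outcome, p)
```

With predictors this negative, the normalizing denominator is essentially one, so each mutation probability is very close to the exponential of its linear predictor.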
Figure 3. ncdDetect analysis concepts.
(A) Flowchart of the algorithmic steps of ncdDetect. Panels B through E show the sample-specific calculations, while panels F and G show the calculations across samples. (B) The genomic candidate region is annotated with position- and sample-specific scores. The values of these scores depend on the choice of scoring scheme. (C) The region is also annotated with sample- and position-specific predicted mutation probabilities. These probabilities are predicted by the null model and do not depend on the choice of scoring scheme. (D) The observed score of the sample is defined as the sum of the scores associated with the observed mutational events. Scores based on number of mutations and conservation will assign non-mutated positions a score-value of zero. Scores based on log-likelihoods will assign non-mutated positions a positive score-value, which in practice will be near zero. (E) The sample-specific background score-distribution is obtained by convolution. (F) Sample-specific calculations are carried out for each individual sample in the dataset. (G) The overall background score-distribution is obtained by convolution of the individual-sample distributions. This figure is conceptual and not based on actual data. Figure 4D–F show real examples of background score-distributions. DOI: http://dx.doi.org/10.7554/eLife.21778.009
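The convolution steps in panels E and G can be sketched as follows. This is an illustration under simplifying assumptions, not the ncdDetect implementation: scores are taken to be small integers, each position has only two outcomes, and all probabilities and function names are hypothetical.

```python
from collections import defaultdict

def convolve(dist_a, dist_b):
    """Distribution of the sum of two independent discrete scores."""
    out = defaultdict(float)
    for score_a, prob_a in dist_a.items():
        for score_b, prob_b in dist_b.items():
            out[score_a + score_b] += prob_a * prob_b
    return dict(out)

def convolve_all(distributions):
    """Convolve a list of {score: probability} distributions."""
    total = {0: 1.0}  # point mass at zero is the identity for convolution
    for dist in distributions:
        total = convolve(total, dist)
    return total

def upper_tail(dist, observed_score):
    """P(score >= observed_score) under the background distribution."""
    return sum(p for s, p in dist.items() if s >= observed_score)

# Hypothetical region with two positions: score 1 if mutated, 0 otherwise.
position_dists = [{0: 0.9, 1: 0.1}, {0: 0.9, 1: 0.1}]
sample_dist = convolve_all(position_dists)               # one sample (panel E)
overall_dist = convolve_all([sample_dist, sample_dist])  # two samples (panel G)
print(upper_tail(overall_dist, 2))
```

The p-value of a region is then the upper tail of the overall background distribution at the observed score, which is how the observed scores and distributions in Figure 4D–F relate.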
Figure 3—figure supplement 1. Illustration of time complexity of the ncdDetect algorithm.
Each point illustrates the CPU time in seconds used to calculate the background score distribution for a candidate region of a given length for a given number of samples. DOI: http://dx.doi.org/10.7554/eLife.21778.010
Figure 4. Analysis of protein-coding genes to evaluate ncdDetect performance.
(A) The final null model is obtained through forward model-selection. The QQ-plot shows the p-values of all genes (n = 19,256) plotted against their uniform expectation under the null for each of the five models considered. Deviations from the expectations (red identity line) are seen for a varying proportion of the genes (0.5–10%). Results are shown for conservation scores. Similar plots for log-likelihoods and number of mutations are shown in Figure 4—figure supplement 1. (B) Venn diagram showing the overlap between protein-coding genes called as drivers by ncdDetect (q<0.10) for the three scoring schemes and the COSMIC Gene Census list. (C) COSMIC Gene Census recall plot. The fraction of COSMIC genes recalled in the top ncdDetect candidates. (D–F) The two most significant genes called by ncdDetect are TP53 and PIK3CA. An example of a gene not called significant is SLFN11. For each of these, the convoluted background score-distributions are shown together with the observed scores and resulting p-values. DOI: http://dx.doi.org/10.7554/eLife.21778.011
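The QQ-plot in panel A compares sorted p-values with their uniform expectation. A minimal sketch of how such plot coordinates can be computed is given below; the plotting position rank/(n+1) is one common convention and an assumption here, as are the function name and the example p-values.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) pairs on the -log10 scale.

    The i-th smallest of n p-values is paired with i / (n + 1),
    its approximate expectation if p-values are uniform under the null.
    """
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        expected = rank / (n + 1)
        points.append((-math.log10(expected), -math.log10(p)))
    return points

# Hypothetical p-values that match the uniform expectation exactly,
# so every point falls on the identity line.
for expected, observed in qq_points([0.5, 0.25, 0.75]):
    print(expected, observed)
```

Points that rise above the identity line at the right-hand end correspond to elements with smaller p-values than expected under the null.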
Figure 4—figure supplement 1. Analysis of protein-coding genes to evaluate ncdDetect performance for scores defined by log-likelihoods and the number of mutations.
The final null model is obtained by forward model selection. The QQ-plots show the p-values of all genes (n = 19,256) plotted against their uniform expectation under the null for each of the models considered. (A) Results using log-likelihoods. (B) Results using number of mutations. The corresponding plot for conservation scores is shown in Figure 4A. DOI: http://dx.doi.org/10.7554/eLife.21778.017
Figure 4—figure supplement 2. The p-values (based on conservation scores) plotted as a function of the total number of mutations across samples observed per bp for all protein-coding genes.
The point size indicates gene length. The mean number of mutations per bp is on average eight times higher for the COSMIC genes detected by ncdDetect compared to the undetected COSMIC genes. DOI: http://dx.doi.org/10.7554/eLife.21778.019
Figure 5. Q-values and top-ten ranking non-coding elements for each of the three proposed scoring schemes.
The results discussed in the text relate to conservation scores. Non-coding elements associated with COSMIC genes are highlighted in red. For each element, the region size is given together with the observed number of mutations and the expected number of mutations under the null model. (A) The QQ-plot shows the p-values for all promoter elements (n = 19,157) plotted against their uniform expectation under the null. One hundred and sixty promoter elements are found to be significant. (B) QQ-plot of p-values for all splice sites (n = 17,867). The p-values do not follow the expectation under the null. This is explained by the fact that 90% of all splice sites carry no mutations. Three splice sites are significant with ncdDetect after correcting for multiple testing. DOI: http://dx.doi.org/10.7554/eLife.21778.021
Figure 5—figure supplement 1. Q-values and top-ten ranking elements for each of the three proposed scoring schemes.
Protein-coding COSMIC genes, or non-coding elements associated with COSMIC genes, are highlighted in red. For each element, the region size is given together with the observed number of mutations and the expected number of mutations under the null model. (A) The QQ-plot shows the p-values for all protein-coding genes (n = 19,256) plotted against their uniform expectation under the null. Sixty-four protein-coding genes are found to be significant (conservation scores). (B) QQ-plot of p-values for all 5’ UTRs (n = 18,220). In total, 86 5’ UTRs are significant. (C) QQ-plot of p-values for all 3’ UTRs (n = 18,481), of which 16 are found to be significant. The complete sets of significant elements for each region type are given in Supplementary files 1–3. Similar plots for promoter elements and splice sites are shown in Figure 5. DOI: http://dx.doi.org/10.7554/eLife.21778.023
Figure 5—figure supplement 2. The number of elements called significant for each of the three proposed scoring schemes, for each of the defined element types.
The use of log-likelihoods results in the highest number of elements called significant across most element types, and the use of the number of mutations results in the fewest. DOI: http://dx.doi.org/10.7554/eLife.21778.025
Figure 5—figure supplement 3. Length distributions of all defined element types.
DOI: http://dx.doi.org/10.7554/eLife.21778.027
Figure 6. SMUG1 mutations and base excision repair.
(A) Genomic overview of SMUG1 showing its promoter region (Kent et al., 2002). The DNase clusters track shows DNase hypersensitive regions where the darkness is proportional to the maximum signal strength observed in any cell line (ENCODE Project Consortium, 2012). The transcription-factor-binding sites (TFBSs) track shows core regions of transcription factor binding (Gerstein et al., 2012). The phyloP track shows evolutionary conservation of positions (Pollard et al., 2010). (B) Uracil-DNA glycosylase deficiency signature definition: (1) Cytosines may be methylated (orange circles) at CpG sites (gray box). (2) Spontaneous deamination (red boxes) of non-methylated cytosine results in uracil, causing U:G mismatches. Spontaneous deamination of methylated cytosine results in thymine, causing T:G mismatches. (3a) SMUG1 and UNG are uracil-DNA glycosylases, which, via base excision repair, will repair the U:G mismatches caused by deamination. (3b) If unrepaired, the U:G mismatches will result in G→A mutations. (C) A one-sided Wilcoxon rank sum test is performed per cancer type to investigate if samples with a SMUG1 mutation have a higher value of the uracil-DNA glycosylase deficiency signature statistic than samples without such a mutation. The analysis is based on the 505 whole genome TCGA samples. Each dot represents a sample, and the color represents the SMUG1-associated mutated element. (D) Correlation between the uracil-DNA glycosylase deficiency signature statistic and the product of SMUG1 and UNG gene expression using TCGA exome data for lung adenocarcinoma. DOI: http://dx.doi.org/10.7554/eLife.21778.029
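The one-sided Wilcoxon rank sum test in panel C can be illustrated with an exact version that enumerates all possible group labellings. This is a sketch for very small groups only, with hypothetical signature values and function name; for real data a library routine such as SciPy's `scipy.stats.mannwhitneyu` with `alternative="greater"` would be the usual choice.

```python
from itertools import combinations

def rank_sum_greater(group_a, group_b):
    """Exact one-sided p-value that group_a tends to have larger values.

    Ties receive midranks. The p-value is the fraction of ways to pick
    len(group_a) ranks whose sum is at least the observed rank sum.
    """
    pooled = sorted(group_a + group_b)
    midrank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        midrank[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    all_ranks = [midrank[v] for v in pooled]
    observed = sum(midrank[v] for v in group_a)
    hits = total = 0
    for combo in combinations(all_ranks, len(group_a)):
        total += 1
        if sum(combo) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical signature statistic for mutated and non-mutated samples.
mutated = [5.0, 6.0]
not_mutated = [1.0, 2.0, 3.0]
print(rank_sum_greater(mutated, not_mutated))
```

Enumeration grows combinatorially with group size, which is why the sketch is restricted to toy inputs.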
Figure 6—figure supplement 1. Examples of correlation between the uracil-DNA glycosylase deficiency signature statistic and SMUG1 gene expression (first column), UNG gene expression (second column) and the product of SMUG1 and UNG gene expression (third column) using TCGA exome data for seven different cancer types (rows).
The correlation is assessed using one-sided Spearman’s correlation tests. For some cancer types, the correlation coefficients are positive, although these cases are not significant and generally based on few samples. DOI: http://dx.doi.org/10.7554/eLife.21778.032
Figure 7. Survival and expression analysis of CD1A, PRSS3 and STK11 mutations.
(A) Kaplan-Meier survival curves for melanoma samples with and without mutations in the 5’ UTR of CD1A. For illustration purposes, the data are shown for a follow-up time of 2000 days, at which point 98 out of 324 patients (30%) are still at risk. The analysis is based on the TCGA exome sample set. (B) Kaplan-Meier survival curves for HNSC patients with and without PRSS3 promoter mutations. The data are shown for a follow-up time of 2000 days, at which point 42 out of 484 patients (9%) are still at risk. The analysis is based on the TCGA exome sample set. (C) Genomic overview of STK11, zooming in on its combined splice sites region. The phyloP track shows evolutionary conservation of positions. (D) A two-sided Wilcoxon rank sum test is performed for LUAD samples from the TCGA exome sample set, to investigate if samples mutated in the splice site region of STK11 have a different gene expression level than samples without such mutations. (E) Kaplan-Meier survival curves for LUAD samples with and without STK11 splice site mutations. The data are shown for a follow-up time of 2000 days, at which point 36 out of 438 patients (8%) are still at risk. The analysis is based on the TCGA exome sample set. DOI: http://dx.doi.org/10.7554/eLife.21778.033
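The Kaplan-Meier curves in panels A, B and E rest on a product-limit estimate of survival that accounts for censored patients. A minimal sketch is shown below; the follow-up times and event indicators are hypothetical and the function name is illustrative.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each time with at least one death.

    events[i] is 1 if the patient died at times[i] and 0 if the patient
    was censored (still alive at last follow-up).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve

# Hypothetical follow-up in days; the second patient is censored at day 8.
print(kaplan_meier([5, 8, 8, 12], [1, 0, 1, 1]))
```

The "still at risk" counts quoted in the caption correspond to the `at_risk` value at the chosen follow-up time.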
Appendix 1—figure 1. Illustration of the motivation behind the overdispersion-based rate adjustment.
For candidate element A, we overestimate the mutation rate, and thus end up with a conservative p-value for this element when analysing it with ncdDetect. For candidate element B, on the other hand, we underestimate the mutation rate. In this case, ncdDetect will produce a p-value that is too small, creating a potential false-positive call. The effect of underestimating the mutation rate will be greater for longer candidate elements. DOI: http://dx.doi.org/10.7554/eLife.21778.042
Appendix 1—figure 2. QQ-plots of p-values obtained with and without the overdispersion-based rate adjustment.
(A) QQ-plots of all protein-coding genes (excluding TP53 for illustration purposes). (B) QQ-plots of protein-coding genes shorter than 700 bp. For the shorter genes, the p-values are not particularly inflated. The overdispersion-based rate adjustment does not affect the distribution of p-values much. (C) QQ-plots of protein-coding genes longer than 3000 bp. For the longer genes, the p-values are inflated, and the overdispersion-based rate adjustment effectively corrects for much of this inflation. DOI: http://dx.doi.org/10.7554/eLife.21778.043
Appendix 1—figure 3. COSMIC Gene Census recall plot.
The fraction of COSMIC genes recalled in the top ncdDetect and ExInAtor candidates. DOI: http://dx.doi.org/10.7554/eLife.21778.045
Appendix 1—figure 4. Illustration of overlap between significant elements found by ncdDetect and other non-coding cancer driver screens.
Highlighted elements are mentioned in the text. (A) Overlap of promoter elements found to be significant with ncdDetect and LARVA, as well as promoter elements previously described in a non-coding cancer driver screen (Weinhold et al., 2014). We note that TERT and PLEKHS1 are also detected by a second non-coding driver screen (Melton et al., 2015). (B) Overlap between 3' UTRs detected by ncdDetect and 3' UTRs detected by a previous study (Weinhold et al., 2014). (C) Overlap between 5' UTRs detected by ncdDetect and 5' UTRs detected by a previous study (Weinhold et al., 2014). We note that, of the 863 whole genomes analyzed in Weinhold et al. (2014), 356 are sequenced by the TCGA. These samples appear to be a subset of the 505 TCGA samples analyzed here. The data sets are thus not completely independent. DOI: http://dx.doi.org/10.7554/eLife.21778.048
Author response image 1.
DOI: http://dx.doi.org/10.7554/eLife.21778.049

Cited by

  • Analyses of non-coding somatic drivers in 2,658 cancer whole genomes.
    Rheinbay E, Nielsen MM, Abascal F, Wala JA, Shapira O, Tiao G, Hornshøj H, Hess JM, Juul RI, Lin Z, Feuerbach L, Sabarinathan R, Madsen T, Kim J, Mularoni L, Shuai S, Lanzós A, Herrmann C, Maruvka YE, Shen C, Amin SB, Bandopadhayay P, Bertl J, Boroevich KA, Busanovich J, Carlevaro-Fita J, Chakravarty D, Chan CWY, Craft D, Dhingra P, Diamanti K, Fonseca NA, Gonzalez-Perez A, Guo Q, Hamilton MP, Haradhvala NJ, Hong C, Isaev K, Johnson TA, Juul M, Kahles A, Kahraman A, Kim Y, Komorowski J, Kumar K, Kumar S, Lee D, Lehmann KV, Li Y, Liu EM, Lochovsky L, Park K, Pich O, Roberts ND, Saksena G, Schumacher SE, Sidiropoulos N, Sieverling L, Sinnott-Armstrong N, Stewart C, Tamborero D, Tubio JMC, Umer HM, Uusküla-Reimand L, Wadelius C, Wadi L, Yao X, Zhang CZ, Zhang J, Haber JE, Hobolth A, Imielinski M, Kellis M, Lawrence MS, von Mering C, Nakagawa H, Raphael BJ, Rubin MA, Sander C, Stein LD, Stuart JM, Tsunoda T, Wheeler DA, Johnson R, Reimand J, Gerstein M, Khurana E, Campbell PJ, López-Bigas N; PCAWG Drivers and Functional Interpretation Working Group; PCAWG Structural Variation Working Group; Weischenfeldt J, Beroukhim R, Martincorena I, Pedersen JS, Getz G; PCAWG Consortium. Rheinbay E, et al. Nature. 2020 Feb;578(7793):102-111. doi: 10.1038/s41586-020-1965-x. Epub 2020 Feb 5. Nature. 2020. PMID: 32025015 Free PMC article.
  • A pan-cancer atlas of cancer hallmark-associated candidate driver lncRNAs.
    Deng Y, Luo S, Zhang X, Zou C, Yuan H, Liao G, Xu L, Deng C, Lan Y, Zhao T, Gao X, Xiao Y, Li X. Deng Y, et al. Mol Oncol. 2018 Nov;12(11):1980-2005. doi: 10.1002/1878-0261.12381. Epub 2018 Oct 2. Mol Oncol. 2018. PMID: 30216655 Free PMC article.
  • ncdDetect2: improved models of the site-specific mutation rate in cancer and driver detection with robust significance evaluation.
    Juul M, Madsen T, Guo Q, Bertl J, Hobolth A, Kellis M, Pedersen JS. Juul M, et al. Bioinformatics. 2019 Jan 15;35(2):189-199. doi: 10.1093/bioinformatics/bty511. Bioinformatics. 2019. PMID: 29945188 Free PMC article.
  • Identifying somatic driver mutations in cancer with a language model of the human genome.
    Zeng G, Zhao C, Li G, Huang Z, Zhuang J, Liang X, Yu X, Fang S. Zeng G, et al. Comput Struct Biotechnol J. 2025 Jan 17;27:531-540. doi: 10.1016/j.csbj.2025.01.011. eCollection 2025. Comput Struct Biotechnol J. 2025. PMID: 39968174 Free PMC article.
  • Identification of Cancer Drivers at CTCF Insulators in 1,962 Whole Genomes.
    Liu EM, Martinez-Fundichely A, Diaz BJ, Aronson B, Cuykendall T, MacKay M, Dhingra P, Wong EWP, Chi P, Apostolou E, Sanjana NE, Khurana E. Liu EM, et al. Cell Syst. 2019 May 22;8(5):446-455.e8. doi: 10.1016/j.cels.2019.04.001. Epub 2019 May 8. Cell Syst. 2019. PMID: 31078526 Free PMC article.

References

    1. Agresti A. Categorical Data Analysis. John Wiley & Sons; 2013.
    2. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, Børresen-Dale AL, Boyault S, Burkhardt B, Butler AP, Caldas C, Davies HR, Desmedt C, Eils R, Eyfjörd JE, Foekens JA, Greaves M, Hosoda F, Hutter B, Ilicic T, Imbeaud S, Imielinski M, Imielinsk M, Jäger N, Jones DT, Jones D, Knappskog S, Kool M, Lakhani SR, López-Otín C, Martin S, Munshi NC, Nakamura H, Northcott PA, Pajic M, Papaemmanuil E, Paradiso A, Pearson JV, Puente XS, Raine K, Ramakrishna M, Richardson AL, Richter J, Rosenstiel P, Schlesner M, Schumacher TN, Span PN, Teague JW, Totoki Y, Tutt AN, Valdés-Mas R, van Buuren MM, van 't Veer L, Vincent-Salomon A, Waddell N, Yates LR, Zucman-Rossi J, Futreal PA, McDermott U, Lichter P, Meyerson M, Grimmond SM, Siebert R, Campo E, Shibata T, Pfister SM, Campbell PJ, Stratton MR, Serena N-Z, Samuel AJ, Sam B, Australian Pancreatic Cancer Genome Initiative. ICGC Breast Cancer Consortium. ICGC MMML-Seq Consortium. ICGC PedBrain. Signatures of mutational processes in human cancer. Nature. 2013;500:415–421. doi: 10.1038/nature12477.
    3. Alexandrov LB, Stratton MR. Mutational signatures: the patterns of somatic mutations hidden in cancer genomes. Current Opinion in Genetics & Development. 2014;24:52–60. doi: 10.1016/j.gde.2013.11.014.
    4. An Q, Robins P, Lindahl T, Barnes DE. C → T mutagenesis and gamma-radiation sensitivity due to deficiency in the Smug1 and ung DNA glycosylases. The EMBO Journal. 2005;24:2205–2213. doi: 10.1038/sj.emboj.7600689.
    5. Bates D, Maechler M. MatrixModels: modelling with sparse and dense matrices. 2015. http://CRAN.R-project.org/package=MatrixModels
