Bioinformatics. 2019 Jan 15;35(2):189-199.

doi: 10.1093/bioinformatics/bty511.

ncdDetect2: improved models of the site-specific mutation rate in cancer and driver detection with robust significance evaluation

Malene Juul¹,², Tobias Madsen¹,², Qianyun Guo², Johanna Bertl¹, Asger Hobolth², Manolis Kellis³, Jakob Skou Pedersen¹,²

Affiliations

¹ Department of Molecular Medicine, Aarhus University, Palle Juul-Jensens Boulevard 99, DK-8200 Aarhus N, Denmark.
² Bioinformatics Research Centre, Aarhus University, C.F. Mollers Alle 8, DK-8000 Aarhus C, Denmark.
³ Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT), Cambridge, MA, USA.

PMID: 29945188
PMCID: PMC6330011
DOI: 10.1093/bioinformatics/bty511

Malene Juul et al. Bioinformatics. 2019.

Abstract

Motivation: Understanding the mutational processes that act during cancer development is a key topic of cancer biology. Nevertheless, much remains to be learned, as a complex interplay of processes with dependencies on a range of genomic features creates highly heterogeneous cancer genomes. Accurate driver detection relies on unbiased models of the mutation rate that also capture rate variation from uncharacterized sources.

Results: Here, we analyse patterns of observed-to-expected mutation counts across 505 whole cancer genomes, and find that genomic features missing from our mutation-rate model likely operate on a megabase length scale. We extend our site-specific model of the mutation rate to include the additional variance from these sources, which leads to robust significance evaluation of candidate cancer drivers. We thus present ncdDetect v.2, with greatly improved cancer driver detection specificity. Finally, we show that ranking candidates by the posterior mean of their effect sizes offers an equivalent and more computationally efficient alternative to ranking by their P-values.

Availability and implementation: ncdDetect v.2 is implemented as an R-package and is freely available at http://github.com/TobiasMadsen/ncdDetect2.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Overview of ncdDetect v.2. For each sample, the genomic features replication timing, expression level and reference sequence are used as explanatory variables to predict the sample- and position-specific probabilities of mutation in a multinomial logistic regression model (blue box). For a specific genomic region type, the observed and expected numbers of mutations are collected for each candidate element (red box; illustrated for protein-coding genes). This information is used to estimate the overdispersion parameter ρ, as explained in Section 2.4. The estimated amount of overdispersion is accounted for in the significance evaluation of each candidate element, as explained in Section 2.2. The larger the overdispersion estimate, the harder it is for a candidate element to reach significance (yellow box).

**Fig. 2.**
Illustration and estimation of overdispersion. (A) The smoothed ratio between the observed and expected number of mutations on a representative 5 Mb genomic section (chr6: 113 185 830–118 185 829). Mutation counts are considered in bins of size 1 kb. Bottom: Protein-coding genes with COSMIC CGC genes highlighted (red). In regions where the mutation rate is underestimated (observed-to-expected ratio > 1), ncdDetect v.1 is likely to call false-positive cancer drivers. (B) Observed-to-expected number of mutations shown for protein-coding genes as well as for 10 000 randomly sampled regions, all of length equal to the median length of protein-coding genes (1300 bp). The ratio between observed and expected mutation counts is shown (red histograms). For each region of a given region type, a set of mutations was sampled directly from the background model. The ratio between the sampled and expected counts is similarly depicted (blue histograms). The overlaid histograms illustrate the two main sources of variation in the observed-to-expected ratio, namely sampling variance and overdispersion. Similar plots for the remaining region types are shown in Supplementary Figure S1. The observed and expected numbers of mutations for a given region type are used for overdispersion estimation. (C) The estimated overdispersion parameters, including 95% confidence limits, calculated for each considered region type. For protein-coding genes, results are shown both including (red) and excluding (black) known drivers. (D) A typical position with a predicted mutation probability of 10⁻⁶ is considered. The densities show the variation around this point estimate implied by the overdispersion estimate for the different region types. The densities are obtained by applying the inverse-logit (logistic) transform to normal variables whose variances are determined by the amount of overdispersion. The resulting distribution is known as a logit-normal (Supplementary Section S5).
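
The idea behind panel D can be mimicked with a small simulation. The sketch below is not the ncdDetect2 implementation: it adds normal noise to the predicted probability on the logit scale and maps the result back to a probability, which yields logit-normal draws. The function names and the standard deviations `sigma` are hypothetical stand-ins for the region-type-specific overdispersion estimates; how ρ translates into this variance is defined in the paper (Supplementary Section S5) and is not reproduced here.

```python
import math
import random


def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def expit(x):
    """Inverse of logit: map a real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_logit_normal(p_hat, sigma, n, seed=0):
    """Draw n probabilities scattered around the point estimate p_hat.

    Normal noise with standard deviation sigma is added on the logit scale
    and transformed back, so the draws follow a logit-normal distribution
    whose median is p_hat.
    """
    rng = random.Random(seed)
    center = logit(p_hat)
    return [expit(rng.gauss(center, sigma)) for _ in range(n)]


p_hat = 1e-6  # predicted mutation probability considered in panel D
for sigma in (0.2, 0.5, 1.0):  # hypothetical overdispersion levels
    draws = sorted(sample_logit_normal(p_hat, sigma, 10_000))
    low, high = draws[250], draws[9750]  # approximate central 95% interval
    print(f"sigma={sigma}: central 95% of draws in [{low:.2e}, {high:.2e}]")
```

A larger `sigma` widens the interval around the same point estimate, which is the sense in which more overdispersion makes the per-site mutation probability less certain.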

**Fig. 3.**
Overdispersion as an effect of missing genomic features in the background mutation model. (A) Overdispersion estimates based on background models of increasing complexity. The simplest model, with a constant mutation rate across all samples and positions, results in the highest overdispersion estimate across all region types. The overdispersion estimate tends to decrease as model complexity increases. In the model furthest to the right, a correction for local mutation rate is performed, as described in Section 2.7. (B) The autocorrelation of the observed-to-expected mutation rate in 1 kb windows, based on approximately 1 Gb scattered across the genome. The autocorrelation decays slowly, over a few megabases, suggesting that genomic features that vary slowly across the genome are missing from the model. (C) Autocorrelation functions for DNase hypersensitivity, H3K9me3 histone modification, replication timing and nucleotide excision repair (XR-seq), for comparison.
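
The autocorrelation in panel B can be illustrated on simulated data. The sketch below assumes a hypothetical slowly varying latent factor (an AR(1) process standing in for a missing genomic feature) plus independent per-window noise (standing in for sampling variance), and works with log observed-to-expected ratios for convenience. All parameter values and function names are illustrative assumptions and are not taken from the paper.

```python
import random


def autocorrelation(x, lag):
    """Sample autocorrelation of the sequence x at a non-negative lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var


def simulate_log_ratios(n_windows, phi, seed=0):
    """Simulate log observed-to-expected ratios for consecutive windows.

    A latent AR(1) term with persistence phi mimics a slowly varying
    feature missing from the mutation-rate model; independent noise
    mimics the sampling variance within each window.
    """
    rng = random.Random(seed)
    latent = 0.0
    log_ratios = []
    for _ in range(n_windows):
        latent = phi * latent + rng.gauss(0.0, 0.1)
        log_ratios.append(latent + rng.gauss(0.0, 0.5))
    return log_ratios


log_ratios = simulate_log_ratios(n_windows=20_000, phi=0.98)
for lag in (1, 20, 500):
    value = autocorrelation(log_ratios, lag)
    print(f"lag {lag:>3} windows: autocorrelation = {value:.3f}")
```

If the model captured every relevant feature, only the independent noise would remain and the autocorrelation would be close to zero at every positive lag. Correlation that persists across many windows points to a missing covariate that varies on that length scale.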

**Fig. 4.**
Comparison of ranking methods. (A) We define the effect size as the ratio between the observed and expected score of functional impact, here the CADD score. The effect size is plotted against the sampling variance of the effect size under the null model. A high sampling variance means the outcome is uncertain and an extreme effect size is needed to achieve significance. The sampling variance decreases with region length, and the effect size increases with the number of mutations. (B) Points colored according to their rank under each of the three ranking methods. Compared to $P\text{-values}_{\text{uncorrected}}$, both $P\text{-values}_{\text{od}}$ and the posterior mean (PM) put more emphasis on effect size than on sampling variance. (C) Pairwise comparison of ranking methods. The shift in rank is shown for all elements with a positive effect size. $P\text{-values}_{\text{od}}$ generally give higher ranks to short genes with moderate to large effect size. PM has a similar effect, but it does not give higher ranks to the shortest genes. (D) Pairwise comparison of ranking methods for top-ranked genes. The shift in rank is shown for all elements ranked in the top 100 by either of the two methods. Long genes are generally ranked much lower by $P\text{-values}_{\text{od}}$ and PM than by $P\text{-values}_{\text{uncorrected}}$. (E) The lengths of five selected genes are highlighted in a density plot showing the length distribution of all protein-coding genes. (F) Re-ranking of individual elements of representative lengths when ranking according to $P\text{-values}_{\text{od}}$ and PM. The three long protein-coding genes ZFHX4, ZNF831 and COL6A3 all become less significant when overdispersion is taken into account and are down-ranked by PM. The known short COSMIC CGC gene B2M is up-ranked with both overdispersion and posterior mean. (G) CADD score-based QQ plots of P-values for protein-coding genes obtained with and without overdispersion. QQ plots for the remaining region types are shown in Supplementary Figure S2.

**Fig. 5.**
Performance of ncdDetect v.2 after correcting for overdispersion. (A) A comparison of the fraction of COSMIC CGC genes among top-ranked candidates. Results are obtained by ranking elements according to $P\text{-values}_{\text{od}}$ as well as PM, using both CADD and phyloP scores. For P-value-based results, filled points denote significant elements ($q < 0.10$), while crosses denote non-significant elements ($q \geq 0.10$). For PM-based results, there are no filled points because significance cannot be determined in this setting. (B) QQ plots of $P\text{-values}_{\text{od}}$ for each considered region type, obtained with phyloP, CADD or LINSIGHT scores. Note that LINSIGHT scores are not available for protein-coding genes. (C) Cancer driver candidates identified with ncdDetect v.2 using CADD scores, after accounting for overdispersion. All elements with a colored tile have a q-value less than 0.10. Melanoma-specific results are shown in Supplementary Figure S3.
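
The q < 0.10 threshold used in panels A and C can be illustrated with a short sketch. It assumes that the q-values are Benjamini-Hochberg adjusted P-values, which the caption itself does not state. The function name, element names and P-values are made up for illustration and are not results from the paper.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted P-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical candidate elements with made-up P-values.
candidates = {
    "element_A": 0.0004,
    "element_B": 0.03,
    "element_C": 0.20,
    "element_D": 0.65,
}
names = list(candidates)
qvalues = bh_qvalues([candidates[name] for name in names])
for name, q in zip(names, qvalues):
    label = "significant" if q < 0.10 else "not significant"
    print(f"{name}: q = {q:.4f} ({label})")
```

Ranking by P-value and ranking by q-value give the same order. The adjustment only determines where the significance cutoff falls in that ranking.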

Cited by

Gsw-fi: a GLM model incorporating shrinkage and double-weighted strategies for identifying cancer driver genes with functional impact.
Xu X, Qi Z, Wang L, Zhang M, Geng Z, Han X. BMC Bioinformatics. 2024 Mar 6;25(1):99. doi: 10.1186/s12859-024-05707-8. PMID: 38448819. Free PMC article.
Non-coding driver mutations in human cancer.
Elliott K, Larsson E. Nat Rev Cancer. 2021 Aug;21(8):500-509. doi: 10.1038/s41568-021-00371-z. Epub 2021 Jul 6. PMID: 34230647. Review.
DeepAlloDriver: a deep learning-based strategy to predict cancer driver mutations.
Song Q, Li M, Li Q, Lu X, Song K, Zhang Z, Wei J, Zhang L, Wei J, Ye Y, Zha J, Zhang Q, Gao Q, Long J, Liu X, Lu X, Zhang J. Nucleic Acids Res. 2023 Jul 5;51(W1):W129-W133. doi: 10.1093/nar/gkad295. PMID: 37078611. Free PMC article.
Identifying somatic driver mutations in cancer with a language model of the human genome.
Zeng G, Zhao C, Li G, Huang Z, Zhuang J, Liang X, Yu X, Fang S. Comput Struct Biotechnol J. 2025 Jan 17;27:531-540. doi: 10.1016/j.csbj.2025.01.011. eCollection 2025. PMID: 39968174. Free PMC article.
MutSpot: detection of non-coding mutation hotspots in cancer genomes.
Guo YA, Chang MM, Skanderup AJ. NPJ Genom Med. 2020 Jun 5;5:26. doi: 10.1038/s41525-020-0133-4. eCollection 2020. PMID: 32550006. Free PMC article.

Grants and funding

U01 HG009088/HG/NHGRI NIH HHS/United States
