Comparative Study
Nat Genet. 2016 Oct;48(10):1288-94.
doi: 10.1038/ng.3658. Epub 2016 Sep 12.

Unsupervised detection of cancer driver mutations with parsimony-guided learning

Runjun D Kumar et al. Nat Genet. 2016 Oct.

Abstract

Methods are needed to reliably prioritize biologically active driver mutations over inactive passengers in high-throughput sequencing cancer data sets. We present ParsSNP, an unsupervised functional impact predictor that is guided by parsimony. ParsSNP uses an expectation-maximization framework to find mutations that explain tumor incidence broadly, without using predefined training labels that can introduce biases. We compare ParsSNP to five existing tools (CanDrA, CHASM, FATHMM Cancer, TransFIC, and Condel) across five distinct benchmarks. ParsSNP outperformed the existing tools in 24 of 25 comparisons. To investigate the real-world benefit of these improvements, we applied ParsSNP to an independent data set of 30 patients with diffuse-type gastric cancer. ParsSNP identified many known and likely driver mutations that other methods did not detect, including truncation mutations in known tumor suppressors and the recurrent driver substitution RHOA p.Tyr42Cys. In conclusion, ParsSNP uses an innovative, parsimony-based approach to prioritize cancer driver mutations and provides dramatic improvements over existing methods.

Conflict of interest statement

Competing Financial Interests

The authors declare no competing financial interests.

Figures

Figure 1. Overview of ParsSNP and label learning
A) 1. Label learning begins with a training set of mutations, each belonging to a sample.
   2. Descriptors are assigned, and random labels generated (portrayed numbers are illustrative).
   3. EM updates labels iteratively such that putative drivers are distributed among samples (E-step) and defined in terms of descriptors (M-step).
   4. The final labels and descriptors are used to train a neural network model.
   5. The ParsSNP model produces ParsSNP scores when applied to new mutations.
B) Distribution of ParsSNP labels after averaging 50 runs (N=566,223).
C) Percent contribution of descriptors to ParsSNP scores, using Garson’s algorithm for neural network weights (see text).
D) The ParsSNP model was applied to the training and hypermutator pan-cancer sets to produce ParsSNP scores. The fraction of mutations identified as drivers is displayed at various sample mutation burdens and ParsSNP thresholds.
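
Panel C reports percent contributions derived from neural network weights with Garson's algorithm. The sketch below shows one common formulation for a network with a single hidden layer and a single output. The function name, argument layout, and example weights are illustrative assumptions; the exact variant and network architecture used for ParsSNP are not stated in this caption.

```python
from typing import List, Sequence


def garson_importance(input_hidden: Sequence[Sequence[float]],
                      hidden_output: Sequence[float]) -> List[float]:
    """Percent contribution of each input under one common form of Garson's algorithm.

    input_hidden[i][j] is the weight from input i to hidden unit j.
    hidden_output[j] is the weight from hidden unit j to the single output.
    Both layouts are assumptions made for this sketch.
    """
    n_inputs = len(input_hidden)
    n_hidden = len(hidden_output)
    raw = [0.0] * n_inputs
    for j in range(n_hidden):
        # Total absolute weight flowing into hidden unit j.
        column_total = sum(abs(input_hidden[i][j]) for i in range(n_inputs))
        if column_total == 0.0:
            continue  # a disconnected hidden unit contributes nothing
        for i in range(n_inputs):
            # Share of hidden unit j attributable to input i,
            # scaled by how strongly unit j drives the output.
            share = abs(input_hidden[i][j]) / column_total
            raw[i] += share * abs(hidden_output[j])
    total = sum(raw)
    if total == 0.0:
        raise ValueError("all weights are zero; importance is undefined")
    return [100.0 * value / total for value in raw]
```

Because only absolute weights are used, the result describes the magnitude of each descriptor's influence, not its direction.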
Figure 2. ParsSNP detects recurrent mutations and mutations in known cancer genes in the pan-cancer test set
A) ParsSNP scores plotted against mutation recurrence for missense mutations (N=182,483). Points are jittered to aid visualization.
B) The association of ParsSNP and the independent tools with log(mutation recurrence) for missense mutations, measured by R-squared.
C) ParsSNP identifies 9,434 recurrent missense mutations better than the independent tools (all DeLong tests p<2.2e-16; AUROCs are depicted).
D) The ability of ParsSNP to detect recurrent missense mutations in the test set is assessed on a gene-by-gene basis. Portrayed genes must be members of the CGC, have at least 25 missense mutations, and have at least 10 mutations in each class. Mutation counts (non-recurrent:recurrent) and 95% confidence intervals are included for each gene.
E) Out of 173,049 non-recurrent missense mutations, ParsSNP identifies the 3,760 that occur in CGC genes significantly better than the independent tools (all DeLong tests p<2.2e-16; AUROCs are depicted).
F) CGC genes were divided into putative oncogenes and putative tumor suppressor genes (TSG) based on the molecular genetic annotation from the CGC dataset (dominant or recessive, respectively). The distribution of ParsSNP scores in the test set is displayed by mutation and gene type, with the number of genes and mutations in each category displayed. ‘Truncation’ events include frameshift, premature stop, nonstop and splice-site changes. ‘Missense’ mutations include missense substitutions as well as inframe insertions/deletions. ‘Silent’ changes include synonymous nucleotide substitutions as well as non-coding variants.
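
The benchmarks in this figure are summarized with AUROC values. As a reminder of what that number measures, here is a minimal sketch that computes AUROC as the probability that a randomly chosen positive (for example, a recurrent mutation) scores higher than a randomly chosen negative, counting ties as one half. The function name and inputs are hypothetical; this is not the authors' evaluation code, and it does not implement the DeLong test used to compare tools.

```python
from typing import Sequence


def auroc(positive_scores: Sequence[float],
          negative_scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise loop is quadratic and
    is meant for illustration; data sets of the size described in the
    caption would call for a rank-based implementation.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both classes need at least one score")
    correct = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(positive_scores) * len(negative_scores))
```

A value of 0.5 corresponds to random ranking, and 1.0 means every positive outranks every negative.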
Figure 3. ParsSNP identifies experimentally validated mutations in external datasets
A) ParsSNP separates 1,138 driver mutations from 49,880 common SNPs in the driver-dbSNP dataset slightly better than FATHMM Cancer (DeLong test p=0.205) and significantly better than the other independent tools (all DeLong tests p<1e-4; AUROCs are depicted).
B) Plot of ParsSNP scores and P53 transactivation activity change for 2,314 mutations in the IARC dataset.
C) The association of ParsSNP and the independent tools with log(P53 activity change), measured by R-squared.
D) ParsSNP identifies 475 disruptive P53 mutations (mutant P53 activity < 25% of wild type) among 2,314 mutations with similar performance to CHASM (DeLong test p=0.39) and Condel (DeLong test p=0.59), while CanDrA, FATHMM Cancer and TransFIC perform worse (all DeLong tests p<0.05; AUROCs are depicted).
E) ParsSNP separates 45 experimentally defined functional variants from 26 experimentally defined neutral variants better than existing methods, with DeLong test p<0.05 for all except CHASM (DeLong test p=0.085).
Figure 4. Comparison of candidate driver mutations in an independent dataset reveals known and likely drivers which are only identified by ParsSNP
ParsSNP, CanDrA and CHASM were applied to the Kakiuchi et al. dataset, which consists of 2,988 protein-coding somatic mutations from 30 diffuse-type gastric carcinoma patients. For each tool, the top 30 predicted drivers (equivalent to 1% of the dataset) were extracted. The overlap between the candidate driver lists from each tool is diagrammed (top left), and the candidate drivers themselves are listed according to the tools that identified them.
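
The comparison described in this caption (take the top 1% of mutations per tool, then see which tools share which candidates) can be sketched as follows. The function names, the dictionary layout mapping mutation identifiers to scores, and the region labels are assumptions made for illustration; how the original analysis broke ties or represented mutations is not stated here.

```python
import heapq
from typing import Dict, Set

# 1% of the 2,988 mutations in the dataset, rounded to the nearest integer.
TOP_K = round(0.01 * 2988)


def top_candidates(scores: Dict[str, float], k: int) -> Set[str]:
    """Return the k mutation identifiers with the highest scores.

    Ties at the cutoff are resolved by heapq.nlargest, which is an
    arbitrary choice for this sketch.
    """
    return set(heapq.nlargest(k, scores, key=scores.get))


def venn_regions(a: Set[str], b: Set[str], c: Set[str]) -> Dict[str, Set[str]]:
    """Split three candidate lists into the seven regions of a Venn diagram."""
    return {
        "a_only": a - b - c,
        "b_only": b - a - c,
        "c_only": c - a - b,
        "a_and_b": (a & b) - c,
        "a_and_c": (a & c) - b,
        "b_and_c": (b & c) - a,
        "all_three": a & b & c,
    }
```

With per-tool score dictionaries in hand, calling `top_candidates` on each with `TOP_K` and passing the three sets to `venn_regions` yields the membership of each region, where `a_only` would hold candidates found by the first tool alone.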
