Nucleic Acids Res. 2019 May 7;47(8):e45.

doi: 10.1093/nar/gkz096.

DriverML: a machine learning algorithm for identifying driver genes in cancer sequencing studies

Yi Han¹, Juze Yang², Xinyi Qian², Wei-Chung Cheng³, Shu-Hsuan Liu³, Xing Hua⁴, Liyuan Zhou², Yaning Yang⁵, Qingbiao Wu⁶, Pengyuan Liu², Yan Lu¹

Affiliations

¹ Center for Uterine Cancer Diagnosis and Therapy Research of Zhejiang Province, Women's Reproductive Health Key Laboratory of Zhejiang Province, Women's Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310006, China.
² Sir Run Run Shaw Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, Hangzhou, Zhejiang 310016, China.
³ Graduate Institute of Biomedical Sciences, Research Center for Tumor Medical Science, and Drug Development Center, China Medical University, Taichung 40402, Taiwan.
⁴ Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, Bethesda, MD 20892, USA.
⁵ Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui 230026, China.
⁶ Department of Mathematics, Zhejiang University, Hangzhou, Zhejiang 310027, China.

PMID: 30773592
PMCID: PMC6486576
DOI: 10.1093/nar/gkz096

Yi Han et al. Nucleic Acids Res. 2019.

Erratum in

Corrigendum to article "DriverML: a machine learning algorithm for identifying driver genes in cancer sequencing studies".
Han Y, Yang J, Qian X, Cheng WC, Liu SH, Hua X, Zhou L, Yang Y, Wu Q, Liu P, Lu Y. Nucleic Acids Res. 2021 Apr 19;49(7):4196. doi: 10.1093/nar/gkab193. PMID: 33744935.

Abstract

Although rapid progress has been made in computational approaches for prioritizing cancer driver genes, research is far from achieving the ultimate goal of discovering a complete catalog of genes truly associated with cancer. Driver gene lists predicted from these computational tools lack consistency and are prone to false positives. Here, we developed an approach (DriverML) integrating Rao's score test and supervised machine learning to identify cancer driver genes. The weight parameters in the score statistics quantified the functional impacts of mutations on the protein. To obtain optimized weight parameters, the score statistics of prior driver genes were maximized on pan-cancer training data. We conducted rigorous and unbiased benchmark analysis and comparisons of DriverML with 20 other existing tools in 31 independent datasets from The Cancer Genome Atlas (TCGA). Our comprehensive evaluations demonstrated that DriverML was robust and powerful among various datasets and outperformed the other tools with a better balance of precision and sensitivity. In vitro cell-based assays further proved the validity of the DriverML prediction of novel driver genes. In summary, DriverML uses an innovative, machine learning-based approach to prioritize cancer driver genes and provides dramatic improvements over currently existing methods. Its source code is available at https://github.com/HelloYiHan/DriverML.

Figures

**Figure 1.**
Computational tools for identifying cancer driver genes. (A) Classification of the 21 driver gene prediction tools evaluated in this study. These widely used tools are classified as frequency-based, hotspot-based and network-based methods. The block size for each tool represents its citation count according to data obtained from the Web of Science on 27 September 2018 (the larger the block, the more citations). MutSigCV, the most frequently cited tool in the literature, has the largest block. Two recent tools, rDriver and SCS (published in 2018), along with DriverML, had no citations and have the smallest blocks. (B) Summary of the main workflow of DriverML. DriverML identifies cancer driver genes by combining a weighted score test with a machine learning approach. The weights in the score statistics (one per mutation type; T represents the total number of mutation types evaluated in this study) quantify the functional impacts of different mutation types on the protein. To assign optimal weights to different types of non-silent mutations, the score statistics of prior driver genes were maximized in pan-cancer training data based on the machine learning approach. U and I represent the Rao score function and Fisher information, respectively. To test cancer driver genes, the score value of each gene was computed using the weighted score statistic with the learned weight parameters. The empirical null distribution of score statistics, from which P-values of tested genes were calculated, was generated by Monte Carlo simulation.
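
The testing step described in panel (B) can be illustrated with a minimal sketch. It is not the DriverML implementation: it assumes a toy model in which the count of each mutation type follows a Poisson distribution with a known background rate, so that the Rao score statistic takes the form U²/I at the null. The function names, weights, rates and counts are all hypothetical, and the statistic here is two-sided for simplicity.

```python
import math
import random


def score_statistic(counts, background, weights):
    """Rao score statistic U^2 / I for a toy weighted Poisson model at theta = 0.

    Toy model (an assumption): n_t ~ Poisson(mu_t * exp(w_t * theta)).
    Then U = sum_t w_t * (n_t - mu_t) and I = sum_t w_t^2 * mu_t.
    """
    u = sum(w * (n - mu) for n, mu, w in zip(counts, background, weights))
    info = sum(w * w * mu for mu, w in zip(background, weights))
    return u * u / info


def draw_poisson(rng, mu):
    """Draw one Poisson(mu) variate with Knuth's method (suitable for small mu)."""
    threshold = math.exp(-mu)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def empirical_p_value(counts, background, weights, n_sim=2000, seed=0):
    """Estimate a P-value from a Monte Carlo null distribution of the statistic."""
    rng = random.Random(seed)
    observed = score_statistic(counts, background, weights)
    exceed = 0
    for _ in range(n_sim):
        # Simulate mutation counts under the null (background rates only).
        simulated = [draw_poisson(rng, mu) for mu in background]
        if score_statistic(simulated, background, weights) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_sim + 1)


if __name__ == "__main__":
    background = [0.5, 0.2, 0.1]   # hypothetical expected counts per mutation type
    weights = [1.0, 2.0, 3.0]      # hypothetical weights per mutation type
    observed_counts = [3, 2, 1]    # hypothetical observed counts for one gene
    print(empirical_p_value(observed_counts, background, weights))
```

In this sketch the weights are fixed inputs; in the workflow described above they are learned from pan-cancer training data before testing.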

**Figure 2.**
Fraction of predicted driver genes present in the CGC. The Cancer Gene Census (CGC) in COSMIC consists of 616 genes containing mutations that have been associated with cancer. The overlap of the predicted driver genes with the CGC was evaluated. Tools are ordered by the median fraction of predicted drivers in the CGC. For each dataset, the fraction for a tool predicting too few genes (<3) was set to zero to avoid an abnormally high overlap fraction. Across the 31 datasets as a whole, DriverML, MutSigCV, DawnRank and rDriver had the highest fractions (42.9%, 33.3%, 33.3% and 33.3%, respectively).
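
The overlap measure in this caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code; the function names, the placeholder gene labels and the tiny reference set are hypothetical.

```python
from statistics import median


def overlap_fraction(predicted, reference, min_predictions=3):
    """Fraction of predicted genes found in a reference list.

    Following the caption, a tool that predicts fewer than
    `min_predictions` genes on a dataset gets a fraction of zero.
    """
    predicted = set(predicted)
    if len(predicted) < min_predictions:
        return 0.0
    return len(predicted & set(reference)) / len(predicted)


def median_overlap(predictions_by_dataset, reference):
    """Median overlap fraction of one tool across datasets."""
    return median(
        overlap_fraction(genes, reference)
        for genes in predictions_by_dataset.values()
    )


if __name__ == "__main__":
    reference = {"TP53", "KRAS", "PIK3CA"}  # tiny stand-in for a reference list
    tool_predictions = {                     # hypothetical per-dataset predictions
        "dataset_1": ["TP53", "KRAS", "GENE_X"],
        "dataset_2": ["TP53", "KRAS"],       # fewer than 3 genes, so scored as 0
        "dataset_3": ["TP53", "KRAS", "PIK3CA"],
    }
    print(median_overlap(tool_predictions, reference))
```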

**Figure 3.**
Fraction of predicted genes present in the list of Mut-driver genes. The Mut-driver list includes 125 genes that were identified from 3284 tumors according to the mutation patterns of the 20/20 rule (Vogelstein *et al.*, 2013). DriverML, MutSigCV, DawnRank and rDriver had the highest fractions (34.8%, 30.6%, 23.3% and 23.3%, respectively).

**Figure 4.**
Fraction of predicted driver genes present in the HiConf list. HiConf includes 99 cancer genes that were manually curated by Kumar *et al.* (2015) through a literature search on OMIM and PubMed. DriverML, MutSigCV and DawnRank were the top three methods (overlap fractions of 30%, 27% and 23.3%, respectively).

**Figure 5.**
Fraction of predicted driver genes for each method by consensus among the methods. The average fraction of predicted driver genes for each method was determined by consensus among the other methods for the 31 datasets. Tools are sorted by the fraction of uniquely predicted drivers (indicated in red) from small to large. OncodriveCLUST was removed because it did not predict any unique driver genes in the 31 datasets. MEMo was also removed because it predicted too few genes. DriverML, MutSigCV and MDPFinder had the smallest fractions of uniquely predicted driver genes (2.5%, 5.2% and 10.8%, respectively). iPAC, SCS and CoMDP had the highest fractions (73.8%, 66.7% and 61.6%, respectively).
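
As a rough illustration of the "uniquely predicted" fraction for a single dataset, the sketch below counts, for each tool, the genes that no other tool predicted. It is an assumption about how such a fraction could be computed, not the authors' procedure; the tool names and gene labels are placeholders.

```python
def unique_fraction(predictions):
    """Per-tool fraction of predicted genes that no other tool predicted.

    `predictions` maps a tool name to its collection of predicted genes.
    Tools with no predictions are skipped.
    """
    fractions = {}
    for tool, genes in predictions.items():
        genes = set(genes)
        if not genes:
            continue
        # Pool the predictions of every other tool.
        others = set()
        for other_tool, other_genes in predictions.items():
            if other_tool != tool:
                others.update(other_genes)
        fractions[tool] = len(genes - others) / len(genes)
    return fractions


if __name__ == "__main__":
    predictions = {  # hypothetical predictions on one dataset
        "tool_a": {"G1", "G2"},
        "tool_b": {"G2", "G3", "G4", "G5"},
        "tool_c": set(),
    }
    # Sort tools from the smallest to the largest unique fraction.
    for tool, frac in sorted(unique_fraction(predictions).items(), key=lambda kv: kv[1]):
        print(tool, frac)
```

Averaging these per-dataset fractions over all datasets would give a per-tool summary comparable in spirit to the one shown in the figure.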

**Figure 6.**
*In vitro* assays of a novel driver gene *NPAT* predicted uniquely by DriverML. (A) The expression of *NPAT* in H520 and H1703 lung cancer cells transfected with siRNA, measured by real-time PCR. (B) CCK-8 cell proliferation assay for lung cancer cells transfected with siRNA. (C) Invasion assay following knockdown of *NPAT* in lung cancer cells. (D) Colony formation assay in lung cancer cells transfected with *NPAT* siRNA or control siRNA. (E) Cell cycle profile of control and *NPAT* knockdown cells. (F) Western blot analysis of protein markers related to the cell cycle in control and *NPAT* knockdown cells. Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) protein was used as a control. All cell assays were performed in triplicate. The error bars indicate the SD of three independent experiments. **P < 0.01, ***P < 0.001 using the two-sided Student’s t test.
