Nucleic Acids Res. 2019 May 7;47(8):e45.
doi: 10.1093/nar/gkz096.

DriverML: a machine learning algorithm for identifying driver genes in cancer sequencing studies

Yi Han et al. Nucleic Acids Res. 2019.

Erratum in

Abstract

Although rapid progress has been made in computational approaches for prioritizing cancer driver genes, research is far from achieving the ultimate goal of discovering a complete catalog of genes truly associated with cancer. Driver gene lists predicted from these computational tools lack consistency and are prone to false positives. Here, we developed an approach (DriverML) integrating Rao's score test and supervised machine learning to identify cancer driver genes. The weight parameters in the score statistics quantified the functional impacts of mutations on the protein. To obtain optimized weight parameters, the score statistics of prior driver genes were maximized on pan-cancer training data. We conducted rigorous and unbiased benchmark analysis and comparisons of DriverML with 20 other existing tools in 31 independent datasets from The Cancer Genome Atlas (TCGA). Our comprehensive evaluations demonstrated that DriverML was robust and powerful among various datasets and outperformed the other tools with a better balance of precision and sensitivity. In vitro cell-based assays further proved the validity of the DriverML prediction of novel driver genes. In summary, DriverML uses an innovative, machine learning-based approach to prioritize cancer driver genes and provides dramatic improvements over currently existing methods. Its source code is available at https://github.com/HelloYiHan/DriverML.

Figures

Figure 1.
Computational tools for identifying cancer driver genes. (A) Classification of the 21 driver gene prediction tools evaluated in this study. These widely used tools are classified as frequency-based, hotspot-based and network-based methods. The block size for each tool represents its citation count according to data obtained from the Web of Science on 27 September 2018 (the larger the block, the more citations). MutSigCV, the most frequently cited tool in the literature, has the largest block. Two recent tools, rDriver and SCS (both published in 2018), along with DriverML, had no citations and have the smallest blocks. (B) Summary of the main workflow of DriverML. DriverML identifies cancer driver genes by combining a weighted score test with a machine learning approach. The weights in the score statistics (one per mutation type, where T is the total number of mutation types evaluated in this study) quantify the functional impacts of different mutation types on the protein. To assign optimal weights to different types of non-silent mutations, the score statistics of prior driver genes were maximized on pan-cancer training data using the machine learning approach. U and I represent the Rao score function and the Fisher information, respectively. To test for cancer driver genes, the score value of each gene was computed from the weighted score statistic with the learned weight parameters. The empirical null distribution of the score statistics, from which the P-values of the tested genes were calculated, was generated by Monte Carlo simulation.
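The testing step in panel (B) can be illustrated with a minimal sketch. It is not DriverML's actual model: it assumes a toy Poisson model in which the count of mutation type t in a gene has null mean mu_t, and the alternative scales that mean by (1 + w_t * theta). Under that assumption the score function at theta = 0 is U = sum_t w_t (n_t - mu_t) and the Fisher information is I = sum_t w_t^2 mu_t. The function names, weights, and background rates below are hypothetical.

```python
import math
import random


def score_statistic(counts, expected, weights):
    """One-sided Rao score statistic U / sqrt(I) for the assumed toy Poisson model."""
    u = sum(w * (n - mu) for w, n, mu in zip(weights, counts, expected))
    info = sum(w * w * mu for w, mu in zip(weights, expected))
    return u / math.sqrt(info)


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for small lam)."""
    limit = math.exp(-lam)
    k = 0
    p = 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def monte_carlo_p_value(counts, expected, weights, n_sim=2000, seed=0):
    """Empirical P-value: share of null simulations scoring at least as high as observed."""
    rng = random.Random(seed)
    observed = score_statistic(counts, expected, weights)
    exceed = 0
    for _ in range(n_sim):
        null_counts = [poisson_draw(rng, mu) for mu in expected]
        if score_statistic(null_counts, expected, weights) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (exceed + 1) / (n_sim + 1)


# Hypothetical inputs for three mutation types (e.g. missense, nonsense, frameshift).
expected = [2.0, 0.5, 0.3]   # assumed background expectations under the null
weights = [1.0, 3.0, 3.0]    # assumed weights; DriverML learns these from training data

p_mutated = monte_carlo_p_value([6, 4, 3], expected, weights)
p_neutral = monte_carlo_p_value([2, 0, 0], expected, weights)
print(p_mutated, p_neutral)
```

Larger weights on a mutation type make an excess of that type contribute more to U, which is the role the caption assigns to the learned weights.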
Figure 2.
Fraction of predicted driver genes present in the CGC. The Cancer Gene Census (CGC) in COSMIC consists of 616 genes containing mutations that have been associated with cancer. The overlap of the predicted driver genes with the CGC was evaluated. Tools are ordered by the median fraction of predicted drivers in the CGC. For each dataset, the fraction for any tool predicting too few genes (<3) was set to zero to avoid an abnormally high overlap fraction. Across the 31 datasets as a whole, DriverML, MutSigCV, DawnRank and rDriver had the highest fractions (42.9%, 33.3%, 33.3% and 33.3%, respectively).
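The overlap metric described in this caption can be sketched as follows. This is an illustration only: the reference set is a tiny stand-in rather than the real 616-gene CGC, and the tool names, placeholder gene names, and function names are hypothetical.

```python
from statistics import median


def overlap_fraction(predicted, reference, min_predictions=3):
    """Fraction of predicted genes found in the reference set.

    Returns 0.0 when fewer than `min_predictions` genes were predicted,
    mirroring the caption's rule for tools that predict too few genes (<3).
    """
    predicted = set(predicted)
    if len(predicted) < min_predictions:
        return 0.0
    return len(predicted & set(reference)) / len(predicted)


def rank_tools_by_median(predictions_by_tool, reference):
    """Order tools by their median overlap fraction across datasets, highest first."""
    medians = {
        tool: median([overlap_fraction(genes, reference) for genes in per_dataset])
        for tool, per_dataset in predictions_by_tool.items()
    }
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)


# Toy stand-in for a curated reference list (not the actual CGC).
reference = {"TP53", "KRAS", "PIK3CA", "EGFR"}

# Hypothetical predictions: one gene list per dataset for each tool.
predictions_by_tool = {
    "tool_a": [["TP53", "KRAS", "GENE_1", "GENE_2"], ["TP53", "EGFR", "PIK3CA"]],
    "tool_b": [["TP53", "KRAS"], ["GENE_1", "GENE_2", "GENE_3", "EGFR"]],
}

ranking = rank_tools_by_median(predictions_by_tool, reference)
print(ranking)
```

Without the minimum-size rule, tool_b's two-gene prediction would score a perfect overlap on its first dataset, which is the distortion the caption's rule guards against.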
Figure 3.
Fraction of predicted genes present in the list of Mut-driver genes. The Mut-driver list includes 125 genes that were identified from 3284 tumors according to the mutation patterns defined by the 20/20 rule (Vogelstein et al., 2013). DriverML, MutSigCV, DawnRank and rDriver had the highest fractions (34.8%, 30.6%, 23.3% and 23.3%, respectively).
Figure 4.
Fraction of predicted driver genes present in the HiConf list. HiConf includes 99 cancer genes that were manually curated by Kumar et al. (2015) through a literature search of OMIM and PubMed. DriverML, MutSigCV and DawnRank were the top three methods (overlap fractions of 30%, 27% and 23.3%, respectively).
Figure 5.
Fraction of predicted driver genes for each method by consensus among the methods. The average fraction of predicted driver genes for each method was determined by consensus among the other methods across the 31 datasets. Tools are sorted by the fraction of uniquely predicted drivers (indicated in red), from smallest to largest. OncodriveCLUST was removed because it did not predict any unique driver genes in the 31 datasets. MEMo was also removed because it predicted too few genes. DriverML, MutSigCV and MDPFinder had the smallest fractions of uniquely predicted driver genes (2.5%, 5.2% and 10.8%, respectively). iPAC, SCS and CoMDP had the highest fractions (73.8%, 66.7% and 61.6%, respectively).
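One way to compute the consensus breakdown in this caption is sketched below for a single dataset; the study averages the fractions over 31 datasets, which this sketch omits. The tool names, placeholder gene names, and function names are hypothetical.

```python
from collections import Counter


def consensus_fractions(predictions):
    """For each tool, map 'number of other tools predicting the gene' to a fraction.

    `predictions` maps a tool name to its set of predicted genes for one dataset.
    Key 0 in a tool's result is its fraction of uniquely predicted genes.
    """
    # How many tools predicted each gene.
    support = Counter(gene for genes in predictions.values() for gene in set(genes))
    result = {}
    for tool, genes in predictions.items():
        genes = set(genes)
        if not genes:
            result[tool] = {}
            continue
        by_others = Counter(support[gene] - 1 for gene in genes)
        result[tool] = {k: count / len(genes) for k, count in sorted(by_others.items())}
    return result


def sort_by_unique_fraction(predictions):
    """Order tools by their fraction of uniquely predicted genes, smallest first."""
    fractions = consensus_fractions(predictions)
    unique = {tool: breakdown.get(0, 0.0) for tool, breakdown in fractions.items()}
    return sorted(unique.items(), key=lambda item: item[1])


# Hypothetical predictions for one dataset.
predictions = {
    "tool_a": {"TP53", "KRAS", "GENE_X"},
    "tool_b": {"TP53", "KRAS"},
    "tool_c": {"TP53", "GENE_Y"},
}

ordering = sort_by_unique_fraction(predictions)
print(ordering)
```

A small unique fraction means most of a tool's predictions are corroborated by at least one other tool, which is how the caption ranks the methods.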
Figure 6.
In vitro assays of NPAT, a novel driver gene predicted uniquely by DriverML. (A) Expression of NPAT, measured by real-time PCR, in H520 and H1703 lung cancer cells transfected with siRNA. (B) CCK-8 cell proliferation assay for lung cancer cells transfected with siRNA. (C) Invasion assay following knockdown of NPAT in lung cancer cells. (D) Colony formation assay in lung cancer cells transfected with NPAT siRNA or control siRNA. (E) Cell cycle profile of control and NPAT knockdown cells. (F) Western blot analysis of cell cycle-related protein markers in control and NPAT knockdown cells. Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) protein was used as the control. All cell assays were performed in triplicate. The error bars indicate the SD of three independent experiments. **P < 0.01, ***P < 0.001 using the two-sided Student’s t test.

