Bioinformatics. 2011 Jun 15;27(12):1595-602.
doi: 10.1093/bioinformatics/btr193. Epub 2011 Apr 14.

PathScan: a tool for discerning mutational significance in groups of putative cancer genes

Michael C Wendl et al. Bioinformatics.

Abstract

Motivation: The expansion of cancer genome sequencing continues to stimulate development of analytical tools for inferring relationships between somatic changes and tumor development. Pathway associations are especially consequential, but existing algorithms are demonstrably inadequate.

Methods: Here, we propose the PathScan significance test for the scenario where pathway mutations collectively contribute to tumor development. Its design addresses two aspects that established methods neglect. First, we account for variations in gene length and the consequent differences in their mutation probabilities under the standard null hypothesis of random mutation. The associated spike in computational effort is mitigated by accurate convolution-based approximation. Second, we combine individual probabilities into a multiple-sample value using Fisher-Lancaster theory, thereby improving differentiation between a few highly mutated genes and many genes having only a few mutations apiece. We investigate accuracy, computational effort and power, reporting acceptable performance for each.
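The two ingredients described in the Methods can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: a single per-nucleotide background rate, an exact sequential convolution in place of the authors' approximate convolution, and the classical Fisher combination, which treats the per-sample P-values as continuous and omits Lancaster's adjustment for discrete data. All function names are hypothetical, and the sketch is not the PathScan implementation (which is written in Perl).

```python
import math


def gene_mutation_prob(length_nt, rate_per_nt):
    """Probability that a gene of the given length carries at least one
    mutation, assuming independent random mutation at a uniform rate."""
    return 1.0 - (1.0 - rate_per_nt) ** length_nt


def mutated_gene_count_pmf(probs):
    """Distribution of the number of mutated genes in one sample.

    Genes differ in length, so their mutation probabilities differ and the
    count follows a Poisson-binomial law. It is built here by convolving
    one Bernoulli factor per gene (exact, O(m^2) for m genes).
    """
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this gene is not mutated
            nxt[k + 1] += mass * p       # this gene is mutated
        pmf = nxt
    return pmf


def upper_tail(pmf, k):
    """P(X >= k): the per-sample P-value for observing k mutated genes."""
    return min(1.0, sum(pmf[k:]))


def fisher_combine(p_values):
    """Classical Fisher combination of n independent P-values.

    The statistic -2 * sum(ln p) is referred to a chi-square distribution
    with 2n degrees of freedom, whose survival function has the closed
    form exp(-x/2) * sum_{i<n} (x/2)^i / i!. Lancaster's correction for
    discrete P-values is deliberately left out (assumption of this sketch).
    """
    n = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # equals x / 2
    term, total = 1.0, 1.0
    for i in range(1, n):
        term *= half / i
        total += term
    return math.exp(-half) * total


def pathway_p_value(gene_lengths, rate_per_nt, mutated_counts):
    """Combine per-sample tail probabilities for one gene set."""
    probs = [gene_mutation_prob(length, rate_per_nt) for length in gene_lengths]
    pmf = mutated_gene_count_pmf(probs)
    return fisher_combine([upper_tail(pmf, k) for k in mutated_counts])
```

A call such as `pathway_p_value([1200, 4500, 800], 3e-6, [1, 0, 2])` would score a hypothetical three-gene set across three samples. Because each sample contributes its own tail probability before combination, the result depends on how mutations are distributed among samples, not only on their pooled total.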

Results: As an example calculation, we re-analyze KEGG-based lung adenocarcinoma pathway mutations from the Tumor Sequencing Project. Our test recapitulates the most significant pathways and finds that others for which the original test battery was inconclusive are not actually significant. It also identifies the focal adhesion pathway as being significantly mutated, a finding consistent with earlier studies. We also expand this analysis to other databases: Reactome, BioCarta, Pfam, PID and SMART, finding additional hits in ErbB and EPHA signaling pathways and regulation of telomerase. All have implications and plausible mechanistic roles in cancer. Finally, we discuss aspects of extending the method to integrate gene-specific background rates and other types of genetic anomalies.

Availability: PathScan is implemented in Perl and is available from the Genome Institute at: http://genome.wustl.edu/software/pathscan.

Figures

Fig. 1.
Floating-point operations as a function of the number of observed mutations for a test set of m = 60 genes for exact (dashed curves) and approximate (solid curves) solutions. Given Equation (5), all curves are symmetric about m / 2, so the plot only shows data up to k = 30.
Fig. 2.
Percent overprediction of P-values from representative small (m = 18, solid curves) and large (m = 60, dashed curves) gene sets. Four scenarios are considered: j = 1 bin with background mutation rate of ρ = 1/Mb (circles), j = 1 and ρ = 3 (diamonds), j = 3 and ρ = 1 (triangles), and j = 3 and ρ = 3 (squares). Test sets were generated with randomly selected lengths between 200 and 15 000 nt.
Fig. 3.
Estimated statistical power as a function of test set size for α = 1%. Solid and dashed curves represent assumed cancer mutation rates of 2-fold and 5-fold higher than the background rate (ρ = 3/Mb), respectively. Dotted curves denote extrapolation beyond computational limits of the 2-fold results based on least-squares fitting (Supplemental Information). All calculations were made using the 3-bin approximate solution on randomly generated test sets having gene lengths between 200 bp and 15 kb. Each datum indicates the average of 100 such trials.
Fig. 4.
Two mutation scenarios for (n, m, b, k) = (2, 4, 0.5, 6). The top panel represents a 4 + 2 configuration, i.e. all 4 genes mutated in one sample (circles) and only two genes mutated in the other sample (triangles), while the bottom panel is a 3 + 3 configuration. Pooled statistics are unable to distinguish between these two scenarios, even though their significance values are appreciably different, i.e. 0.047 for the top panel and 0.063 for the bottom (Wallis, 1942).

