Biol Direct. 2012 Jul 3;7:21.
doi: 10.1186/1745-6150-7-21.

Pathway-based classification of cancer subtypes

Shinuk Kim et al. Biol Direct.

Abstract

Background: Molecular markers based on gene expression profiles have been used in experimental and clinical settings to distinguish cancerous tumors by stage, grade, survival time, metastasis, and drug sensitivity. However, most significant gene markers are unstable (not reproducible) across data sets. We introduce a standardized method for representing cancer markers as 2-level hierarchical feature vectors, with a basic gene level and a second level of (more stable) pathway markers, for the purpose of discriminating cancer subtypes. This extends standard gene expression arrays with new pathway-level activation features obtained directly from off-the-shelf gene set enrichment algorithms such as GSEA. These pathway-based expression arrays are significantly more reproducible across data sets. Such reproducibility will be important for the clinical usefulness of genomic markers and will augment currently accepted cancer classification protocols.
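
As a rough illustration of the 2-level representation described above, the sketch below appends one activation score per pathway to a sample's gene-level vector. The scoring rule (mean expression of the pathway's member genes) is a simplifying assumption that stands in for the GSEA-derived pathway features the authors describe; the function name, gene names, and pathway names are hypothetical.

```python
from statistics import mean


def two_level_features(expression, pathways):
    """Return gene-level values followed by one score per pathway.

    expression: dict mapping gene name -> expression value for one sample.
    pathways:   dict mapping pathway name -> list of member gene names.

    Assumption: the pathway score is the mean expression of member genes,
    a placeholder for a GSEA-based activation score.
    """
    # Level 1: gene-level features in a fixed (sorted) order.
    gene_level = [expression[g] for g in sorted(expression)]

    # Level 2: one score per pathway, also in sorted order.
    pathway_level = []
    for name in sorted(pathways):
        members = [expression[g] for g in pathways[name] if g in expression]
        # Pathways with no measured member genes get a neutral score.
        pathway_level.append(mean(members) if members else 0.0)

    return gene_level + pathway_level


# Toy sample with hypothetical genes and pathways.
sample = {"GENE_A": 2.0, "GENE_B": 4.0, "GENE_C": -1.0}
toy_pathways = {
    "PATHWAY_X": ["GENE_A", "GENE_B"],
    "PATHWAY_Y": ["GENE_C", "GENE_MISSING"],
}
features = two_level_features(sample, toy_pathways)
print(features)
```

The first three entries are the gene-level values and the last two are the pathway-level scores, so a downstream classifier sees both levels in a single vector.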

Results: The present method produced more stable (reproducible) pathway-based markers for discriminating breast cancer metastasis and ovarian cancer survival time. Between two datasets for breast cancer metastasis, the intersection of standard significant gene biomarkers totaled 7.47% of selected genes, compared to 17.65% using pathway-based markers; the corresponding percentages for ovarian cancer datasets were 20.65% and 33.33% respectively. Three pathways, consisting of Type_1_diabetes mellitus, Cytokine-cytokine_receptor_interaction and Hedgehog_signaling (all previously implicated in cancer), are enriched in both the ovarian long survival and breast non-metastasis groups. In addition, integrating pathway and gene information, we identified five (ID4, ANXA4, CXCL9, MYLK, FBXL7) and six (SQLE, E2F1, PTTG1, TSTA3, BUB1B, MAD2L1) known cancer genes significant for ovarian and breast cancer respectively.

Conclusions: Standardizing the analysis of genomic data in the process of cancer staging, classification and analysis is important as it has implications for both pre-clinical as well as clinical studies. The paradigm of diagnosis and prediction using pathway-based biomarkers as features can be an important part of the process of biomarker-based cancer analysis, and the resulting canonical (clinically reproducible) biomarkers can be important in standardizing genomic data. We expect that identification of such canonical biomarkers will improve clinical utility of high-throughput datasets for diagnostic and prognostic applications.

Figures

Figure 1
Schematic diagram of the methods. Black arrows represent the workflow of the GPF method, which used pathway features generated by combinations of GSEA leading edge genes and SVM gene weights. Red arrows represent the GLEG method, which directly classified using leading edge genes obtained from GSEA. Black dashed arrows represent workflow shared between GPF and GLEG. The SPF method used the same workflow as the GPF method with the replacement of pathway selections using GSEA by pathway selections using SVM. The train SVM step uses separate training features for the GPF and the GLEG workflows.
Figure 2
Evaluation of the performances of different gene sets. These implementations include 200 KEGG pathways (KEGG), 522 functional pathways (C2), 200 KEGG pathways with 5 curated ovarian cancer-associated gene sets (KEGG_Ovary), and 522 functional pathways with 5 curated ovarian cancer-associated gene sets (C2_Ovary). All test data sets are extracted from primary ovarian cancer tissue. UNC_sur and BI_sur denote ovarian survival data sets from UNC and BI respectively, and Platinum denotes platinum response data sets.
Figure 3
Comparison of the performances of the RPF, GLEG, GPF, SPF, SKG, and SG methods. These methods are tested in ovarian cancer data sets to discriminate survival time (SUR) and stage (stage). The notation represents use of the following features: RPF, random pathway features; GLEG, leading edge genes using GSEA; GPF, pathway features selected using GSEA; SPF, pathway features selected using SVM; SKG, single KEGG genes; and SG, single genes.
Figure 4
Comparison of metastasis prediction performances based on the Wang and van de Vijver data sets. For each data set, 10 different subsamplings of the data were tested (for the purpose of balancing the data), with leave-one-out cross-validation in each of the 10. The vertical axis shows the average accuracy. Here the RPF, GLEG, GPF, SKG, and SG methods were used.
Figure 5
Cross-validation between two different cohorts. The accuracies of cross-validation between the Wang and van de Vijver data sets and between the UNC and BI survival time data sets. An arrow denotes that genes were selected from the training data sets of one cohort and tested in the other cohort. For example, Wang --> Vijver means that genes were selected from the Wang data set and tested in the van de Vijver data set.
Figure 6
Average of accuracies with respect to different methods. The average of accuracies using the various methods for cross-validation between different data sets, such as the Wang and van de Vijver sets and the UNC and BI sets. The vertical axis represents accuracies averaged over all bar heights in Figure 5.
Figure 7
Agreements of different types of significant markers. The agreements of three different types of significant markers between two data sets: ‘GPF classifier’ denotes pathway features obtained by GSEA, and ‘GLEG classifier’ denotes leading edge gene markers determined by GSEA. ‘SG classifier’ and ‘SKG classifier’ denote the genes determined by Fisher selection from full gene expression profiles and from the set of KEGG pathway genes, respectively, with feature numbers controlled to 20% of each population. ‘SG_random’ denotes gene sets selected randomly (again to 20% of the full gene set) from full gene expression profiles. ‘Ovarian_Stage’ denotes the ovarian stage data sets (marker stability compared between BI and UNC data), ‘Ovarian_SUR’ denotes ovarian survival data sets (BI vs. UNC), and ‘Metastasis’ denotes metastatic breast cancer data sets (based on the Wang and van de Vijver data sets). For each pair of data sets, overlapping biomarkers were all extracted by matching based on the top 40 pathways in each. The vertical axis represents biomarker consistency, defined as the size of the intersection of the two biomarker sets divided by the size of their union.
Figure 8
Consistency of different classes of biomarkers with respect to numbers of candidate pathways. The consistency (overlap level) of different types of top gene/pathway markers based on varying numbers of selected candidate pathways (40, 30, or 20) between UNC and BI ovarian survival data sets. For example, the 30 pathway_sur column heights represent overlap percentages of biomarkers in the survival data sets from BI and UNC (using pathway, leading edge gene, and single gene biomarkers, respectively, all extracted from matching based on the top 30 pathways in the BI and UNC datasets). Vertical axis is defined as in Figure 7.
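
The consistency measure on the vertical axes of Figures 7 and 8 (size of the intersection of two biomarker sets divided by the size of their union) is the Jaccard index. A minimal sketch follows; the function name and the toy marker labels are hypothetical, and returning 0.0 for two empty sets is an assumed convention, not something the captions specify.

```python
def consistency(markers_a, markers_b):
    """Jaccard index of two biomarker collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(markers_a), set(markers_b)
    union = a | b
    if not union:
        # Assumed convention: two empty marker sets have zero consistency.
        return 0.0
    return len(a & b) / len(union)


# Toy example: markers selected independently in two cohorts.
cohort_1 = {"P1", "P2", "P3", "P4"}
cohort_2 = {"P3", "P4", "P5"}
score = consistency(cohort_1, cohort_2)
print(f"{score:.2%}")
```

Here two markers are shared out of five distinct markers overall, so the consistency is 2/5.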
