PLoS One. 2009 Dec 7;4(12):e8161.

doi: 10.1371/journal.pone.0008161.

Accurate and reliable cancer classification based on probabilistic inference of pathway activity

Junjie Su¹, Byung-Jun Yoon, Edward R Dougherty

PMID: 19997592
PMCID: PMC2781165
DOI: 10.1371/journal.pone.0008161

Junjie Su et al. PLoS One. 2009.

Affiliation

¹ Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas, United States of America.

Abstract

With the advent of high-throughput technologies for measuring genome-wide expression profiles, a large number of methods have been proposed for discovering diagnostic markers that can accurately discriminate between different classes of a disease. However, factors such as the small sample size of typical clinical data, the inherent noise in high-throughput measurements, and the heterogeneity across different samples often make it difficult to find reliable gene markers. To overcome this problem, several studies have proposed the use of pathway-based markers, instead of individual gene markers, for building the classifier. Given a set of known pathways, these methods estimate the activity level of each pathway by summarizing the expression values of its member genes, and use the pathway activities for classification. It has been shown that pathway-based classifiers typically yield more reliable results than traditional gene-based classifiers. In this paper, we propose a new classification method based on probabilistic inference of pathway activities. For a given sample, we compute the log-likelihood ratio between different disease phenotypes based on the expression level of each gene. The activity of a given pathway is then inferred by combining the log-likelihood ratios of the constituent genes. We apply the proposed method to the classification of breast cancer metastasis, and show that it achieves higher accuracy and identifies more reproducible pathway markers than several existing pathway activity inference methods.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Probabilistic inference of pathway activity.**
For each gene in the pathway, we estimate the conditional probability density functions (PDFs) under different phenotypes. Based on the estimated PDFs, we transform the expression values of the member genes into log-likelihood ratios (LLRs) to obtain an LLR matrix from the gene expression matrix. The LLR matrix is then normalized, and the pathway activity is inferred by combining the normalized LLRs of its member genes.
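
The steps in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes Gaussian conditional PDFs, per-gene z-score normalization of the LLRs across samples, and a plain sum as the combining rule. All function names and the toy data are hypothetical.

```python
import math
from statistics import mean, pstdev

def gaussian_logpdf(x, mu, sigma):
    """Log-density of a normal distribution (assumed form of the conditional PDF)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def fit_gene(values, labels):
    """Estimate (mean, std) of one gene's expression under each phenotype (0 or 1)."""
    params = {}
    for c in (0, 1):
        xs = [v for v, y in zip(values, labels) if y == c]
        params[c] = (mean(xs), max(pstdev(xs), 1e-6))  # floor avoids division by zero
    return params

def llr(x, params):
    """Log-likelihood ratio of phenotype 1 versus phenotype 0 for one expression value."""
    (mu1, s1), (mu0, s0) = params[1], params[0]
    return gaussian_logpdf(x, mu1, s1) - gaussian_logpdf(x, mu0, s0)

def pathway_activity(expr, labels):
    """expr holds one list of per-sample expression values per member gene.
    Returns one activity value per sample."""
    activity = [0.0] * len(labels)
    for gene_values in expr:
        params = fit_gene(gene_values, labels)
        llrs = [llr(x, params) for x in gene_values]      # one row of the LLR matrix
        m, s = mean(llrs), max(pstdev(llrs), 1e-6)
        for j, v in enumerate(llrs):
            activity[j] += (v - m) / s                    # normalize, then sum (assumed rule)
    return activity

# Toy pathway with two member genes and six samples (three per phenotype).
expr = [
    [1.0, 1.2, 0.9, 3.0, 3.2, 2.9],
    [5.0, 5.1, 4.8, 2.0, 2.2, 1.9],
]
labels = [0, 0, 0, 1, 1, 1]
activity = pathway_activity(expr, labels)
```

In this toy example the two genes move in opposite directions between phenotypes, yet both contribute positive LLRs for phenotype-1 samples, so the summed activity separates the classes. In practice the PDFs would be estimated on training samples only and then applied to held-out samples.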

**Figure 2. Illustration of the experimental set-up.**
(A) In the within-dataset experiments, part of the training set, referred to as the marker-evaluation set, is used for ranking the pathway markers according to their discriminative power and building the classifier. The optimal set of features is selected based on the remainder of the training set, referred to as the feature-selection set. The performance of the resulting classifier is evaluated using the test dataset. (B) In the cross-dataset experiments, one of the datasets is used to find the optimal set of features, and the other dataset is used to build a classifier based on the preselected features and to evaluate the classifier.
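
A rough sketch of the within-dataset partition in panel (A) follows. The caption does not give the split proportions, so the five-fold layout used here (three folds for marker evaluation, one for feature selection, one for testing) is an assumption, and `three_way_split` is a hypothetical helper name.

```python
import random

def three_way_split(n_samples, n_folds=5, seed=0):
    """Partition sample indices into marker-evaluation, feature-selection, and test sets.
    The 3/1/1 fold allocation is an illustrative assumption."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    test = folds[0]
    feature_selection = folds[1]
    marker_evaluation = [i for fold in folds[2:] for i in fold]
    return marker_evaluation, feature_selection, test

marker_eval, feat_sel, test = three_way_split(50)
```

Keeping the three index sets disjoint is the point of the design: markers are ranked and the classifier is built on one part, the number of features is tuned on another, and the test set is never touched until the final evaluation.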

**Figure 3. Discriminative power of prescreened pathway markers and single gene markers.**
(A) Mean absolute t-score of the top markers for the Netherlands breast cancer dataset. Pathway activities have been inferred using five different methods: CORG, PCA, mean, median, and LLR (proposed method). The discriminative power of the top gene markers was estimated for comparison (labeled as “Gene”). (B) Mean absolute t-score of the top markers for the USA breast cancer dataset. (C) The markers were ranked based on the Netherlands dataset and the mean absolute t-score of the top markers was computed based on the USA dataset. (D) The markers were ranked based on the USA dataset and the mean absolute t-score of the top markers was computed based on the Netherlands dataset.

**Figure 4. Discriminative power of all pathway markers and gene markers.**
(A) Mean absolute t-score of the top markers for the Netherlands dataset. (B) Mean absolute t-score of the top markers for the USA dataset. (C) The markers were ranked based on the Netherlands dataset and the mean absolute t-score of the top markers was computed based on the USA dataset. (D) The markers were ranked based on the USA dataset and the mean score of the top markers was computed based on the Netherlands dataset.
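
The discriminative-power measure used in Figures 3 and 4 can be illustrated as follows. The sketch assumes the score is a two-sample t-statistic and uses the Welch (unequal-variance) form, which may differ from the statistic used in the paper. Function names and the toy values are hypothetical.

```python
import math
from statistics import mean, variance

def t_score(values, labels):
    """Welch two-sample t-statistic of one marker between phenotype 1 and phenotype 0."""
    a = [v for v, y in zip(values, labels) if y == 1]
    b = [v for v, y in zip(values, labels) if y == 0]
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def mean_abs_t_of_top(markers, labels, k):
    """Rank markers by |t| and average the top k (same dataset, as in panels A and B)."""
    scores = sorted((abs(t_score(m, labels)) for m in markers), reverse=True)
    return sum(scores[:k]) / k

def cross_dataset_mean_abs_t(rank_markers, rank_labels, eval_markers, eval_labels, k):
    """Rank on one dataset, then score the same top-k markers on the other (panels C and D)."""
    order = sorted(range(len(rank_markers)),
                   key=lambda i: abs(t_score(rank_markers[i], rank_labels)),
                   reverse=True)
    return sum(abs(t_score(eval_markers[i], eval_labels)) for i in order[:k]) / k

# Toy data: two markers measured on six samples.
labels = [0, 0, 0, 1, 1, 1]
markers = [
    [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],   # strongly discriminative
    [1.0, 5.0, 2.0, 4.0, 3.0, 6.0],   # weakly discriminative
]
top1 = mean_abs_t_of_top(markers, labels, 1)
top2 = mean_abs_t_of_top(markers, labels, 2)
```

The cross-dataset variant is the stricter test of reproducibility, because a marker that ranks highly only by chance in one dataset will tend to score poorly in the other.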

**Figure 5. Performance of different classification methods.**
The bar charts show the average AUCs for different classification methods. Five pathway-based methods that use distinct pathway activity inference schemes (LLR, CORG, PCA, mean, and median) and a gene-based method were compared. (A) Classifiers were constructed based on logistic regression. Results of within-dataset experiments based on the USA and Netherlands datasets are shown in the two charts on the left. The two charts on the right show the results of the cross-dataset experiments. (B) The performance of different classification methods based on LDA (linear discriminant analysis).

**Figure 6. Performance of different classification methods.**
The bar charts show the average AUCs of within-dataset experiments for five pathway-based methods (LLR, CORG, PCA, mean, and median) and a gene-based method. In these experiments, the top 50 pathways have been reselected in every experiment using the designated training set. (A) Classification results based on logistic regression. (B) Classification results based on LDA (linear discriminant analysis).

**Figure 7. Robustness of the proposed classification scheme.**
To assess the robustness of the proposed classification scheme, two-fold cross-validation experiments have been performed, where we measured the change in classification error after interchanging the training and test sets. (A) Cumulative distribution of the error difference for the USA dataset. (B) Cumulative distribution of the error difference for the Netherlands dataset.
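
The robustness check described here can be sketched as below. A simple midpoint-threshold classifier on a one-dimensional activity value stands in for the logistic regression and LDA classifiers mentioned in the other captions. The stratified halving, the use of the absolute error difference, and the simulated data are all assumptions made for illustration.

```python
import random
from statistics import mean

def train_threshold(xs, ys):
    """Stand-in classifier: threshold halfway between the two class means."""
    m1 = mean(x for x, y in zip(xs, ys) if y == 1)
    m0 = mean(x for x, y in zip(xs, ys) if y == 0)
    return (m0 + m1) / 2, (1 if m1 >= m0 else -1)

def error_rate(model, xs, ys):
    thr, sign = model
    preds = [1 if sign * (x - thr) > 0 else 0 for x in xs]
    return sum(p != y for p, y in zip(preds, ys)) / len(ys)

def stratified_halves(ys, rng):
    """Split indices into two halves with both classes present in each."""
    half_a, half_b = [], []
    for c in (0, 1):
        idx = [i for i, y in enumerate(ys) if y == c]
        rng.shuffle(idx)
        half_a += idx[: len(idx) // 2]
        half_b += idx[len(idx) // 2:]
    return half_a, half_b

def swap_error_difference(xs, ys, rng):
    """Absolute change in test error when training and test halves are interchanged."""
    a, b = stratified_halves(ys, rng)
    def err(train, test):
        model = train_threshold([xs[i] for i in train], [ys[i] for i in train])
        return error_rate(model, [xs[i] for i in test], [ys[i] for i in test])
    return abs(err(a, b) - err(b, a))

def empirical_cdf(values, grid):
    """Fraction of values at or below each grid point."""
    return [sum(v <= g for v in values) / len(values) for g in grid]

# Simulated one-dimensional pathway activities for 40 samples.
rng = random.Random(0)
ys = [0] * 20 + [1] * 20
xs = [rng.gauss(0.0, 1.0) for _ in range(20)] + [rng.gauss(1.5, 1.0) for _ in range(20)]
diffs = [swap_error_difference(xs, ys, rng) for _ in range(200)]
grid = [0.0, 0.05, 0.1, 0.2, 0.5, 1.0]
cdf = empirical_cdf(diffs, grid)
```

A cumulative distribution that rises quickly toward 1 at small error differences indicates that the classifier's error depends little on which half is used for training, which is the sense of robustness examined in this figure.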

Cited by

The cure: design and evaluation of a crowdsourcing game for gene selection for breast cancer survival prediction.
Good BM, Loguercio S, Griffith OL, Nanis M, Wu C, Su AI. JMIR Serious Games. 2014 Jul 29;2(2):e7. doi: 10.2196/games.3350. PMID: 25654473. Free PMC article.
Robustness evaluations of pathway activity inference methods on gene expression data.
Hui TX, Kasim S, Aziz IA, Fudzee MFM, Haron NS, Sutikno T, Hassan R, Mahdin H, Sen SC. BMC Bioinformatics. 2024 Jan 12;25(1):23. doi: 10.1186/s12859-024-05632-w. PMID: 38216898. Free PMC article.
Clustering gene expression regulators: new approach to disease subtyping.
Pyatnitskiy M, Mazo I, Shkrob M, Schwartz E, Kotelnikova E. PLoS One. 2014 Jan 9;9(1):e84955. doi: 10.1371/journal.pone.0084955. eCollection 2014. PMID: 24416320. Free PMC article.
Bayesian Gene Selection Based on Pathway Information and Network-Constrained Regularization.
Cao M, Fan Y, Peng Q. Comput Math Methods Med. 2021 Aug 4;2021:7471516. doi: 10.1155/2021/7471516. eCollection 2021. PMID: 34394707. Free PMC article.
Pathway-based analyses of gene expression profiles at low doses of ionizing radiation.
Luo X, Niyakan S, Johnstone P, McCorkle S, Park G, López-Marrero V, Yoo S, Dougherty ER, Qian X, Alexander FJ, Jha S, Yoon BJ. Front Bioinform. 2024 May 14;4:1280971. doi: 10.3389/fbinf.2024.1280971. eCollection 2024. PMID: 38812660. Free PMC article.
