PLoS One. 2009 Dec 7;4(12):e8161.
doi: 10.1371/journal.pone.0008161.

Accurate and reliable cancer classification based on probabilistic inference of pathway activity


Junjie Su et al. PLoS One. 2009.

Abstract

With the advent of high-throughput technologies for measuring genome-wide expression profiles, a large number of methods have been proposed for discovering diagnostic markers that can accurately discriminate between different classes of a disease. However, factors such as the small sample size of typical clinical data, the inherent noise in high-throughput measurements, and the heterogeneity across different samples often make it difficult to find reliable gene markers. To overcome this problem, several studies have proposed the use of pathway-based markers, instead of individual gene markers, for building the classifier. Given a set of known pathways, these methods estimate the activity level of each pathway by summarizing the expression values of its member genes, and use the pathway activities for classification. It has been shown that pathway-based classifiers typically yield more reliable results than traditional gene-based classifiers. In this paper, we propose a new classification method based on probabilistic inference of pathway activities. For a given sample, we compute the log-likelihood ratio between different disease phenotypes based on the expression level of each gene. The activity of a given pathway is then inferred by combining the log-likelihood ratios of the constituent genes. We apply the proposed method to the classification of breast cancer metastasis and show that it achieves higher accuracy and identifies more reproducible pathway markers than several existing pathway activity inference methods.
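The inference step described in the abstract (per-gene log-likelihood ratios combined into one pathway activity per sample) can be illustrated with a minimal sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: each gene's class-conditional density is modeled as a Gaussian, the normalization mentioned in the Figure 1 caption is taken to be a z-score across samples, the LLRs are combined by summation, and all function names and the synthetic data are hypothetical.

```python
import math
import random
import statistics


def gaussian_logpdf(x, mean, std):
    """Log density of a normal distribution N(mean, std^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - (x - mean) ** 2 / (2.0 * std * std)


def fit_gene(values, labels):
    """Estimate (mean, std) of one gene's expression under each phenotype (0 or 1).

    Assumption: a Gaussian density per phenotype; the source only says that
    conditional PDFs are estimated.
    """
    params = {}
    for phenotype in (0, 1):
        xs = [v for v, y in zip(values, labels) if y == phenotype]
        params[phenotype] = (statistics.mean(xs), statistics.stdev(xs))
    return params


def gene_llrs(values, params):
    """Log-likelihood ratio of phenotype 1 versus phenotype 0 for each sample."""
    return [gaussian_logpdf(v, *params[1]) - gaussian_logpdf(v, *params[0]) for v in values]


def normalize(llrs):
    """Z-score one gene's LLRs across samples (an assumed normalization)."""
    mu = statistics.mean(llrs)
    sd = statistics.pstdev(llrs) or 1.0  # guard against a constant row
    return [(value - mu) / sd for value in llrs]


def pathway_activity(expr, labels):
    """expr is a list of per-gene expression lists (genes x samples).

    Returns one activity value per sample: the sum of the normalized LLRs
    of the pathway's member genes.
    """
    normalized = [normalize(gene_llrs(gene, fit_gene(gene, labels))) for gene in expr]
    return [sum(column) for column in zip(*normalized)]


# Synthetic pathway with 3 member genes and 40 samples (20 per phenotype).
# Genes are shifted upward by 2 units in phenotype 1.
rng = random.Random(0)
labels = [0] * 20 + [1] * 20
expr = [[rng.gauss(2.0 * y, 1.0) for y in labels] for _ in range(3)]
activity = pathway_activity(expr, labels)
```

For brevity the sketch fits the densities and scores the same samples. In a real evaluation the densities would be estimated on training samples only and then applied to held-out samples.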


Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Probabilistic inference of pathway activity.
For each gene in the pathway, we estimate the conditional probability density functions (PDFs) under different phenotypes. Based on the estimated PDFs, we transform the expression values of the member genes into log-likelihood ratios (LLRs) to obtain an LLR matrix from the gene expression matrix. The LLR matrix is then normalized, and the pathway activity is inferred by combining the normalized LLRs of its member genes.
Figure 2. Illustration of the experimental set-up.
(A) In the within-dataset experiments, part of the training set, referred to as the marker-evaluation set, is used for ranking the pathway markers according to their discriminative power and for building the classifier. The optimal set of features is selected based on the remainder of the training set, referred to as the feature-selection set. The performance of the resulting classifier is evaluated using the test dataset. (B) In the cross-dataset experiments, one of the datasets is used to find the optimal set of features, and the other dataset is used to build a classifier based on the preselected features and to evaluate the classifier.
Figure 3. Discriminative power of prescreened pathway markers and single gene markers.
(A) Mean absolute formula image-score of the top formula image markers for the Netherlands breast cancer dataset. Pathway activities have been inferred using five different methods: CORG, PCA, mean, median, and LLR (proposed method). The discriminative power of the top gene markers was estimated for comparison (labeled as “Gene”). (B) Mean absolute formula image-score of the top markers for the USA breast cancer dataset. (C) The markers were ranked based on the Netherlands dataset and the mean absolute formula image-score of the top formula image markers was computed based on the USA dataset. (D) The markers were ranked based on the USA dataset and the mean absolute formula image-score of the top markers was computed based on the Netherlands dataset.
Figure 4. Discriminative power of all pathway markers and gene markers.
(A) Mean absolute formula image-score of the top formula image markers for the Netherlands dataset. (B) Mean absolute formula image-score of the top markers for the USA dataset. (C) The markers were ranked based on the Netherlands dataset and the mean absolute formula image-score of the top formula image markers was computed based on the USA dataset. (D) The markers were ranked based on the USA dataset and the mean score of the top formula image markers was computed based on the Netherlands dataset.
Figure 5. Performance of different classification methods.
The bar charts show the average AUCs for different classification methods. Five pathway-based methods that use distinct pathway activity inference schemes (LLR, CORG, PCA, mean, and median) and a gene-based method were compared. (A) Classifiers were constructed based on logistic regression. Results of within-dataset experiments based on the USA and Netherlands datasets are shown in the two charts on the left. The two charts on the right show the results of the cross-dataset experiments. (B) The performance of different classification methods based on LDA (linear discriminant analysis).
Figure 6. Performance of different classification methods.
The bar charts show the average AUCs of within-dataset experiments for five pathway-based methods (LLR, CORG, PCA, mean, and median) and a gene-based method. In these experiments, the top 50 pathways have been reselected in every experiment using the designated training set. (A) Classification results based on logistic regression. (B) Classification results based on LDA (linear discriminant analysis).
Figure 7. Robustness of the proposed classification scheme.
To assess the robustness of the proposed classification scheme, two-fold cross-validation experiments have been performed, where we measured the change in classification error after interchanging the training and test sets. (A) Cumulative distribution of the error difference for the USA dataset. (B) Cumulative distribution of the error difference for the Netherlands dataset.
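The robustness check in Figure 7 (swap the training and test halves, then record how much the classification error changes) can be sketched as follows. This is only an illustration under stated assumptions: the classifier is a hypothetical one-dimensional midpoint threshold rather than the logistic regression or LDA used in the paper, the change in error is taken as an absolute difference, the split is stratified by class, and the data are synthetic.

```python
import random
import statistics


def train_threshold(xs, ys):
    """Stand-in classifier: threshold at the midpoint between the two class means."""
    m0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    m1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (m0 + m1) / 2.0, m1 >= m0


def error_rate(model, xs, ys):
    """Fraction of samples misclassified by the threshold model."""
    threshold, one_is_high = model
    wrong = 0
    for x, y in zip(xs, ys):
        predicted = int((x > threshold) == one_is_high)
        wrong += predicted != y
    return wrong / len(ys)


def swap_error_difference(xs, ys, rng):
    """One two-fold run: |error(train A, test B) - error(train B, test A)|."""
    fold_a, fold_b = [], []
    for label in (0, 1):  # stratified split so both folds contain both classes
        idx = [i for i, y in enumerate(ys) if y == label]
        rng.shuffle(idx)
        half = len(idx) // 2
        fold_a += idx[:half]
        fold_b += idx[half:]

    def take(indices):
        return [xs[i] for i in indices], [ys[i] for i in indices]

    xa, ya = take(fold_a)
    xb, yb = take(fold_b)
    err_ab = error_rate(train_threshold(xa, ya), xb, yb)
    err_ba = error_rate(train_threshold(xb, yb), xa, ya)
    return abs(err_ab - err_ba)


def empirical_cdf(values, t):
    """Fraction of values that are less than or equal to t."""
    return sum(v <= t for v in values) / len(values)


# Synthetic one-dimensional feature (for example, a single pathway activity).
rng = random.Random(1)
ys = [0] * 20 + [1] * 20
xs = [rng.gauss(1.5 * y, 1.0) for y in ys]
diffs = [swap_error_difference(xs, ys, rng) for _ in range(200)]
```

Evaluating `empirical_cdf(diffs, t)` over a grid of `t` values gives a cumulative distribution of the error difference. A curve that rises quickly near zero indicates that the error changes little when the two halves are interchanged.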
