Bioinformatics. 2015 Jun 15;31(12):i311-9.
doi: 10.1093/bioinformatics/btv255.

FERAL: network-based classifier with application to breast cancer outcome prediction

Amin Allahyar et al. Bioinformatics. 2015.

Abstract

Motivation: Breast cancer outcome prediction based on gene expression profiles is an important strategy for personalized patient care. To improve the performance and the consistency of the markers discovered by the initial molecular classifiers, network-based outcome prediction methods (NOPs) have been proposed. In spite of the initial claims, recent studies revealed that neither performance nor consistency can be improved using these methods. NOPs typically rely on the construction of meta-genes by averaging the expression of several genes connected in a network that encodes protein interactions or pathway information. In this article, we expose several fundamental issues in NOPs that impede prediction power and the consistency of discovered markers, and that obscure biological interpretation.

Results: To overcome these issues, we propose FERAL, a network-based classifier that hinges upon the Sparse Group Lasso (SGL), which performs simultaneous selection of marker genes and training of the prediction model. An important feature of FERAL, and a significant departure from existing NOPs, is that it uses multiple operators to summarize genes into meta-genes. This gives the classifier the opportunity to select the most relevant meta-gene for each gene set. Extensive evaluation revealed that the discovered markers are markedly more stable across independent datasets. Moreover, interpretation of the marker genes detected by FERAL reveals valuable mechanistic insight into the etiology of breast cancer.

Availability and implementation: All code is available for download at: http://homepage.tudelft.nl/53a60/resources/FERAL/FERAL.zip.
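
The Sparse Group Lasso named in the Results combines an element-wise L1 penalty with a group-wise L2 penalty, so an entire gene set can be dropped while individual coefficients inside a retained set can still be set to zero. The sketch below shows the standard proximal step for that combined penalty on a toy coefficient vector. It is a generic illustration and not the FERAL code: the function names, the sqrt(group size) weighting, the non-overlapping groups and the toy numbers are all assumptions.

```python
import math

def soft_threshold(x, t):
    """Element-wise L1 shrinkage: move x toward zero by t, clipping at zero."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def sgl_prox(beta, groups, lam_l1, lam_group):
    """Proximal step for lam_l1*||b||_1 + lam_group*sum_g sqrt(|g|)*||b_g||_2.

    beta   : list of coefficients
    groups : list of index lists, one per gene set (assumed non-overlapping)
    """
    out = [0.0] * len(beta)
    for idx in groups:
        # Step 1: lasso shrinkage of each coefficient inside the group.
        shrunk = [soft_threshold(beta[i], lam_l1) for i in idx]
        # Step 2: group shrinkage; the whole group vanishes if its norm is small.
        norm = math.sqrt(sum(v * v for v in shrunk))
        thresh = lam_group * math.sqrt(len(idx))
        scale = max(1.0 - thresh / norm, 0.0) if norm > 0 else 0.0
        for i, v in zip(idx, shrunk):
            out[i] = scale * v
    return out

# Toy example: the first group survives (shrunken), the second is removed entirely.
result = sgl_prox([3.0, 4.0, 0.1, -0.2], [[0, 1], [2, 3]], lam_l1=0.5, lam_group=1.0)
```

In this toy run the weak second group is zeroed as a block, which is the mechanism that lets selection of gene sets and fitting of coefficients happen in one optimization.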

Figures

Fig. 1.
Overview of the proposed model (FERAL). (a) Current models follow a similar path in which several nearby genes (according to a given network) are selected and then integrated using an average operator resulting in a meta-gene. These meta-genes are then ranked based on a pre-defined scoring function and top candidates are presented to the final classifier. (b) Instead of being limited to average-based meta-genes, FERAL computes several meta-genes using different operators and employs the SGL to select the most appropriate meta-gene for each specific gene set while simultaneously performing selection, integration and classification
Fig. 2.
Evaluation of different integration operators. (a) Visualization of the consistency in the direction of association with the target label for connected gene pairs in the I2D network. The x-axis represents the magnitude of difference, defined as abs(Ca − Cb) × Sgn(Ca × Cb), where Cx denotes the correlation between gene x and the target label and Sgn is the sign function. The y-axis is the correlation between the two genes (see Supplementary Section S3 for details). (b) Performance comparison between 11 operators (from left to right): average, average of differences between the seed gene and its interactors (implemented in Taylor), variance, minimum, maximum, median, regression, lasso, DA2, Decision Tree (DT) and support vector machine with an RBF kernel. To generate each violin plot, 5000 randomly selected seed genes and their 9 closest neighbors according to the I2D network were integrated into a meta-gene using one of the operators, and the predictive performance (AUC) was determined. The y-axis represents the improvement, expressed as the log ratio of the AUC obtained with the meta-gene to the highest AUC among the individual genes. This comparison shows that other operators can provide similar or even better performance than the average operator. Interestingly, adjusting the direction of genes before taking the average can improve the performance considerably
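
The comparison in panel (b) can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the operator table covers only the simple unsupervised operators, `sign_adjusted_average` is a hypothetical stand-in for "adjusting the direction of genes before taking the average" (it is not claimed to be the DA2 operator), the natural logarithm is assumed for the log ratio, and the individual-gene AUC is used as is, without direction correction.

```python
import math
import statistics

def auc(scores, labels):
    """AUC as the probability that a positive sample outscores a negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Simple per-sample summaries of a gene set (assumed subset of the 11 operators).
OPERATORS = {
    "average": statistics.fmean,
    "median": statistics.median,
    "minimum": min,
    "maximum": max,
    "variance": statistics.pvariance,
}

def meta_gene(genes, operator):
    """genes: list of per-gene expression lists over the same samples; one value per sample."""
    return [operator(sample) for sample in zip(*genes)]

def sign_adjusted_average(genes, labels):
    """Flip genes that correlate negatively with the label, then average (hypothetical helper)."""
    flipped = [[-v for v in g] if pearson(g, labels) < 0 else list(g) for g in genes]
    return meta_gene(flipped, statistics.fmean)

def log_ratio_improvement(genes, labels, meta):
    """log(AUC of the meta-gene / highest AUC among the individual genes)."""
    best_single = max(auc(g, labels) for g in genes)
    return math.log(auc(meta, labels) / best_single)

labels = [0, 0, 0, 1, 1, 1]
up = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0]    # rises with the label
down = [6.0, 5.0, 3.0, 4.0, 2.0, 1.0]  # falls with the label
plain = meta_gene([up, down], OPERATORS["average"])
adjusted = sign_adjusted_average([up, down], labels)
```

With these two toy genes the plain average is constant across samples, because the opposite directions cancel, whereas the sign-adjusted average keeps the signal. In a real evaluation the sign adjustment would have to be estimated on training samples only.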
Fig. 3.
Schematic of the training and testing procedures of FERAL. (a) In the first step, 10 genes are selected using the given network. (b) The corresponding genes in the expression dataset are selected and normalized using the z-score. (c) Meta-genes are computed using the expression profiles of the gene set and the target label (in the case of a supervised integration). The expression of the individual genes is retained within the gene set. (d) The SGL is trained using the training samples. (e) Test samples are used to assess the prediction performance (in terms of AUC) in the current fold
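
Steps (b) and (c) can be sketched as building one feature group per gene set: the z-scored individual genes plus one meta-gene per operator. This is a minimal sketch under assumptions, not the released implementation: the function names, the gene names, the use of the population standard deviation and the two operators shown are all illustrative choices.

```python
import statistics

def zscore(values):
    """Standardize one gene across samples (population standard deviation assumed)."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]

def build_group(expression, gene_set, operators):
    """Return the features of one group: individual genes plus their meta-genes.

    expression : dict mapping gene name -> list of expression values (one per sample)
    gene_set   : gene names selected from the network for this group
    operators  : dict mapping operator name -> function summarizing one sample
    """
    z = {g: zscore(expression[g]) for g in gene_set}
    group = dict(z)  # the individual genes are retained within the gene set
    samples = list(zip(*(z[g] for g in gene_set)))
    for name, op in operators.items():
        group[f"meta_{name}"] = [op(s) for s in samples]
    return group

# Hypothetical toy data: three samples, two genes in the gene set.
expression = {"GENE_A": [1.0, 2.0, 3.0], "GENE_B": [2.0, 4.0, 6.0], "GENE_C": [5.0, 5.0, 5.0]}
group = build_group(expression, ["GENE_A", "GENE_B"],
                    {"average": statistics.fmean, "maximum": max})
```

Each such group would then be passed to the group-sparse classifier as one block, so the model can keep a meta-gene, an individual gene, or nothing from a given gene set.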
Fig. 4.
Performance evaluation (AUC). Performance of the methods under study for the PPI network (I2D), a co-expression network (Co-Expr) and a random network (Random). We also added the result obtained when a classical Lasso is employed (Single). Error bars denote the 95% confidence interval. The heatmaps indicate the P values of paired t-tests comparing the AUCs of the individual CV folds for each pair of methods. (a) Sub-type stratified CV. (b) Sampled leave-one-study-out CV
Fig. 5.
Stability measurement (using Fisher’s exact test) for three different networks: I2D, Co-Expr and a random network. The original versions of the standard methods produced a much lower overlap between folds due to pre-ranking of meta-genes. Similarly, Lasso produced a low overlap due to random selection of correlated features. FERAL obtained a higher gene set stability across folds for the I2D and Co-Expr networks
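
One common way to score the overlap between the gene sets selected in two folds is a one-sided Fisher's exact test, which reduces to a hypergeometric tail probability. The sketch below assumes that formulation; the caption does not spell out how the test was set up, so the function name, the choice of a one-sided test and the toy sets are assumptions.

```python
from math import comb

def overlap_pvalue(set_a, set_b, universe_size):
    """One-sided Fisher's exact test for the overlap of two selected gene sets.

    Returns P(overlap >= observed) when set_b is drawn at random from a universe
    of `universe_size` genes that contains the len(set_a) genes of set_a.
    """
    a, b = len(set_a), len(set_b)
    k = len(set(set_a) & set(set_b))  # observed overlap
    total = comb(universe_size, b)
    # Sum the hypergeometric probabilities of all overlaps at least as large as k.
    tail = sum(comb(a, i) * comb(universe_size - a, b - i)
               for i in range(k, min(a, b) + 1))
    return tail / total

# Toy example with a universe of 10 genes.
p_identical = overlap_pvalue({"g1", "g2", "g3"}, {"g1", "g2", "g3"}, 10)
p_disjoint = overlap_pvalue({"g1", "g2", "g3"}, {"g4", "g5", "g6"}, 10)
```

A small P value indicates that two folds share more markers than random selection would explain, so lower values correspond to more stable gene sets.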
Fig. 6.
Gene enrichment. (a) Gene enrichment of the top genes for each method when the I2D network is employed. The values on top of each group represent the number of genes in each gene set. A notably increased enrichment is obtained using the gene sets produced by FERAL. (b) The top 15 gene enrichments reported by BiNGO when applied to the top 400 genes provided by FERAL
Fig. 7.
Frequently identified gene sets by FERAL. The bars represent the median coefficient across folds, normalized to the range [−1, 1]. Background colors indicate the correlation with the target label, ranging from positive (blue) to negative (red)
