BMC Bioinformatics. 2008 Oct 6;9:416.
doi: 10.1186/1471-2105-9-416.

Knowledge-guided multi-scale independent component analysis for biomarker identification

Li Chen et al. BMC Bioinformatics.

Abstract

Background: Many statistical methods have been proposed to identify disease biomarkers from gene expression profiles. However, from gene expression profile data alone, statistical methods often fail to identify biologically meaningful biomarkers related to a specific disease under study. In this paper, we develop a novel strategy, namely knowledge-guided multi-scale independent component analysis (ICA), to first infer regulatory signals and then identify biologically relevant biomarkers from microarray data.

Results: Since gene expression levels reflect the joint effect of several underlying biological functions, disease-specific biomarkers may be involved in several distinct biological functions. To identify disease-specific biomarkers that provide unique mechanistic insights, a metadata "knowledge gene pool" (KGP) is first constructed from multiple data sources to provide information on the likely functions (such as gene ontology information) and regulatory events (such as promoter responsive elements) associated with potential genes of interest. The gene expression and biological metadata associated with the members of the KGP can then be used to guide subsequent analysis. ICA is applied to multi-scale gene clusters to reveal regulatory modes reflecting the underlying biological mechanisms. Finally, disease-specific biomarkers are extracted by their weighted connectivity scores associated with the extracted regulatory modes. A statistical significance test evaluates transcription factor enrichment for the extracted gene set based on motif information. We applied the proposed method to yeast cell cycle microarray data and Rsf-1-induced ovarian cancer microarray data. The results show that our knowledge-guided ICA approach can extract biologically meaningful regulatory modes and outperforms several baseline methods for biomarker identification.
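
The abstract does not give the formula for the weighted connectivity score. As a rough illustration only, the sketch below assumes the score is a weighted sum of absolute Pearson correlations between a gene's expression profile and each extracted regulatory mode. The function names, the toy profiles, and the weights are all hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def connectivity_score(profile, modes, weights):
    """ASSUMED score: weighted sum of |correlation| between a gene and each mode."""
    return sum(w * abs(pearson(profile, m)) for m, w in zip(modes, weights))


def rank_genes(expression, modes, weights):
    """Return gene names sorted from highest to lowest assumed connectivity score."""
    scores = {g: connectivity_score(p, modes, weights) for g, p in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): two regulatory modes over eight time points.
modes = [
    [0, 1, 0, -1, 0, 1, 0, -1],   # oscillating mode
    [1, 2, 3, 4, 5, 6, 7, 8],     # monotone trend
]
weights = [0.7, 0.3]              # hypothetical mode weights
expression = {
    "geneA": [0, 2, 0, -2, 0, 2, 0, -2],  # follows the oscillating mode
    "geneB": [8, 7, 6, 5, 4, 3, 2, 1],    # follows the trend, inverted
    "geneC": [1, 1, 1, 1, 1, 1, 1, 1],    # flat profile
}
ranking = rank_genes(expression, modes, weights)
print(ranking)
```

Using the absolute correlation treats activated and repressed genes alike; a signed score would be an equally plausible choice, so the paper's own definition should be checked before reusing this.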

Conclusion: We have proposed a novel method, knowledge-guided multi-scale ICA, to identify disease-specific biomarkers. The goal is to infer knowledge-relevant regulatory signals and then identify the corresponding biomarkers through a multi-scale strategy. Applied to two expression profiling experiments, the approach showed improved performance in extracting biologically meaningful and disease-related biomarkers. More importantly, it shows promise for inferring novel biomarkers for ovarian cancer and extending current knowledge.

Figures

Figure 1
Flow chart of the proposed method – knowledge-guided multi-scale independent component analysis (ICA) – for biomarker identification.
Figure 2
Procedure of ten-fold cross-validation. The optimal number of clusters is determined by a nested ten-fold cross-validation on the training gene set.
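
The nested procedure in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `k_fold_splits`, `nested_cv`, and the `evaluate(train, test, n_clusters)` callback are hypothetical names, and the toy scoring function stands in for whatever performance measure (for example AUC) is actually used.

```python
import random


def k_fold_splits(items, n_folds=10, seed=0):
    """Yield (train, test) lists for a shuffled k-fold partition of items."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def nested_cv(genes, candidate_ks, evaluate, n_folds=10, seed=0):
    """For each outer fold, pick the cluster count by inner CV on the training genes only."""
    results = []
    for outer_train, outer_test in k_fold_splits(genes, n_folds, seed):

        def inner_mean(k):
            # Average score of cluster count k over inner folds of the outer training set.
            scores = [
                evaluate(tr, va, k)
                for tr, va in k_fold_splits(outer_train, n_folds, seed + 1)
            ]
            return sum(scores) / len(scores)

        best_k = max(candidate_ks, key=inner_mean)
        # The held-out outer fold is touched only once, with the selected k.
        results.append((best_k, evaluate(outer_train, outer_test, best_k)))
    return results


# Toy usage: a hypothetical score that peaks at 5 clusters.
genes = [f"g{i}" for i in range(104)]
toy_evaluate = lambda train, test, k: -abs(k - 5)
outer_results = nested_cv(genes, [2, 5, 8], toy_evaluate)
print([k for k, _ in outer_results])
```

Selecting the number of clusters inside the training portion of each outer fold keeps the held-out genes out of model selection, which is the point of nesting.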
Figure 3
Histogram of the optimal number of clusters determined in ten-fold cross-validation on the yeast cell cycle data set.
Figure 4
ROC curves of ten-fold cross-validation for four biomarker identification methods on the training knowledge gene set of the yeast cell cycle data set. The solid line represents the multi-scale ICA method, the dash-dotted line the baseline ICA method, the dotted line correlation method-1, and the dashed line correlation method-2.
Figure 5
Average area under the curve (AUC) values using ten-fold cross-validation with different numbers of clusters on 104 knowledge genes. The knowledge-guided multi-scale ICA method is applied to the yeast cell cycle data set to identify cell cycle-related genes.
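
For readers unfamiliar with the AUC values reported here, the sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive gene scores higher than a randomly chosen negative gene, with ties counted as one half. The function name and the toy scores are illustrative assumptions, not data from the paper.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy usage: 1 marks a known biomarker, 0 marks a background gene (hypothetical).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 1, 0, 1, 0]
print(roc_auc(toy_scores, toy_labels))
```

The pairwise loop is quadratic in the number of genes; a sort-based rank-sum computation gives the same value and scales better for large gene sets.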
Figure 6
ROC curves of four biomarker identification methods on yeast cell cycle data set with an independent test gene set.
Figure 7
Five cell cycle-related linear modes in the proposed multi-scale ICA approach on yeast cell cycle data set. The weight is also listed in the figure for each linear mode.
Figure 8
Histogram of the optimal number of clusters determined in ten-fold cross-validation on the ovarian cancer data set.
Figure 9
ROC curves of ten-fold cross-validation for four biomarker identification methods on the knowledge gene set of the ovarian cancer data set. The solid line represents the multi-scale ICA method, the dash-dotted line the baseline ICA method, the dotted line correlation method-1, and the dashed line correlation method-2.
Figure 10
Average AUC values using ten-fold cross-validation across different numbers of clusters. The knowledge-guided multi-scale ICA method is applied to Rsf-1-induced ovarian cancer microarray data set for the identification of disease-specific biomarkers.
Figure 11
Estimated knowledge-related TFAs using baseline ICA method. X-axis represents the time and Y-axis represents the estimated TFAs.
Figure 12
Four estimated knowledge-related TFAs using the proposed multi-scale ICA method. X-axis represents the time and Y-axis represents the estimated TFAs.
Figure 13
Average p-value of TF enrichment for different gene sets associated with different methods on Rsf-1-induced ovarian cancer microarray data set.
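
Neither this caption nor the abstract names the significance test behind these p-values. One common choice for motif-based enrichment is the one-sided hypergeometric test, sketched below as an assumption rather than as the paper's method. The function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, with_motif, selected, selected_with_motif):
    """P(X >= selected_with_motif) when `selected` genes are drawn without
    replacement from `population` genes, `with_motif` of which carry the TF motif."""
    upper = min(with_motif, selected)
    total = comb(population, selected)
    tail = sum(
        comb(with_motif, i) * comb(population - with_motif, selected - i)
        for i in range(selected_with_motif, upper + 1)
    )
    return tail / total


# Hypothetical counts: 6000 genes on the array, 300 carry the motif,
# and 4 of the 10 selected genes carry it.
p_value = hypergeom_enrichment_p(6000, 300, 10, 4)
print(p_value)
```

A small p-value means the selected gene set contains more motif-bearing genes than random selection would be expected to produce; averaging such p-values over several TFs would give a summary comparable in spirit to the one plotted.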
Figure 14
TFs and their locations in the 2 kbp promoter region for the top 10 genes selected by our approach. The promoter region spans -2,000 bp to 0 relative to the TSS, and each block in the figure represents a 100 bp region.
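
The block layout described in this caption amounts to binning motif positions into twenty 100 bp intervals. The sketch below shows one way to do that; treating position 0 as outside the region (half-open interval from -2,000 up to but excluding 0) is an assumption, and the TF names and positions are placeholders rather than results from the paper.

```python
def promoter_block(position, region_start=-2000, block_size=100):
    """Map a position relative to the TSS to a 0-based block index.

    Block 0 covers [-2000, -1900) and block 19 covers [-100, 0) (assumed half-open).
    """
    if not region_start <= position < 0:
        raise ValueError("position lies outside the promoter region")
    return (position - region_start) // block_size


def block_map(hits):
    """Group (tf_name, position) motif hits by promoter block index."""
    blocks = {}
    for tf_name, position in hits:
        blocks.setdefault(promoter_block(position), []).append(tf_name)
    return blocks


# Placeholder hits: names and positions are illustrative only.
toy_hits = [("TF_A", -1950), ("TF_B", -1905), ("TF_A", -40)]
print(block_map(toy_hits))
```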
Figure 15
Network obtained from IPA for all of the top 10 genes in Table 6. Five genes (FOSB, FOS, EGR1, IL8 and CDK2) are highly related to the cancer module.
