BMC Bioinformatics. 2008 Oct 6:9:416.

doi: 10.1186/1471-2105-9-416.

Knowledge-guided multi-scale independent component analysis for biomarker identification

Li Chen¹, Jianhua Xuan, Chen Wang, Ie-Ming Shih, Yue Wang, Zhen Zhang, Eric Hoffman, Robert Clarke

PMID: 18837990
PMCID: PMC2576264
DOI: 10.1186/1471-2105-9-416

Li Chen et al. BMC Bioinformatics. 2008.

Affiliation

¹ Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA. lchen06@vt.edu

Abstract

Background: Many statistical methods have been proposed to identify disease biomarkers from gene expression profiles. However, from gene expression profile data alone, statistical methods often fail to identify biologically meaningful biomarkers related to a specific disease under study. In this paper, we develop a novel strategy, namely knowledge-guided multi-scale independent component analysis (ICA), to first infer regulatory signals and then identify biologically relevant biomarkers from microarray data.

Results: Since gene expression levels reflect the joint effect of several underlying biological functions, disease-specific biomarkers may be involved in several distinct biological functions. To identify disease-specific biomarkers that provide unique mechanistic insights, a meta-data "knowledge gene pool" (KGP) is first constructed from multiple data sources to provide important information on the likely functions (such as gene ontology information) and regulatory events (such as promoter responsive elements) associated with potential genes of interest. The gene expression and biological meta-data associated with the members of the KGP are then used to guide subsequent analysis. ICA is applied to multi-scale gene clusters to reveal regulatory modes reflecting the underlying biological mechanisms. Finally, disease-specific biomarkers are extracted according to their weighted connectivity scores with the extracted regulatory modes. A statistical significance test, based on motif information, is used to evaluate transcription factor enrichment in the extracted gene set. We applied the proposed method to yeast cell cycle microarray data and Rsf-1-induced ovarian cancer microarray data. The results show that our knowledge-guided ICA approach extracts biologically meaningful regulatory modes and outperforms several baseline methods for biomarker identification.
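The pipeline described in the Results (cluster genes at several scales, run ICA within each cluster, then score genes by weighted connectivity to the extracted modes) can be illustrated with a minimal sketch. The abstract does not give the exact algorithm, so everything concrete here is an assumption: k-means as the clustering step, scikit-learn's `FastICA` as the ICA step, absolute Pearson correlation as the "connectivity", and mode weights taken as the mean connectivity of the knowledge genes. The function name `multiscale_ica_scores` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import FastICA


def multiscale_ica_scores(expr, knowledge_idx, cluster_counts=(2, 4), n_modes=2, seed=0):
    """Score genes by weighted connectivity to ICA modes (illustrative only).

    expr: (n_genes, n_samples) expression matrix.
    knowledge_idx: row indices of genes in the knowledge gene pool (KGP).
    cluster_counts: the "scales", i.e. the numbers of k-means clusters to try.
    """
    # Standardize each gene profile so that correlations become dot products.
    z = expr - expr.mean(axis=1, keepdims=True)
    z = z / (z.std(axis=1, keepdims=True) + 1e-12)
    n_genes, n_samples = z.shape
    scores = np.zeros(n_genes)

    for k in cluster_counts:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(z)
        for c in range(k):
            members = z[labels == c]
            if members.shape[0] <= n_modes:
                continue  # too few genes to estimate n_modes components
            # Samples are observations, genes are mixed signals; each column
            # of `modes` is one candidate regulatory mode over the samples.
            ica = FastICA(n_components=n_modes, random_state=seed, max_iter=1000)
            modes = ica.fit_transform(members.T)  # (n_samples, n_modes)
            modes = (modes - modes.mean(axis=0)) / (modes.std(axis=0) + 1e-12)
            # Absolute Pearson correlation of every gene with every mode.
            corr = np.abs(z @ modes) / n_samples  # (n_genes, n_modes)
            # Assumed weighting: how strongly the knowledge genes follow a mode.
            weights = corr[knowledge_idx].mean(axis=0)
            scores += corr @ weights
    return scores


# Synthetic demo: 20 sine-like genes, 20 cosine-like genes, 20 noise genes.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 30, endpoint=False)
sin_genes = np.sin(t) + 0.3 * rng.standard_normal((20, 30))
cos_genes = np.cos(t) + 0.3 * rng.standard_normal((20, 30))
noise_genes = rng.standard_normal((20, 30))
expr = np.vstack([sin_genes, cos_genes, noise_genes])

knowledge_idx = [0, 1, 2, 3, 4]  # pretend five sine-like genes are "known"
scores = multiscale_ica_scores(expr, knowledge_idx)
top10 = np.argsort(scores)[::-1][:10]  # candidate biomarkers, highest score first
```

In this toy setting the knowledge genes share the sine pattern, so genes following that pattern should accumulate higher scores than pure-noise genes. Summing scores across scales is one simple way to combine cluster resolutions; the paper's actual combination rule may differ.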

Conclusion: We have proposed a novel method, namely knowledge-guided multi-scale ICA, to identify disease-specific biomarkers. The goal is to infer knowledge-relevant regulatory signals and then identify corresponding biomarkers through a multi-scale strategy. Applied to two expression profiling experiments, the approach demonstrated improved performance in extracting biologically meaningful and disease-related biomarkers. More importantly, the proposed approach shows promise for inferring novel biomarkers for ovarian cancer and extending current knowledge.
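The abstract mentions a significance test for transcription factor enrichment but does not name it. One common choice for this kind of question is a one-sided hypergeometric test, sketched below as an assumption rather than as the authors' procedure. The function name and the toy counts are hypothetical; only the Python standard library is used.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_background, n_background_hits, n_selected, n_selected_hits):
    """One-sided hypergeometric p-value, P(X >= n_selected_hits).

    n_background: number of genes in the background set.
    n_background_hits: background genes whose promoter carries the TF motif.
    n_selected: size of the extracted gene set.
    n_selected_hits: extracted genes whose promoter carries the TF motif.
    """
    upper = min(n_selected, n_background_hits)
    total = comb(n_background, n_selected)
    # Sum the probability of observing k motif-carrying genes for every k at
    # least as large as the observed count; comb() returns 0 for impossible k.
    tail = sum(
        comb(n_background_hits, k)
        * comb(n_background - n_background_hits, n_selected - k)
        for k in range(n_selected_hits, upper + 1)
    )
    return tail / total


# Toy example: 10 background genes, 3 carry the motif, 2 genes selected,
# and both selected genes carry the motif.
p_value = hypergeom_enrichment_pvalue(10, 3, 2, 2)
```

A small p-value suggests that the motif occurs in the selected set more often than random selection from the background would explain. When many transcription factors are tested, a multiple-testing correction would normally be applied on top of this.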

Figures

**Figure 1**
Flow chart of the proposed method – knowledge-guided multi-scale independent component analysis (ICA) – for biomarker identification.

**Figure 2**
**Procedure of ten-fold cross-validation.** The optimal number of clusters is determined by a nested ten-fold cross-validation on the training gene set.

**Figure 3**
Histogram of determined optimal number of clusters in ten-fold cross-validation on yeast cell cycle data set.

**Figure 4**
**ROC curves of ten-fold cross-validation for four biomarker identification methods on training knowledge gene set of yeast cell cycle data set.** Solid line represents the multi-scale ICA method; dash-dotted line represents the baseline ICA method; dotted line represents the correlation method-1; dashed line represents the correlation method-2.

**Figure 5**
**Average area under the curve (AUC) values using ten-fold cross-validation with different numbers of clusters on 104 knowledge genes.** The knowledge-guided multi-scale ICA method is applied to yeast cell cycle data set for the identification of cell cycle-related genes.

**Figure 6**
ROC curves of four biomarker identification methods on yeast cell cycle data set with an independent test gene set.

**Figure 7**
**Five cell cycle-related linear modes in the proposed multi-scale ICA approach on yeast cell cycle data set.** The weight is also listed in the figure for each linear mode.

**Figure 8**
Histogram of determined optimal number of clusters in ten-fold cross-validation on ovarian cancer data set.

**Figure 9**
**ROC curves of ten-fold cross-validation for four biomarker identification methods on knowledge gene set of ovarian cancer data set.** Solid line represents the multi-scale ICA method; dash-dotted line represents the baseline ICA method; dotted line represents the correlation method-1; dashed line represents the correlation method-2.

**Figure 10**
**Average AUC values using ten-fold cross-validation across different numbers of clusters.** The knowledge-guided multi-scale ICA method is applied to Rsf-1-induced ovarian cancer microarray data set for the identification of disease-specific biomarkers.

**Figure 11**
**Estimated knowledge-related TFAs using baseline ICA method.** X-axis represents the time and Y-axis represents the estimated TFAs.

**Figure 12**
**Estimated four knowledge-related TFAs using the proposed multi-scale ICA method.** X-axis represents the time and Y-axis represents the estimated TFAs.

**Figure 13**
Average p-value of TF enrichment for different gene sets associated with different methods on Rsf-1-induced ovarian cancer microarray data set.

**Figure 14**
**TFs and their locations in 2 Kbp promoter region for top 10 genes selected by our approach.** The promoter region is represented from -2,000 bp to 0 from TSS and each block in the figure represents a 100 bp region.

**Figure 15**
**The network obtained from IPA with all of the top 10 genes in Table 6.** Five genes, FOSB, FOS, EGR1, IL8 and CDK2, are highly related to the cancer module.

Cited by

Glycated lysine-141 in haptoglobin improves the diagnostic accuracy for type 2 diabetes mellitus in combination with glycated hemoglobin HbA1c and fasting plasma glucose.
Spiller S, Li Y, Blüher M, Welch L, Hoffmann R. Clin Proteomics. 2017 Mar 28;14:10. doi: 10.1186/s12014-017-9145-1. eCollection 2017. PMID: 28360826.
ADAGE signature analysis: differential expression analysis with data-defined gene sets.
Tan J, Huyck M, Hu D, Zelaya RA, Hogan DA, Greene CS. BMC Bioinformatics. 2017 Nov 22;18(1):512. doi: 10.1186/s12859-017-1905-4. PMID: 29166858.
A minimal connected network of transcription factors regulated in human tumors and its application to the quest for universal cancer biomarkers.
Essaghir A, Demoulin JB. PLoS One. 2012;7(6):e39666. doi: 10.1371/journal.pone.0039666. Epub 2012 Jun 25. PMID: 22761861.
Unsupervised Extraction of Stable Expression Signatures from Public Compendia with an Ensemble of Neural Networks.
Tan J, Doing G, Lewis KA, Price CE, Chen KM, Cady KC, Perchuk B, Laub MT, Hogan DA, Greene CS. Cell Syst. 2017 Jul 26;5(1):63-71.e6. doi: 10.1016/j.cels.2017.06.003. Epub 2017 Jul 12. PMID: 28711280.
Independent component analysis: mining microarray data for fundamental human gene expression modules.
Engreitz JM, Daigle BJ Jr, Marshall JJ, Altman RB. J Biomed Inform. 2010 Dec;43(6):932-44. doi: 10.1016/j.jbi.2010.07.001. Epub 2010 Jul 7. PMID: 20619355.
