BMC Syst Biol. 2013 Jun 25;7:49.
doi: 10.1186/1752-0509-7-49.

Mining breast cancer genes with a network based noise-tolerant approach

Yaling Nie et al. BMC Syst Biol.

Abstract

Background: Mining novel breast cancer genes is an important task in breast cancer research. Many approaches prioritize candidate genes based on their similarity to known cancer genes, usually by integrating multiple data sources. However, different types of data often contain varying degrees of noise. For effective data integration, it's important to design methods that work robustly with respect to noise.

Results: Gene Ontology (GO) annotations are often used in cancer gene mining studies. However, the vast majority of GO annotations are computationally derived and thus not completely accurate. A set of genes annotated with breast cancer enriched GO terms was adopted here as source data with realistic noise. A novel noise-tolerant approach was proposed to rank candidate breast cancer genes using noisy source data within the framework of a comprehensive human Protein-Protein Interaction (PPI) network. Performance of the proposed method was quantitatively evaluated by comparing it with the more established random walk approach. Results showed that the proposed method performed better in ranking known breast cancer genes and was more robust against data noise than the random walk approach. As noise began to increase, the proposed method maintained relatively stable performance, while the random walk approach declined drastically; when noise increased to a large extent, the proposed method still outperformed random walk.
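For readers unfamiliar with the baseline, the following is a minimal sketch of a generic random walk with restart on a small undirected graph. It is not taken from the paper: the toy graph, the node labels, the function name `random_walk_with_restart`, and the restart probability of 0.7 are all assumptions chosen for illustration, and the paper's exact random walk formulation may differ.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Generic random walk with restart (illustrative, not the paper's code).

    adjacency: dict mapping each node to the set of its neighbours (undirected).
    seeds: set of seed nodes (e.g. known disease genes) present in adjacency.
    Returns a dict of steady-state visiting probabilities per node.
    """
    nodes = sorted(adjacency)
    # Initial distribution: uniform over the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed.
        nxt = {n: restart * p0[n] for n in nodes}
        # Otherwise it moves to a uniformly chosen neighbour.
        for n in nodes:
            degree = len(adjacency[n])
            if degree == 0:
                continue  # isolated nodes pass nothing on
            share = (1.0 - restart) * p[n] / degree
            for m in adjacency[n]:
                nxt[m] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical toy network with edges A-B, A-C, B-C, C-D and seed A.
toy_network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
scores = random_walk_with_restart(toy_network, seeds={"A"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this kind of scheme, candidate genes are ranked by their steady-state score, so any noise in the seed set propagates directly into the ranking.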

Conclusions: A novel noise-tolerant method was proposed to mine breast cancer genes. Compared to the well-established random walk approach, it showed better performance in correctly ranking cancer genes and worked robustly with respect to noise within source data. To the best of our knowledge, it is the first such effort to quantitatively analyze noise tolerance between different breast cancer gene mining methods. The sorted gene list can be valuable for breast cancer research. The proposed quantitative noise analysis method may also prove useful for other data integration efforts. It is hoped that the current work can lead to more discussion about the influence of data noise on different computational methods for mining disease genes.

Figures

Figure 1
Schematic chart for mining breast cancer genes. Four different types of data were used as input: PPI data, human gene expression data, known breast cancer genes and GO annotations. Gene expression data (GDSes) from the GEO database were clustered. Known breast cancer genes and their enriched GO annotations were used to rank genes in those clusters. From the PPI network, three network topological attributes were computed to rank genes in the network. Finally, all individual rankings were combined into a final rank, which represents a gene’s overall probability of being involved in breast cancer.
Figure 2
Performance of our method for different λ values. P-score was the average ranking of KnownSet genes within the top 10% of the final sorted list. A smaller P-score (known genes ranked higher) meant better capability to correctly rank known breast cancer genes.
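One possible reading of the P-score described in this caption can be sketched as follows. The function name `p_score`, the integer `top_percent` parameter, and the placeholder gene labels are assumptions for illustration; the paper may define the score differently in detail.

```python
def p_score(ranked_genes, known_set, top_percent=10):
    """Average 1-based rank of known genes found in the top part of the list.

    ranked_genes: list of genes, best-ranked first.
    known_set: set of known disease genes (KnownSet in the caption).
    Returns None when no known gene falls inside the cutoff.
    """
    # Integer arithmetic avoids floating-point surprises at the cutoff.
    cutoff = max(1, len(ranked_genes) * top_percent // 100)
    ranks = [
        position
        for position, gene in enumerate(ranked_genes[:cutoff], start=1)
        if gene in known_set
    ]
    return sum(ranks) / len(ranks) if ranks else None


# Hypothetical list of 100 genes; g2 and g4 fall in the top 10, g50 does not.
final_list = ["g%d" % i for i in range(1, 101)]
known = {"g2", "g4", "g50"}
score = p_score(final_list, known)
```

Under this reading, a lower value means the known genes that reach the top of the list sit closer to its head.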
Figure 3
Evidence code distribution of GO annotations (for Homo sapiens). IEA means that the denoted GO annotations were inferred from electronic annotation and not curated manually. IEA accounted for the majority of GO annotations.
Figure 4
Five-fold cross-validation performance evaluation. In cross validation, different ratios of noise data were added to input data, with the ratio changing from 0 to 50, where 0 meant only known breast cancer genes were used as input data. Performance was evaluated in terms of the F-score. k was a ranking threshold to judge a ranked gene as a true breast cancer gene by the proposed method. (a) k = 300; (b) k = 400.
Figure 5
Ranking comparison performance. Randomly picked genes were added to the KnownSet, and performances of the proposed method and random walk approach were then compared.
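The noise-injection step described here can be sketched as below. The function name `add_noise`, the interpretation of the noise ratio as "noise genes per known gene", the fixed random seed, and the placeholder gene labels are all assumptions; the source does not state the units of its noise ratio.

```python
import random


def add_noise(known_set, all_genes, noise_ratio, seed=0):
    """Return known_set plus randomly picked genes that are not in it.

    noise_ratio is assumed to mean noise genes per known gene, so 0 adds
    nothing and 0.5 adds half as many noise genes as there are known genes.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    candidates = sorted(set(all_genes) - set(known_set))
    n_noise = min(len(candidates), round(noise_ratio * len(known_set)))
    return set(known_set) | set(rng.sample(candidates, n_noise))


# Hypothetical universe of 20 genes, 4 of which are "known".
universe = ["g%d" % i for i in range(1, 21)]
known_genes = {"g1", "g2", "g3", "g4"}
noisy_input = add_noise(known_genes, universe, noise_ratio=0.5)
```

Feeding `noisy_input` instead of `known_genes` into each ranking method, and repeating over increasing ratios, gives the kind of comparison the caption describes.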
Figure 6
Ranking genes by network attributes. Genes of the PPI network were sorted according to network attributes (node degree was used as an example here). The scores were then converted into ranking values.
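As a small illustration of the degree example in this caption, the sketch below computes node degree from an edge list and converts it into ranking values. The function name `rank_by_degree`, the alphabetical tie-breaking, and the toy edges are assumptions not taken from the paper.

```python
def rank_by_degree(edges):
    """Map each gene to a 1-based rank, highest node degree first.

    edges: iterable of (gene_a, gene_b) pairs from an undirected network.
    Ties are broken alphabetically, which is an arbitrary choice here.
    """
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    ordered = sorted(degree, key=lambda gene: (-degree[gene], gene))
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}


# Hypothetical interactions: G1 has degree 3, G2 and G3 degree 2, G4 degree 1.
toy_edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G2", "G3")]
degree_ranks = rank_by_degree(toy_edges)
```

Other topological attributes could be ranked the same way by swapping the degree count for a different per-node score.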
Figure 7
Getting the GO enriched gene set by GO enrichment analysis. Three different tools were used to perform GO term enrichment analysis for known breast cancer genes. The top 50 enriched GO terms were picked from the results obtained by each tool, and their union was generated. Nine cancer-hallmark GO terms from [37] were added to the enriched GO term set. The enriched GO term set was then mapped back to a set of human genes based on Homo sapiens GO annotations, called the GO enriched gene set (GOSet).
Figure 8
Ranking gene clusters from GEO expression profiles. For a cluster i, random samples of the same size were drawn from the same GDS and their overlaps with GOSet were computed. A p-value was used to represent the significance of a cluster's overlap with GOSet. The smaller the p-value, the higher the ranking.
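The sampling procedure in this caption resembles an empirical (permutation-style) p-value, which can be sketched as follows. The function name `cluster_overlap_p_value`, the number of samples, the add-one correction, and the toy data are assumptions for illustration; the paper may compute its p-value differently.

```python
import random


def cluster_overlap_p_value(cluster, gds_genes, go_set, n_samples=1000, seed=0):
    """Empirical p-value for a cluster's overlap with go_set.

    Draws n_samples random gene sets of the cluster's size from gds_genes and
    counts how often their overlap with go_set is at least the observed one.
    The +1 terms keep the estimate strictly above zero (an assumed convention).
    """
    rng = random.Random(seed)  # seeded for reproducibility
    go_set = set(go_set)
    cluster = set(cluster)
    observed = len(cluster & go_set)
    population = sorted(set(gds_genes))
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(population, len(cluster))
        if len(go_set.intersection(sample)) >= observed:
            hits += 1
    return (hits + 1) / (n_samples + 1)


# Hypothetical dataset of 100 genes, the first 10 of which form the GOSet.
dataset_genes = ["g%d" % i for i in range(1, 101)]
toy_go_set = set(dataset_genes[:10])
p_enriched = cluster_overlap_p_value(dataset_genes[:5], dataset_genes, toy_go_set)
p_unrelated = cluster_overlap_p_value(dataset_genes[50:55], dataset_genes, toy_go_set)
```

Clusters would then be sorted by ascending p-value, so that the most significantly overlapping cluster ranks first.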
Figure 9
Computing precision and recall. A was the set of known breast cancer genes, and B was the set of genes predicted to be breast cancer genes by the proposed method. Precision represented the ability to reject unrelated genes, and recall represented the ability to recover true breast cancer genes.
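The quantities in this caption, together with the ranking threshold k from Figure 4, can be sketched as follows. Treating the top k ranked genes as the predicted set B and using the balanced F-score (harmonic mean of precision and recall) are assumptions; the source says only "F-score". The function name and toy genes are hypothetical.

```python
def precision_recall_f(known, ranked_genes, k):
    """Precision, recall and balanced F-score for the top-k predictions.

    known: set A of known disease genes.
    ranked_genes: list of genes, best-ranked first; the top k form set B.
    """
    predicted = set(ranked_genes[:k])
    true_positives = len(known & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical ranking: 2 of the top 4 genes are among the 4 known genes.
toy_ranking = ["a", "b", "c", "d", "e", "f"]
toy_known = {"a", "c", "x", "y"}
precision, recall, f_score = precision_recall_f(toy_known, toy_ranking, k=4)
```

Raising k generally trades precision for recall, which is one reason a figure might report results at more than one threshold.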

References

    1. Wang X, Gulbahce N, Yu H. Network-based methods for human disease gene prediction. Brief Funct Genomics. 2011;10:280–293. doi: 10.1093/bfgp/elr024.
    2. Wu X, Li S. Cancer gene prediction using a network approach. Cancer Systems Biology. 2010. pp. 191–212.
    3. Siegel R, Naishadham D, Jemal A. Cancer statistics, 2012. CA Cancer J Clin. 2012;62:10–29. doi: 10.3322/caac.20138.
    4. Materi W, Wishart DS. Computational systems biology in cancer: modeling methods and applications. Gene Regul Syst Bio. 2007;1:91–110.
    5. Ideker T, Sharan R. Protein networks in disease. Genome Res. 2008;18:644–652. doi: 10.1101/gr.071852.107.
