BMC Bioinformatics. 2008 Mar 27;9:172.
doi: 10.1186/1471-2105-9-172.

Predicting cancer involvement of genes from heterogeneous data

Ramon Aragues et al. BMC Bioinformatics. 2008.

Abstract

Background: Systematic approaches for identifying proteins involved in different types of cancer are needed. Experimental techniques such as microarrays are being used to characterize cancer, but validating their results can be a laborious task. Computational approaches are used to prioritize among genes putatively involved in cancer, usually by further analyzing experimental data.

Results: We implemented a systematic method using the PIANA software that predicts cancer involvement of genes by integrating heterogeneous datasets. Specifically, we produced lists of genes likely to be involved in cancer by relying on: (i) protein-protein interactions; (ii) differential expression data; and (iii) structural and functional properties of cancer genes. The integrative approach that combines multiple sources of data obtained positive predictive values ranging from 23% (on a list of 811 genes) to 73% (on a list of 22 genes), outperforming the use of any of the data sources alone. We analyze a list of 20 cancer gene predictions, finding that most of them have recently been linked to cancer in the literature.

Conclusion: Our approach to identifying and prioritizing candidate cancer genes can be used to produce lists of genes likely to be involved in cancer. Our results suggest that differential expression studies yielding high numbers of candidate cancer genes can be filtered using protein interaction networks.

Figures

Figure 1
Calculating the Cancer Linker Degree (CLD) of a protein. The Cancer Linker Degree (CLD) of a protein is defined as the absolute number of partners of the protein that are known to be involved in cancer. The procedure followed to calculate the CLD of a protein consists of 3 steps: 1) setting the known cancer genes as seeds; 2) retrieving the direct interaction partners for the known cancer genes; and 3) calculating the CLD of each protein (i.e. the number of known cancer genes to which it is connected). In the example provided, we observe that proteins with high CLD are more likely to be cancer gene products than proteins with low CLD.
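
The three steps in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the PIANA implementation: the function name `cancer_linker_degree`, the edge-list input format, the choice to ignore self-interactions, and the toy gene names are all assumptions made for the example.

```python
from collections import defaultdict

def cancer_linker_degree(interactions, known_cancer_genes):
    """Return {protein: number of distinct direct partners that are known cancer genes}."""
    seeds = set(known_cancer_genes)        # step 1: known cancer genes act as seeds
    partners = defaultdict(set)
    for a, b in interactions:              # step 2: collect direct interaction partners
        if a == b:
            continue                       # self-interactions ignored (an assumption)
        partners[a].add(b)
        partners[b].add(a)
    # step 3: CLD = how many of a protein's partners are seeds
    return {p: len(nbrs & seeds) for p, nbrs in partners.items()}

# Hypothetical toy network; names are placeholders, not data from the paper.
edges = [("A", "C1"), ("A", "C2"), ("A", "C3"),
         ("B", "C1"), ("B", "X"), ("C1", "C2")]
cld = cancer_linker_degree(edges, {"C1", "C2", "C3"})
print(cld["A"], cld["B"], cld["X"])  # 3 1 0
```

Using sets for the partner lists means a pair reported more than once in the interaction data is counted only once.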
Figure 2
Positive predictive value and sensitivity when predicting cancer genes based on the cancer linker degree of proteins. The positive predictive value and sensitivity shown are for accumulative cancer linker degrees (CLD) (i.e. cancer linker degree 5 represents proteins with CLD ≥ 5). The average protein in the data set is represented by CLD 0.
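
As a rough illustration of the accumulative thresholds described here, the sketch below predicts every gene whose score is at or above a threshold and then computes the standard positive predictive value (hits / predictions) and sensitivity (hits / known cancer genes). The helper name `ppv_and_sensitivity` and the toy values are hypothetical, and this excerpt does not state exactly which set of known cancer genes the paper uses as the denominator for sensitivity.

```python
def ppv_and_sensitivity(scores, known_cancer_genes, threshold):
    """Predict all genes with score >= threshold and compare with the known set."""
    known = set(known_cancer_genes)
    predicted = {g for g, s in scores.items() if s >= threshold}
    hits = len(predicted & known)          # known cancer genes among the predictions
    ppv = hits / len(predicted) if predicted else float("nan")
    sensitivity = hits / len(known) if known else float("nan")
    return ppv, sensitivity

# Hypothetical CLD values and known cancer genes (not data from the paper).
toy_cld = {"g1": 0, "g2": 1, "g3": 2, "g4": 5, "g5": 6, "g6": 7}
toy_known = {"g3", "g5", "g6"}

for t in (0, 2, 5):
    ppv, sens = ppv_and_sensitivity(toy_cld, toy_known, t)
    print(f"CLD >= {t}: PPV = {ppv:.2f}, sensitivity = {sens:.2f}")
```

A threshold of 0 selects every gene, so its positive predictive value is simply the fraction of known cancer genes in the data set, which matches the caption's use of CLD 0 as the baseline.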
Figure 3
Positive predictive value and sensitivity when predicting cancer genes based on differential expression data. The positive predictive value and sensitivity are shown for 12 cancer types and genes over- or under-expressed in at least 1, 2 and 5 cancer types.
Figure 4
Positive predictive value and sensitivity when predicting cancer genes based on their probability of being a cancer gene according to structural, functional and evolutionary properties (SF-Probability). The positive predictive value and sensitivity shown are for accumulative SF-Probabilities (i.e. SF-Probability 0.7 represents genes with SF-Probability ≥ 0.7). The average gene in the data set is represented by SF-Probability ≥ 0. SF-Probabilities were obtained from [37].
Figure 5
The average number of cancer types in which genes appear differentially expressed (A) and the probability of being a cancer gene according to structural, functional and evolutionary properties (B) are plotted as a function of the cancer linker degree (CLD) of the gene products. A) The average number of cancer types shown are for an accumulative CLD (i.e. CLD 5 represents proteins with CLD ≥ 5). The average protein in the dataset is represented by CLD 0. Known cancer genes appear differentially expressed in an average of 2.8 cancer types. B) The average SF-Probabilities shown are for an accumulative CLD (i.e. CLD 5 represents proteins with CLD ≥ 5). The average protein in the dataset is represented by CLD 0. Known cancer genes had an average SF-Probability of 0.41.
Figure 6
Contour maps for positive predictive value and sensitivity obtained when varying the thresholds applied by the integrative approach. In each image, the x-axis is the SF-Probability threshold and the y-axis is the cancer linker degree (CLD) threshold. For a given restriction on the number of cancer types in which a gene must be differentially expressed to be considered a candidate (no restriction, at least two cancer types, or at least five cancer types), the positive predictive value and sensitivity are provided for each combination of CLD and SF-Probability. Positive predictive values and sensitivities are shown using colored contour maps, from red (i.e. 0) to turquoise (i.e. 0.7 for positive predictive value and 0.3 for sensitivity). For example, requiring a gene to be differentially expressed in at least two cancer types, with a CLD of 6 and an SF-Probability of 0.4, gives a positive predictive value of 0.4 at a sensitivity of 0.05.
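
The grid behind such contour maps could be computed along the following lines. This is a sketch under assumptions: the function `sweep_thresholds`, the tuple layout `(CLD, number of cancer types, SF-Probability)` and the toy genes are invented for illustration, and plotting is left out.

```python
from itertools import product

def sweep_thresholds(genes, known, cld_thresholds, sf_thresholds, min_cancer_types=0):
    """Return {(cld_t, sf_t): (ppv, sensitivity)} for every threshold combination.

    genes maps a gene name to (cld, n_cancer_types, sf_probability).
    PPV is None when no gene passes a given combination.
    """
    known = set(known)
    grid = {}
    for cld_t, sf_t in product(cld_thresholds, sf_thresholds):
        predicted = {
            name for name, (cld, n_types, sf) in genes.items()
            if cld >= cld_t and n_types >= min_cancer_types and sf >= sf_t
        }
        hits = len(predicted & known)
        ppv = hits / len(predicted) if predicted else None
        sens = hits / len(known) if known else None
        grid[(cld_t, sf_t)] = (ppv, sens)
    return grid

# Hypothetical genes: (CLD, cancer types with differential expression, SF-Probability).
toy_genes = {
    "g1": (6, 2, 0.50), "g2": (6, 3, 0.45), "g3": (1, 2, 0.90),
    "g4": (7, 0, 0.80), "g5": (0, 1, 0.10),
}
grid = sweep_thresholds(toy_genes, {"g1", "g4"}, [0, 6], [0.0, 0.4], min_cancer_types=2)
for (cld_t, sf_t), (ppv, sens) in sorted(grid.items()):
    print(f"CLD >= {cld_t}, SF >= {sf_t}: PPV = {ppv}, sensitivity = {sens}")
```

Each call fixes one expression restriction, so producing one map per restriction (none, at least two, at least five cancer types) would mean calling the function once per value of `min_cancer_types`.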
Figure 7
Positive predictive value calculated for diverse overlaps of cancer gene candidates. The criteria applied were the following: (i) cancer linker degree ≥ 5; (ii) differentially expressed in at least four cancer types; and (iii) SF-Probability ≥ 0.6. The Venn diagram shows the total number of candidates, the number of hits (i.e. known cancer genes among the candidates) and the positive predictive value for each overlap case. For example, the positive predictive value when solely applying an SF-Probability threshold of 0.6 was 14%. In contrast, when combining the SF-Probability with a cancer linker degree threshold of 5, the positive predictive value was 37% (59 hits for a total of 158 candidates).
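
The one overlap that this caption quantifies in full can be checked directly, and the same calculation extends to any Venn region by intersecting candidate sets. In the sketch below only the figures 59 and 158 come from the caption; the helper `ppv` and the toy sets are hypothetical.

```python
def ppv(candidates, known):
    """Fraction of candidates that are known cancer genes (None if there are no candidates)."""
    candidates = set(candidates)
    return len(candidates & set(known)) / len(candidates) if candidates else None

# Figures from the caption: 59 hits among 158 candidates.
print(f"{59 / 158:.1%}")  # 37.3%, reported as 37%

# Hypothetical candidate sets for two criteria (not data from the paper).
by_cld = {"a", "b", "c", "d"}          # genes passing the CLD threshold
by_sf = {"b", "c", "d", "e", "f"}      # genes passing the SF-Probability threshold
toy_known = {"b", "c", "x"}

print(ppv(by_sf, toy_known))           # SF-Probability alone
print(ppv(by_cld & by_sf, toy_known))  # overlap of both criteria
```

In this toy case the overlap has fewer candidates but the same number of hits, which is the mechanism by which combining criteria raises the positive predictive value.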
Figure 8
Procedure followed to predict cancer gene candidates. First, a cancer protein interaction network is built from the list of known cancer genes. Second, expression data from different cancer types is mapped onto the network. Third, probabilities of being a cancer gene based on structural, functional and evolutionary properties are retrieved for proteins in the network. Fourth, cancer genes are predicted based on the thresholds provided by the user for each type of data.
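
The four steps of this procedure could be wired together roughly as follows. This is a sketch under stated assumptions rather than the authors' PIANA code: the names `Thresholds` and `predict_candidates` and the dictionary inputs are hypothetical, missing expression or SF-Probability values are treated as 0, and the default thresholds simply reuse the criteria quoted for Figure 7.

```python
from dataclasses import dataclass

@dataclass
class Thresholds:
    min_cld: int = 5                  # cancer linker degree >= 5
    min_cancer_types: int = 4         # differentially expressed in >= 4 cancer types
    min_sf_probability: float = 0.6   # SF-Probability >= 0.6

def predict_candidates(interactions, known_cancer_genes, expression_counts,
                       sf_probability, thresholds):
    seeds = set(known_cancer_genes)
    # Step 1: build the cancer network from direct partners of known cancer genes.
    cld, seen = {}, set()
    for a, b in interactions:
        pair = frozenset((a, b))
        if a == b or pair in seen:
            continue                  # skip self-interactions and repeated pairs
        seen.add(pair)
        if b in seeds:
            cld[a] = cld.get(a, 0) + 1
        if a in seeds:
            cld[b] = cld.get(b, 0) + 1
    # Steps 2 and 3: look up expression counts and SF-Probabilities for network
    # proteins. Step 4: keep the proteins that pass every user-supplied threshold.
    return sorted(
        g for g, degree in cld.items()
        if degree >= thresholds.min_cld
        and expression_counts.get(g, 0) >= thresholds.min_cancer_types
        and sf_probability.get(g, 0.0) >= thresholds.min_sf_probability
    )

# Hypothetical toy inputs with relaxed thresholds so the example stays small.
toy_edges = [("P", "C1"), ("P", "C2"), ("Q", "C1"), ("R", "C1"), ("R", "C2")]
toy_expression = {"P": 3, "Q": 5, "R": 1}   # number of cancer types per gene
toy_sf = {"P": 0.7, "Q": 0.9, "R": 0.8}
result = predict_candidates(toy_edges, {"C1", "C2"}, toy_expression, toy_sf,
                            Thresholds(min_cld=2, min_cancer_types=2))
print(result)  # ['P']
```

Known cancer genes that pass all thresholds are deliberately left in the output, since they are what count as hits when the positive predictive value is evaluated; excluding them would be the natural change when only novel candidates are wanted.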

References

    1. Hanahan D, Weinberg RA. The hallmarks of cancer. Cell. 2000;100:57–70. doi: 10.1016/S0092-8674(00)81683-9.
    2. Vogelstein B, Kinzler KW. Cancer genes and the pathways they control. Nat Med. 2004;10:789–799. doi: 10.1038/nm1087.
    3. Bielas JH, Loeb KR, Rubin BP, True LD, Loeb LA. Human cancers express a mutator phenotype. Proc Natl Acad Sci U S A. 2006;103:18238–18242. doi: 10.1073/pnas.0607057103.
    4. Segal E, Friedman N, Koller D, Regev A. A module map showing conditional activity of expression modules in cancer. Nat Genet. 2004;36:1090–1098. doi: 10.1038/ng1434.
    5. Fan C, Oh DS, Wessels L, Weigelt B, Nuyten DS, Nobel AB, van't Veer LJ, Perou CM. Concordance among gene-expression-based predictors for breast cancer. N Engl J Med. 2006;355:560–569. doi: 10.1056/NEJMoa052933.