BMC Genomics. 2015;16 Suppl 7(Suppl 7):S2.
doi: 10.1186/1471-2164-16-S7-S2. Epub 2015 Jun 11.

Multiple signatures of a disease in potential biomarker space: Getting the signatures consensus and identification of novel biomarkers

Ghim Siong Ow et al. BMC Genomics. 2015.

Abstract

Background: The lack of consensus among reported gene signature subsets (GSSs) in multi-gene biomarker discovery studies is often a concern for researchers and clinicians. This lack of consensus discourages larger-scale prospective studies, prevents the translation of such knowledge into practical clinical settings and ultimately hinders progress in biomarker-based disease classification, prognosis and prediction.

Methods: We define all "gene identificators" (gIDs) as constituents of the entire potential disease biomarker space. For each gID in a GSS of interest ("tested GSS"/tGSS), our method counts the empirical frequency of gID co-occurrences/overlaps in other reference GSSs (rGSSs) and compares it with the expected frequency generated by a randomized sampling procedure. Comparison of the empirical frequency distribution (EFD) with the expected background frequency distribution (BFD) allows the gIDs within the tGSS to be dichotomized into statistically novel (SN) and statistically common (SC) gIDs.
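The counting and randomized sampling steps described in the Methods can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy gene identifiers, the choice of sampling universe and the number of simulations are all assumptions made for illustration, and the abstract does not specify how the SN/SC threshold is chosen, so that step is omitted.

```python
import random
from collections import Counter

def cooccurrence_counts(tested, references):
    """For each gene in `tested`, count how many reference sets contain it."""
    return {g: sum(g in ref for ref in references) for g in tested}

def frequency_distribution(counts, max_k):
    """Number of tested genes occurring in exactly k reference sets, k = 0..max_k."""
    tally = Counter(counts.values())
    return [tally.get(k, 0) for k in range(max_k + 1)]

def background_distribution(tested, references, universe, n_sim=1000, seed=0):
    """Average frequency distribution over n_sim simulations in which each
    reference set is replaced by a random set of the same size, drawn without
    replacement from `universe` (assumed here to be the whole biomarker space)."""
    rng = random.Random(seed)
    pool = sorted(universe)  # fixed order so results are reproducible
    max_k = len(references)
    totals = [0.0] * (max_k + 1)
    for _ in range(n_sim):
        random_refs = [set(rng.sample(pool, len(ref))) for ref in references]
        dist = frequency_distribution(cooccurrence_counts(tested, random_refs), max_k)
        for k, value in enumerate(dist):
            totals[k] += value
    return [t / n_sim for t in totals]

# Hypothetical toy data: 200 gene identifiers, one tested set, three reference sets.
universe = {f"g{i}" for i in range(200)}
tested = {"g0", "g1", "g2", "g3", "g4"}
references = [
    {"g0", "g1", "g10", "g11"},
    {"g0", "g20", "g21"},
    {"g0", "g1", "g30"},
]

counts = cooccurrence_counts(tested, references)
efd = frequency_distribution(counts, len(references))        # empirical (EFD)
bfd = background_distribution(tested, references, universe)  # expected (BFD)
```

In this toy example, `efd[k]` is the number of tested genes found in exactly `k` reference sets, and `bfd[k]` is the corresponding average under random sampling. Genes whose counts lie where the EFD exceeds the BFD would be candidates for the SC class under this reading of the method.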

Results: We identify SN or SC biomarkers for tGSSs obtained from previous studies of high-grade serous ovarian cancer (HG-SOC) and breast cancer (BC). For each tGSS, the EFD of gID co-occurrences/overlaps with other rGSSs is characterized by a scale- and context-dependent Pareto-like frequency distribution function. Our results indicate that although there is little overlap between our tGSS and any individual rGSS, comparison of the EFD with the BFD suggests that beyond a confidence threshold, tested gIDs become more common in rGSSs than expected. This validates the use of our tGSS as individual or combined prognostic factors. Our method identifies SN and SC genes of a 36-gene prognostic signature that stratify HG-SOC patients into subgroups with low, intermediate or high risk of the disease outcome. Using 70 BC rGSSs, the method also predicted SN and SC BC prognostic genes from the tested obesity and IGF1 pathway GSSs.

Conclusions: Our method provides a strategy to identify or predict, within a tGSS of interest, gID subsets that are either SN or SC when compared with other rGSSs. Practically, our results suggest that the IGF1 signature genes are more strongly associated with the 70 BC rGSSs than the obesity-associated signature genes are. Furthermore, both SC and SN genes in both signatures could be considered prospective prognostic biomarkers of BC that stratify patients into groups at low or high risk of cancer development.

Figures

Figure 1
Definition of novel or common biomarkers. (A) Traditional definition of novel or common biomarkers. (B) Statistical definition of novel or common biomarkers. An additional vertical dimension provides a statistical measure of whether the signature gene is considered "novel".
Figure 2
Schema of gene list comparison with other defined sets. (A) Actual observations of gene list overlap between a single list of interest (AS0) and other defined sets. (B) Observations of gene list overlap in a simulation where the other defined gene sets are randomly and independently sampled without replacement. AS and RS denote actual and random sets, respectively. Om and ROm denote overlap segments and random overlap segments, respectively. The blue solid circle represents our gene list of interest (AS0). The green oval, red rectangle and yellow triangle represent 3 other defined sets of genes with sizes |AS1|, |AS2| and |AS3|, respectively.
Figure 3
Family of null frequency distributions of expected co-occurrences of our signature genes with other signatures. The horizontal axis represents the number of samples that contain the gene from our signature of interest. The dotted lines represent fitted Weibull function curves, whereas the dashed lines represent fitted sigmoid function curves.
Figure 4
Actual and expected frequency distributions of gene overlap between a query signature and other reference signatures. Comparison of genes from (A) the 36-gene ovarian cancer prognostic gene signature, (B) the tumor breast obesity gene signature and (C) the tumor breast IGF1 gene signature with other reference gene signatures for that disease. (D) Comparison of the actual frequency distributions generated from tumor breast obesity (from B) and tumor breast IGF1 (from C). The expected frequency distributions were generated by performing N simulations, where N is 100, 1000 or 10000. The y-axis is log10 transformed. p1 denotes the two-sided p-value from the Kolmogorov-Smirnov test of whether the actual and expected (for N = 100) distributions are similar. p2 denotes the p-value that represents the significance of the threshold in dichotomizing statistically novel or common biomarkers from a GSS of interest.
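As an illustration of the kind of two-sided Kolmogorov-Smirnov comparison mentioned in the Figure 4 caption, the following Python sketch computes the two-sample KS statistic and an asymptotic p-value. It is a generic textbook version written for this note, not the authors' code: the function name and the toy count data are hypothetical, and the asymptotic p-value is only approximate for small samples. A statistics library such as SciPy (`scipy.stats.ks_2samp`) would normally be preferred.

```python
import math

def ks_two_sample(x, y):
    """Two-sample Kolmogorov-Smirnov statistic D and asymptotic two-sided p-value."""
    x, y = sorted(x), sorted(y)
    n, m = len(x), len(y)
    i = j = 0
    d = 0.0
    # Walk through the pooled values and track the largest gap between the two ECDFs.
    while i < n and j < m:
        v = min(x[i], y[j])
        while i < n and x[i] == v:
            i += 1
        while j < m and y[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    if d == 0.0:
        return 0.0, 1.0
    # Asymptotic Kolmogorov distribution: p = 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 lam^2)
    lam = math.sqrt(n * m / (n + m)) * d
    total = 0.0
    for k in range(1, 100001):
        term = math.exp(-2.0 * (k * lam) ** 2)
        total += term if k % 2 else -term
        if term < 1e-12:
            break
    p = min(1.0, max(0.0, 2.0 * total))
    return d, p

# Hypothetical per-gene co-occurrence counts (toy values, not from the paper).
a = [1, 2, 3, 4, 5]
b = [6, 7, 8, 9, 10]

d_same, p_same = ks_two_sample(a, a)  # identical samples
d_diff, p_diff = ks_two_sample(a, b)  # completely separated samples
```

A small p-value here would indicate that the two samples are unlikely to come from the same distribution, which is the sense in which the caption's p1 compares the actual and expected distributions.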
Figure 5
Classification of high-grade serous ovarian cancer patients. The patients diagnosed with high-grade serous ovarian cancer were classified using a data-driven method for statistically novel biomarkers (A) FZD1 and (B) HGF and common biomarkers (C) COL3A1 and (D) EDNRA. Log-rank tests were used to assess the statistical significance of the survival difference between the two patient subgroups. Expr: expression.
Figure 6
Classification of breast cancer patients in both the Stockholm and Uppsala patient cohorts. The patients diagnosed with breast cancer were classified using a data-driven method for statistically novel biomarkers (A) PIK3C3 and (B) APPBP2 and common biomarkers (C) IL6ST and (D) DUSP6. Top panel: Stockholm breast cancer patient cohort. Bottom panel: Uppsala breast cancer patient cohort. Log-rank tests were used to assess the statistical significance of the survival difference between the two patient subgroups. Expr: expression.
