2016 Jun 28;7(26):40200-40220.
doi: 10.18632/oncotarget.9571.

Big data and computational biology strategy for personalized prognosis

Ghim Siong Ow et al. Oncotarget.

Abstract

The era of big data and precision medicine has led to the accumulation of massive datasets of gene expression data and clinical information of patients. For a new patient, we propose that identification of a highly similar reference patient from an existing patient database via similarity matching of both clinical and expression data could be useful for predicting the prognostic risk or therapeutic efficacy. Here, we propose a novel methodology to predict disease/treatment outcome via analysis of the similarity between any pair of patients who are each characterized by a certain set of pre-defined biological variables (biomarkers or clinical features) represented initially as a prognostic binary variable vector (PBVV) and subsequently transformed to a prognostic signature vector (PSV). Our analyses revealed that Euclidean distance, rather than a correlation distance measure, was effective in defining an unbiased similarity measure calculated between two PSVs. We applied our methods to high-grade serous ovarian cancer (HGSC) based on a 36-mRNA predictor that was previously shown to stratify patients into 3 distinct prognostic subgroups. We found that a patient's age, when converted into a binary variable, was positively correlated with the overall risk of succumbing to the disease. When applied to an independent testing dataset, the inclusion of age in the molecular predictor provided a more robust personalized prognosis of overall survival that correlated with the therapeutic response of HGSC, and provided benefit for treatment targeting of the tumors in HGSC patients. Finally, our method can be generalized and implemented in many other diseases to accurately predict personalized patient outcomes.
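The similarity-matching idea in the abstract can be illustrated with a minimal Python sketch. It assumes each patient is already encoded as a fixed-length numeric vector (for example a binary PBVV, with 0 for a low-risk call and 1 for a high-risk call) and returns the reference patient at the smallest Euclidean distance from the query. The patient identifiers, the toy vectors and the function names are hypothetical, and the PBVV-to-PSV transformation is not described in the abstract, so it is not reproduced here.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def most_similar_reference(query, references):
    """Return (patient_id, distance) for the reference closest to the query.

    `references` maps a patient id to that patient's vector.
    """
    return min(
        ((pid, euclidean_distance(query, vec)) for pid, vec in references.items()),
        key=lambda item: item[1],
    )


# Hypothetical toy data: 5 binary variables per patient (illustration only).
references = {
    "ref_A": [0, 0, 1, 0, 0],
    "ref_B": [1, 1, 1, 0, 1],
    "ref_C": [1, 1, 0, 1, 1],
}
query = [1, 1, 1, 0, 0]

best_id, best_dist = most_similar_reference(query, references)
print(best_id, best_dist)
```

In this toy example the query differs from ref_B in a single variable, so ref_B is selected; the known outcome of the matched reference patient would then inform the prognosis for the query patient.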

Keywords: aging; big data; ovarian cancer; personalized prognosis; risk stratification.

Conflict of interest statement

All authors declare no conflict of interests and no competing financial interests.

Figures

Figure 1. Proposed schema of big data and strategy for personalized diagnosis, prognosis or prediction of therapy success
Figure 2. Stratification of reference patients from the TCGA training cohort into two prognostic subgroups based on their age at diagnosis
One-dimensional data-driven grouping method was used as the classification method. A. Plot of stratification log-rank p-value (transformed y-axis) against the patients’ age cut-off value. B. Plot of hazard ratio against the patients’ age cut-off value. C. Histogram and cumulative distribution of the patients’ age at diagnosis. D. Survival curves, E. hazard curves and F. cumulative hazard curves of the training TCGA cohort of 349 HGSC patients.
Figure 3
A. Heatmap of risk classification for 37 variables and 349 HGSC from the TCGA cohort. B. Kaplan-Meier survival curves of the four patient prognostic subgroups. C–D. Nelson-Aalen estimated cumulative hazard curves and hazard curves of the four patient prognostic subgroups. The 37 variables comprise 36 mRNA expression variables and 1 clinical variable (age). The risk classification for variable and patient was assessed using the 1D-DDg method. The average weighted risk (AWR) value for each patient across all 37 variables was calculated and used for ranking patient samples. The patient cohort was arbitrarily classified into four equal-sized sub-groups based on their AWR values.
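The ranking step in this caption can be sketched as follows. The caption does not give the AWR formula, so this sketch assumes it is a weighted mean of a patient's per-variable binary risk calls, with equal weights unless others are supplied; that definition, the weights and the patient identifiers are all assumptions for illustration. Patients are then ranked by AWR and split into four equal-sized groups, as the caption describes.

```python
def average_weighted_risk(risk_calls, weights=None):
    """Weighted mean of per-variable risk calls (0 = low risk, 1 = high risk).

    ASSUMPTION: the exact AWR weighting is not stated in the caption;
    equal weights are used unless `weights` is given.
    """
    if weights is None:
        weights = [1.0] * len(risk_calls)
    if len(weights) != len(risk_calls):
        raise ValueError("weights and risk_calls must have the same length")
    return sum(w * r for w, r in zip(weights, risk_calls)) / sum(weights)


def split_into_four_groups(awr_by_patient):
    """Rank patients by AWR and assign group 1 (lowest) to 4 (highest)."""
    ranked = sorted(awr_by_patient, key=awr_by_patient.get)
    n = len(ranked)
    return {pid: rank * 4 // n + 1 for rank, pid in enumerate(ranked)}


# Hypothetical AWR values for eight patients (illustration only).
awr = {
    "p1": 0.10, "p2": 0.90, "p3": 0.40, "p4": 0.55,
    "p5": 0.20, "p6": 0.75, "p7": 0.35, "p8": 0.60,
}
groups = split_into_four_groups(awr)
print(groups)
```

With eight patients each group receives two members; when the cohort size is not divisible by four, the integer arithmetic makes the groups as equal as possible.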
Figure 4
Scatter plots of A-C. Euclidean distance and D-F. Kendall's tau rank correlation coefficient against average weighted risk (AWR) values calculated for three representative testing samples GSM249732, GSM249737 and GSM249853 against each reference sample in the training cohort (n=349). Each query sample and reference sample is represented by a prognostic binary variable vector (PBVV). Each point on the plot represents one of the 349 reference samples, and the y-axis represents the value of the Euclidean distance or Kendall's tau rank correlation coefficient with the testing sample. The colors blue, green and red correspond to the low, intermediate and high prognostic risk groups of the reference patients. The x-axis represents the AWR values associated with each reference sample. (All results can be found in Supplementary Files 2 and 3).
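For readers unfamiliar with the second measure in this comparison, the following minimal sketch computes Kendall's tau between two vectors. The caption does not say which variant the authors used; the tau-b variant, which corrects for ties, is assumed here because binary vectors contain many tied values. The function name and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length vectors.

    Returns None when either vector is constant, because the
    coefficient is undefined in that case.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    concordant = discordant = tied_x = tied_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    if denom == 0:
        return None
    return (concordant - discordant) / denom


# Hypothetical binary vectors (illustration only).
same = kendall_tau_b([0, 1, 1, 0], [0, 1, 1, 0])      # identical patterns
opposite = kendall_tau_b([0, 1, 1, 0], [1, 0, 0, 1])  # mirrored patterns
constant = kendall_tau_b([1, 1, 1, 1], [0, 1, 1, 0])  # no variation in x
print(same, opposite, constant)
```

One property worth noting is that a rank correlation is undefined for a vector with no variation (for example a patient called high risk on every variable), whereas a Euclidean distance can always be computed.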
Figure 5
Two dimensional representation of A. PBVVs and B. PSVs across the 37 variables for one representative query patient and three representative reference patients. Euclidean distances between the query patient and the three reference patients are shown inset.
Figure 6
Scatter plots of A-C. Euclidean distance and D-F. Kendall's tau rank correlation coefficient against average weighted risk (AWR) values calculated for three representative testing samples GSM249732, GSM249737 and GSM249853 against each reference sample in the training cohort (n=349). Each query sample and reference sample is represented by a prognostic signature vector (PSV). Each point on the plot represents one of the 349 reference samples, and the y-axis represents the value of the Euclidean distance or Kendall's tau rank correlation coefficient with the testing sample. The colors blue, green and red correspond to the low, intermediate and high prognostic risk groups of the reference patients. The x-axis represents the AWR values associated with each reference sample. (All results can be found in Supplementary Files 4 and 5).
Figure 7
A. Stratification curves of the training cohort of 349 TCGA HGSC patients obtained via a 37-variable classifier comprising 36 mRNA variables and 1 age variable. B. Stratification curves of the testing cohort of 359 patients from GSE9899 and GSE36712 obtained via the 37-variable classifier comprising 36 mRNA variables and 1 age variable. C. Stratification curves of the testing cohort of 360 patients from GSE9899 and GSE36712 obtained via a 36-variable classifier comprising 36 mRNA variables. The p-values were calculated via multivariate log-rank test. Left panel: Kaplan-Meier curves; Middle panel: Cumulative hazard curves; Right panel: Hazard curves.
