JNCI Cancer Spectr. 2022 Nov 1;6(6):pkac070.
doi: 10.1093/jncics/pkac070.

Impact of Clinical Data Veracity on Cancer Genomic Research

Sunali Mehta et al. JNCI Cancer Spectr. 2022.

Abstract

Genomic analysis of tumors is transforming our understanding of cancer. However, although a great deal of attention is paid to the accuracy of the cancer genomic data itself, less attention has been paid to the accuracy of the associated clinical information that renders the genomic data useful for research. In this brief communication, we suggest that omissions and errors in clinical annotations have a major impact on the interpretation of cancer genomic data. We describe our discovery of annotation omissions and errors when reviewing an already carefully annotated colorectal cancer gene expression dataset from our laboratory. The potential importance of clinical annotation omissions and errors was then explored using simulation analyses with an independent genomic dataset. We suggest that the completeness and veracity of clinical annotations accompanying cancer genomic data require renewed focus by the oncology research community, when planning new collections and when interpreting existing cancer genomic data.

Figures

Figure 1.
Review and correction of the clinical annotations of a colorectal cancer gene expression dataset. Clinicopathological data pertaining to the 205 colorectal tumor samples that were collected prospectively in New Zealand between 1996 and 2013 is compared before and after clinical review and correction. A) Graphic illustrating the overall relationships between different types of clinical data omissions or errors, and the corrections they required. White bars indicate a data point missing in the original dataset. Pale shades indicate a data point that was correct in the original data set and was unmodified by revision. Dark shades indicate an inaccurate data point in the original dataset, which was corrected in the revised dataset. Grey bars in the DFS column indicate a stage IV tumor not included in calculation of DFS. The black bar crossing all columns indicates a neuroendocrine tumor excluded from subsequent revision or analysis. Beneath this graphic is the number and % of missing and inaccurate data of each type in the original clinical annotation. B) Kaplan-Meier DFS analysis is shown for the original clinical data (blue) and the revised data (red) for the same patient cohort. The x-axis shows time since diagnosis in years; the y-axis shows proportion of patients remaining disease free. A log-rank test indicated that the difference between these survival curves is statistically significant (P = .005). Samples and data have been obtained with informed consent and ethical approval from the New Zealand multicenter ethics committee. DFS = disease-free survival; OS = overall survival; M = metastasis; N = lymph nodes; T = tumor; TTR = time to relapse; stage = American Joint Committee on Cancer stage (7th ed).
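The Kaplan-Meier comparison described for panel B can be illustrated with a minimal sketch. This is not the authors' analysis code: the function name `kaplan_meier` and the five-patient toy cohort are assumptions made purely for illustration, and no log-rank test is performed. It only shows how correcting a single relapse status shifts the estimated disease-free proportion.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[bool]) -> List[Tuple[float, float]]:
    """Return (time, disease-free proportion) at each distinct event time.

    events[i] is True if relapse was observed at times[i],
    False if the patient was censored at that time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        relapses = 0   # observed events at time t
        leaving = 0    # events plus censorings at time t
        while i < len(data) and data[i][0] == t:
            relapses += int(data[i][1])
            leaving += 1
            i += 1
        if relapses > 0:
            survival *= 1 - relapses / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


# Hypothetical follow-up times in years (illustrative values only).
times = [1.0, 2.0, 3.0, 4.0, 5.0]

# "Original" annotation: the patient at 2 years is recorded as censored.
original_events = [True, False, True, False, True]
# "Revised" annotation: review finds that this patient actually relapsed.
revised_events = [True, True, True, False, True]

original_curve = kaplan_meier(times, original_events)
revised_curve = kaplan_meier(times, revised_events)

for label, curve in (("original", original_curve), ("revised", revised_curve)):
    print(label, [(t, round(s, 3)) for t, s in curve])
```

In this toy cohort, the single corrected status lowers the estimated disease-free proportion at 3 years from about 0.53 to 0.40.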
Figure 2.
1000-fold simulation analysis of the effect of missing and incorrect clinical annotations on gene expression-DFS associations for colorectal tumors. The Gene Expression Omnibus GSE17536 gene expression data set was used (18). Cox proportional hazards survival models were used with a cut-off for statistical significance of a P value less than .05 without (A, C, E) and with (B, D, F) Benjamini-Hochberg multiple testing correction. Within each panel, boxplots indicate the distribution of the number of RNAs significantly associated with DFS (left set of boxes), RNAs that gained de novo statistically significant associations with DFS after the simulated changes (middle set), and RNAs that lost their statistically significant associations (right set). Various combinations of percentage of patients with data changes and the degree of these changes are shown. Simulations shown in A and B estimate the effect of data omission, C and D changes to alive or dead patient status, and E and F time to relapse. Lower and upper bounds of the boxes indicate the 25th and 75th percentiles, respectively; horizontal lines in boxes indicate the 50th percentile. Lower and upper bounds of the whiskers indicate 5th and 95th percentile, respectively. For example in A, the leftmost (yellow) box plot represents number of RNAs significantly associated with DFS (P < .05) in the original dataset. The orange, red, and pink boxes represent total, gain, or loss of RNAs significantly associated (P < .05) with DFS after 1000 simulated omissions of 5%, 10%, or 20% of data, respectively. G) Summarizes the effects of simulated loss or gain of clinical data on the ability to detect statistically significant prognostic associations. 
Distributions of sensitivity (= recall; TP/[TP + FN]), specificity (TN/[TN + FP]), positive predictive value ([PPV] = precision; TP/[TP+FP]), and negative predictive value ([NPV]; TN/[TN+FN]) are tabulated, where TP is true positive, TN is true negative, FP is false positive, and FN is false negative, with positive and negative defined as Cox proportional hazards association between RNA transcript abundance and DFS of a P value of no more than .05 or a P value of more than .05, respectively. The original GSE17536 gene expression data set is considered the gold standard for these simulations, in which sensitivity, specificity, PPV, and NPV would be considered to have a value of 1. Omission or change to components of clinical data is seen to affect sensitivity much more than specificity. This is expected, because although similar numbers of statistical associations with DFS are gained as are lost in the simulations (shown in A-F), in the original data set, only a small proportion of genes have a statistically significant Cox proportional hazards association (P ≤ .05) with DFS. These calculations used the confusion matrix function of the caret package (25) in the R statistical software framework. DFS = disease-free survival.
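The sensitivity, specificity, PPV, and NPV definitions given in the caption can be made concrete with a short sketch. The caption states that the authors used the confusion matrix function of the caret package in R; the Python below is a standard-library stand-in, not their code. The function names and the eight P values are hypothetical, chosen only to show the arithmetic, with the original annotations treated as the gold standard.

```python
from typing import Dict, List


def is_significant(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Call an RNA 'positive' when its Cox P value is no more than alpha."""
    return [p <= alpha for p in p_values]


def confusion_metrics(reference: List[bool], observed: List[bool]) -> Dict[str, float]:
    """Compare calls from perturbed annotations against the gold standard."""
    pairs = list(zip(reference, observed))
    tp = sum(1 for r, o in pairs if r and o)
    tn = sum(1 for r, o in pairs if not r and not o)
    fp = sum(1 for r, o in pairs if not r and o)
    fn = sum(1 for r, o in pairs if r and not o)

    def ratio(numerator: int, denominator: int) -> float:
        # Undefined ratios (zero denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # recall
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),          # precision
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical P values for eight RNAs (illustrative values only).
original_p = [0.01, 0.03, 0.20, 0.50, 0.04, 0.80, 0.60, 0.30]   # gold standard
perturbed_p = [0.02, 0.08, 0.04, 0.55, 0.06, 0.70, 0.65, 0.25]  # after simulated changes

metrics = confusion_metrics(is_significant(original_p), is_significant(perturbed_p))
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here two associations are lost and one is gained. Because only three of the eight RNAs are positive in the gold standard, sensitivity falls to 1/3 while specificity stays at 4/5, which mirrors the asymmetry the caption describes.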

References

    1. Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012;487(7407):330-337.
    2. Kazantseva M, Mehta S, Eiholzer RA, et al. The Δ133p53β isoform promotes an immunosuppressive environment leading to aggressive prostate cancer. Cell Death Dis. 2019;10(9):631.
    3. Lawrence B, Blenkiron C, Parker K, et al. Recurrent loss of heterozygosity correlates with clinical outcome in pancreatic neuroendocrine cancer. NPJ Genom Med. 2018;3:18.
    4. Lasham A, Knowlton N, Mehta SY, et al. Breast cancer patient prognosis is determined by the interplay between TP53 mutation and alternative transcript expression: insights from TP53 long amplicon digital PCR assays. Cancers (Basel). 2021;13(7):1531.
    5. Muthukaruppan A, Lasham A, Woad KJ, et al. Multimodal assessment of estrogen receptor mRNA profiles to quantify estrogen pathway activity in breast tumors. Clin Breast Cancer. 2016;17(2):139-153. doi:10.1016/j.clbc.2016.09.001.
