Impact of Clinical Data Veracity on Cancer Genomic Research

Sunali Mehta¹ ², Deborah Wright³, Michael A Black¹ ⁴, Arend Merrie⁵, Ahmad Anjomshoaa⁶, Fran Munro³, Anthony Reeve⁴, John McCall³, Cristin Print¹ ⁷

Affiliations

¹ Maurice Wilkins Centre, University of Auckland, Auckland, New Zealand.
² Pathology Department, University of Otago, Dunedin, New Zealand.
³ Department of Surgical Sciences, University of Otago, Dunedin, New Zealand.
⁴ Department of Biochemistry, University of Otago, Dunedin, New Zealand.
⁵ Department of Surgery, University of Auckland, Auckland, New Zealand.
⁶ Medical Genetics Division, Faculty of Medicine, Kerman University of Medical Sciences, Kerman, Iran.
⁷ Department of Molecular Medicine and Pathology, University of Auckland, Auckland, New Zealand.

PMID: 36255250
PMCID: PMC9648686
DOI: 10.1093/jncics/pkac070

JNCI Cancer Spectr. 2022 Nov 1;6(6):pkac070.

Abstract

Genomic analysis of tumors is transforming our understanding of cancer. However, although a great deal of attention is paid to the accuracy of the cancer genomic data itself, less attention has been paid to the accuracy of the associated clinical information that renders the genomic data useful for research. In this brief communication, we suggest that omissions and errors in clinical annotations have a major impact on the interpretation of cancer genomic data. We describe our discovery of annotation omissions and errors when reviewing an already carefully annotated colorectal cancer gene expression dataset from our laboratory. The potential importance of clinical annotation omissions and errors was then explored using simulation analyses with an independent genomic dataset. We suggest that the completeness and veracity of clinical annotations accompanying cancer genomic data require renewed focus by the oncology research community, when planning new collections and when interpreting existing cancer genomic data.

Figures

**Figure 1.**
Review and correction of the clinical annotations of a colorectal cancer gene expression dataset. Clinicopathological data for the 205 colorectal tumor samples collected prospectively in New Zealand between 1996 and 2013 are compared before and after clinical review and correction. A) Graphic illustrating the overall relationships between different types of clinical data omissions or errors, and the corrections they required. White bars indicate a data point missing in the original dataset. Pale shades indicate a data point that was correct in the original dataset and was unmodified by revision. Dark shades indicate an inaccurate data point in the original dataset, which was corrected in the revised dataset. Grey bars in the DFS column indicate a stage IV tumor not included in calculation of DFS. The black bar crossing all columns indicates a neuroendocrine tumor excluded from subsequent revision or analysis. Beneath this graphic are the number and percentage of missing and inaccurate data points of each type in the original clinical annotation. B) Kaplan-Meier DFS analysis is shown for the original clinical data (**blue**) and the revised data (**red**) for the same patient cohort. The x-axis shows time since diagnosis in years; the y-axis shows the proportion of patients remaining disease free. A log-rank test indicated that the difference between these survival curves is statistically significant (P = .005). Samples and data were obtained with informed consent and ethical approval from the New Zealand multicenter ethics committee. DFS = disease-free survival; OS = overall survival; M = metastasis; N = lymph nodes; T = tumor; TTR = time to relapse; stage = American Joint Committee on Cancer stage (7th ed).
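To make the effect described in panel B concrete, the following is a minimal Python sketch of a Kaplan-Meier estimate computed twice for the same patients, once with original and once with revised event annotations. It is not the authors' code: the function name `kaplan_meier` and the five-patient toy cohort are hypothetical, and the values are invented for illustration only.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : follow-up time (years) for each patient
    events : 1 if relapse or death was observed at that time, 0 if censored
    """
    event_counts = Counter(t for t, e in zip(times, events) if e)
    exit_counts = Counter(times)  # each patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving this time point.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exit_counts[t]
    return curve

# Hypothetical toy cohort (NOT data from the study).
follow_up_years = [1.0, 2.0, 3.0, 4.0, 5.0]
original_events = [1, 0, 1, 0, 0]
# Hypothetical review finding: the second patient's relapse had been missed.
revised_events = [1, 1, 1, 0, 0]

original_curve = kaplan_meier(follow_up_years, original_events)
revised_curve = kaplan_meier(follow_up_years, revised_events)

for label, curve in (("original", original_curve), ("revised", revised_curve)):
    print(label, [(t, round(s, 3)) for t, s in curve])
```

In this toy example a single corrected event lowers the estimated disease-free proportion at 3 years, which is the kind of shift a log-rank test would then assess on a real cohort.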

**Figure 2.**
1000-fold simulation analysis of the effect of missing and incorrect clinical annotations on gene expression-DFS associations for colorectal tumors. The Gene Expression Omnibus GSE17536 gene expression dataset was used (18). Cox proportional hazards survival models were used with a cut-off for statistical significance of a P value less than .05 without (**A, C, E**) and with (**B, D, F**) Benjamini-Hochberg multiple testing correction. Within each panel, boxplots indicate the distribution of the number of RNAs significantly associated with DFS (**left set of boxes**), RNAs that gained de novo statistically significant associations with DFS after the simulated changes (**middle set**), and RNAs that lost their statistically significant associations (**right set**). Various combinations of percentage of patients with data changes and the degree of these changes are shown. Simulations shown in A and B estimate the effect of data omission, C and D changes to alive or dead patient status, and E and F time to relapse. Lower and upper bounds of the boxes indicate the 25th and 75th percentiles, respectively; **horizontal lines** in boxes indicate the 50th percentile. Lower and upper bounds of the whiskers indicate the 5th and 95th percentiles, respectively. For example, in A, the leftmost (**yellow**) box plot represents the number of RNAs significantly associated with DFS (P < .05) in the original dataset. The **orange**, **red**, and **pink** boxes represent total, gain, or loss of RNAs significantly associated (P < .05) with DFS after 1000 simulated omissions of 5%, 10%, or 20% of data, respectively. G) Summarizes the effects of simulated loss or gain of clinical data on the ability to detect statistically significant prognostic associations. Distributions of sensitivity (= recall; TP/[TP + FN]), specificity (TN/[TN + FP]), positive predictive value ([PPV] = precision; TP/[TP + FP]), and negative predictive value ([NPV]; TN/[TN + FN]) are tabulated, where TP is true positive, TN is true negative, FP is false positive, and FN is false negative, with positive and negative defined as a Cox proportional hazards association between RNA transcript abundance and DFS with a P value of no more than .05 or a P value of more than .05, respectively. The original GSE17536 gene expression dataset is considered the gold standard for these simulations, in which sensitivity, specificity, PPV, and NPV would be considered to have a value of 1. Omission or change to components of clinical data is seen to affect sensitivity much more than specificity. This is expected, because although similar numbers of statistical associations with DFS are gained as are lost in the simulations (shown in **A-F**), in the original dataset, only a small proportion of genes have a statistically significant Cox proportional hazards association (P ≤ .05) with DFS. These calculations used the confusion matrix function of the caret package (25) in the R statistical software framework. DFS = disease-free survival.
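The definitions in panel G can be sketched in a few lines of code. The study used the confusion matrix function of the caret package in R; the Python below is a swapped-in illustration, not the authors' implementation. The function names `benjamini_hochberg` and `confusion_metrics` and the eight P values are hypothetical, chosen only to show how the four metrics follow from calls made on the original (gold standard) and simulated annotations.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def confusion_metrics(reference_p, simulated_p, alpha=0.05):
    """Treat reference calls (P <= alpha) as truth and score the simulated calls."""
    ref = [p <= alpha for p in reference_p]
    sim = [p <= alpha for p in simulated_p]
    tp = sum(r and s for r, s in zip(ref, sim))
    fn = sum(r and not s for r, s in zip(ref, sim))
    fp = sum(s and not r for r, s in zip(ref, sim))
    tn = sum(not r and not s for r, s in zip(ref, sim))

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }

# Hypothetical per-RNA Cox P values (NOT values from GSE17536).
reference_p = [0.001, 0.01, 0.03, 0.04, 0.20, 0.50, 0.70, 0.90]
simulated_p = [0.002, 0.02, 0.08, 0.30, 0.04, 0.60, 0.65, 0.95]

metrics = confusion_metrics(reference_p, simulated_p)
adjusted_reference = benjamini_hochberg(reference_p)
significant_after_bh = sum(p <= 0.05 for p in adjusted_reference)

print(metrics)
print(significant_after_bh)
```

In this toy case two associations are lost and one is gained. Because the gold standard contains only four positives, losing two of them halves sensitivity, whereas one false positive among four negatives lowers specificity only to 0.75. In a real dataset, where negatives far outnumber positives, this asymmetry would be more pronounced.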
