Sci Rep. 2023 May 6;13(1):7395.
doi: 10.1038/s41598-023-31126-5.

Generalising uncertainty improves accuracy and safety of deep learning analytics applied to oncology


Samual MacDonald et al. Sci Rep. 2023.

Abstract

Uncertainty estimation is crucial for understanding the reliability of deep learning (DL) predictions, and critical for deploying DL in the clinic. Differences between training and production datasets can lead to incorrect predictions with underestimated uncertainty. To investigate this pitfall, we benchmarked one pointwise and three approximate Bayesian DL models for predicting cancer of unknown primary, using three RNA-seq datasets with 10,968 samples across 57 cancer types. Our results highlight that simple and scalable Bayesian DL significantly improves the generalisation of uncertainty estimation. Moreover, we designed a prototypical metric: the area between the development and production curve (ADP), which evaluates the accuracy loss when deploying models from development to production. Using ADP, we demonstrate that Bayesian DL improves accuracy under data distributional shifts when utilising 'uncertainty thresholding'. In summary, Bayesian DL is a promising approach for generalising uncertainty, improving the performance, transparency, and safety of DL models deployed in the real world.
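The 'uncertainty thresholding' idea can be sketched in a few lines: estimate each sample's predictive entropy from stochastic forward passes and abstain (deferring to a clinician) when it exceeds a cutoff. This is an illustrative sketch, not the paper's implementation; the function names and the threshold `tau` are hypothetical.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy H of the mean predictive distribution.

    probs: array of shape (T, n_classes) -- softmax outputs from
    T stochastic forward passes (e.g. Monte Carlo Dropout samples).
    """
    mean_p = probs.mean(axis=0)
    return -np.sum(mean_p * np.log(mean_p + 1e-12))

def threshold_predictions(probs_per_sample, tau):
    """Keep predictions whose entropy is below tau; abstain otherwise.

    probs_per_sample: shape (n_samples, T, n_classes).
    Returns predicted class indices, with -1 marking abstentions.
    """
    preds = []
    for probs in probs_per_sample:
        if predictive_entropy(probs) < tau:
            preds.append(int(probs.mean(axis=0).argmax()))
        else:
            preds.append(-1)  # too uncertain: defer to an expert
    return np.array(preds)
```

Raising `tau` retains more (but less reliable) automated predictions; lowering it trades coverage for accuracy on the retained set.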


Conflict of interest statement

M.Y., H.F., S.M., K.S. and M.T. are employed by Max Kelsen, which is a commercial company with an embedded research team. J.V.P. and N.W. are founders and shareholders of genomiQa Pty Ltd, and members of its Board. S.S., A.B., O.K., V.A., S.W, L.T.K. and R.L.J have no competing interests.

Figures

Figure 1
Overview of the study design. (a) Simplified study workflow. TCGA primary cancer types comprised the training and IID validation data. OOD test data comprised the TCGA (metastatic cancer types), Met500 and ICD datasets, which included primary, metastatic and ‘unseen’ cancer types. (b) Schematic overview of the four tested models: pointwise Resnet (Resnet), Resnet extended with Monte Carlo Dropout (MCD), MCD extended with a bi-Lipschitz constraint (Bilipschitz), and an ensemble of Bilipschitz models (Ensemble). Note that Resnet represents a single point in function space (blue dot), while the two Bayesian models (MCD and Bilipschitz) each represent a distribution within a single region of function space (green dots). The Ensemble represents a collection of distributions centred around different modes (red dots).
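As a rough illustration of how the Ensemble's predictive distribution (red dots) could be formed from its members' outputs, the softmax probabilities of independently trained networks may simply be averaged. This is a sketch under assumed array shapes, not the authors' code.

```python
import numpy as np

def ensemble_predict(member_probs):
    """Average the softmax outputs of M ensemble members.

    member_probs: shape (M, n_samples, n_classes), one slice per
    independently trained network (each slice may itself already be
    an average over Monte Carlo Dropout samples). Averaging across
    members mixes the distinct modes into a single predictive
    distribution per sample.
    """
    return np.mean(member_probs, axis=0)  # shape (n_samples, n_classes)
```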
Figure 2
Out-of-distribution overconfidence of a pointwise baseline Resnet model and three simple Bayesian models on ‘seen’ data. (a) Micro-F1 score (i.e., accuracy) of all models on the IID validation data (left) and on ‘seen’ OOD data (right). Accuracy on the (IID) validation data was controlled with early stopping. (b) Box plot of each model’s predictive uncertainty (Shannon’s entropy, H) for individual samples on IID data (left) and on ‘seen’ OOD data (right). The sample median is depicted by the horizontal line, and the sample mean by the grey star. Statistical significance (single-sided Wilcoxon rank-sum) between the baseline and each Bayesian model is denoted by *, **, and *** for p value < 0.05, < 0.01, and < 0.001, respectively. (c) Each model’s confidence vs accuracy for each ECE bin on ‘seen’ OOD data. The black diagonal lines illustrate perfect calibration, i.e., no overconfidence. The ECE value for each model is shown in parentheses. The residuals are colour-coded by the (left) colour scale and represent the difference between confidence and accuracy for each bin. (d) Box plot of each model’s absolute calibration error for individual samples on IID data (left) and ‘seen’ OOD data (right). Statistical significance (single-sided Wilcoxon rank-sum) between the baseline and each Bayesian model is denoted by *, **, and *** for p value < 0.05, < 0.01, and < 0.001, respectively.
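For reference, the expected calibration error (ECE) shown in panel (c) is commonly computed by binning predictions by confidence and averaging the confidence-accuracy gap, weighted by each bin's occupancy. The sketch below assumes equal-width bins; it is a generic illustration, not the paper's exact procedure.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then sum each bin's
    |confidence - accuracy| gap weighted by its fraction of samples.

    probs: (n_samples, n_classes) softmax outputs.
    labels: (n_samples,) integer class labels.
    """
    conf = probs.max(axis=1)            # top-class confidence
    pred = probs.argmax(axis=1)         # predicted class
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap    # weight by bin occupancy
    return ece
```

A perfectly calibrated model (confidence equals accuracy in every bin) yields an ECE of zero; overconfident models inflate it.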
Figure 3
Total uncertainties for out-of-distribution data with cancer types ‘seen’ and ‘unseen’ in training. (a) Box plot of each model’s predictive uncertainty (Shannon’s entropy, H) on OOD data with cancer types ‘seen’ (left) and ‘unseen’ (right) during training. Statistical significance (two-sided Wilcoxon rank-sum) between the baseline and each Bayesian model is denoted by *, **, and *** for p value < 0.05, < 0.01, and < 0.001, respectively. Stars denote the mean, horizontal centre lines denote the median, and notches span the 95% confidence interval of the median total uncertainty. (b) Total uncertainty values for the ‘unseen’ classes. The horizontal red lines denote median total uncertainty values.
Figure 4
Evaluation of model generalisability from development to production. (a) F1-retention curves and corresponding F1-AUC scores for the (baseline) Resnet model and the three approximate Bayesian models (MCD, Bilipschitz, Ensemble). As the retention fraction decreases, more of the most uncertain predictions are replaced with the ground truth; thus, steeper curves require a stronger correlation between uncertainty and the error rate. The F1-retention area under the curve (F1-AUC) for each model is detailed in the legend. The F1-AUC is a function of both predictive performance (micro-F1) and the correlation between uncertainty and error rate. (b) Development and production F1-uncertainty curves for each model. The figure illustrates the development F1(IID)-uncertainty curves (continuous lines) and the production F1(OOD)-uncertainty curves (dashed lines). Black lines illustrate the F1 decrease from a single development F1 score of F1dev = 98.5% for all models. The area between the development and production curve (ADP) is shown as the coloured region. (c) ADP bar plot with bootstrapped confidence intervals. ADP is the average F1 decrease calculated between F1dev = 97.5% and F1dev = 99.0% at intervals of 0.001%. Steps for calculating the ADP are detailed in the Methods.
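A minimal sketch of the F1-retention curve construction described in panel (a), assuming micro-F1 reduces to accuracy for single-label multiclass prediction; the function and argument names are hypothetical, not the authors' implementation.

```python
import numpy as np

def f1_retention_curve(preds, labels, uncertainty, fractions):
    """Compute the retention curve: at retention fraction r, the
    (1 - r) most uncertain predictions are replaced with the ground
    truth, mimicking referral of those samples to an expert.

    preds, labels: (n_samples,) integer class arrays.
    uncertainty: (n_samples,) per-sample uncertainty scores.
    fractions: iterable of retention fractions in (0, 1].
    """
    order = np.argsort(uncertainty)[::-1]  # most uncertain first
    n = len(preds)
    scores = []
    for r in fractions:
        fixed = preds.copy()
        n_replace = int(round((1 - r) * n))
        fixed[order[:n_replace]] = labels[order[:n_replace]]
        scores.append((fixed == labels).mean())  # micro-F1 == accuracy here
    return np.array(scores)
```

Integrating these scores over the retention fractions (e.g. with the trapezoidal rule) gives an F1-AUC-style summary: it improves both when the base model is more accurate and when its uncertainty ranks errors well.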

