Sci Rep. 2023 May 6;13(1):7395.
doi: 10.1038/s41598-023-31126-5.

Generalising uncertainty improves accuracy and safety of deep learning analytics applied to oncology


Samual MacDonald et al. Sci Rep. 2023.

Abstract

Uncertainty estimation is crucial for understanding the reliability of deep learning (DL) predictions, and critical for deploying DL in the clinic. Differences between training and production datasets can lead to incorrect predictions with underestimated uncertainty. To investigate this pitfall, we benchmarked one pointwise and three approximate Bayesian DL models for predicting cancer of unknown primary, using three RNA-seq datasets with 10,968 samples across 57 cancer types. Our results highlight that simple and scalable Bayesian DL significantly improves the generalisation of uncertainty estimation. Moreover, we designed a prototypical metric: the area between the development and production curve (ADP), which evaluates the accuracy loss when deploying models from development to production. Using ADP, we demonstrate that Bayesian DL improves accuracy under data distributional shifts when utilising 'uncertainty thresholding'. In summary, Bayesian DL is a promising approach for generalising uncertainty, improving the performance, transparency, and safety of DL models deployed in the real world.
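The 'uncertainty thresholding' idea can be sketched in a few lines: estimate each sample's predictive entropy from stochastic forward passes and abstain (deferring to a clinician) when it exceeds a cutoff. This is an illustrative sketch, not the paper's implementation; the function names and the threshold `tau` are hypothetical.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy H of the mean predictive distribution.

    probs: array of shape (T, n_classes) -- softmax outputs from
    T stochastic forward passes (e.g. Monte Carlo Dropout samples).
    """
    mean_p = probs.mean(axis=0)
    return -np.sum(mean_p * np.log(mean_p + 1e-12))

def threshold_predictions(probs_per_sample, tau):
    """Keep predictions whose entropy is below tau; abstain otherwise.

    probs_per_sample: shape (n_samples, T, n_classes).
    Returns predicted class indices, with -1 marking abstentions.
    """
    preds = []
    for probs in probs_per_sample:
        if predictive_entropy(probs) < tau:
            preds.append(int(probs.mean(axis=0).argmax()))
        else:
            preds.append(-1)  # too uncertain: defer to an expert
    return np.array(preds)
```

Raising `tau` retains more (but less reliable) automated predictions; lowering it trades coverage for accuracy on the retained set.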


Conflict of interest statement

M.Y., H.F., S.M., K.S. and M.T. are employed by Max Kelsen, which is a commercial company with an embedded research team. J.V.P. and N.W. are founders and shareholders of genomiQa Pty Ltd, and members of its Board. S.S., A.B., O.K., V.A., S.W, L.T.K. and R.L.J have no competing interests.

Figures

Figure 1
Overview of the study design. (a) Simplified study workflow. TCGA primary cancer types comprised the training and IID validation data. OOD test data comprised the TCGA (metastatic cancer types), Met500 and ICD datasets, which included primary, metastatic and ‘unseen’ cancer types. (b) Schematic overview of the four tested models: pointwise Resnet (Resnet), Resnet extended with Monte Carlo Dropout (MCD), MCD extended with a bi-Lipschitz constraint (Bilipschitz), and an ensemble of Bilipschitz models (Ensemble). Note that Resnet represents a single point in function space (blue dot), while the two Bayesian models (MCD and Bilipschitz) each represent a distribution within a single region of function space (green dots). The Ensemble represents a collection of distributions centred around different modes (red dots).
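As a rough illustration of how the Ensemble's predictive distribution (red dots) could be formed from its members' outputs, the softmax probabilities of independently trained networks may simply be averaged. This is a sketch under assumed array shapes, not the authors' code.

```python
import numpy as np

def ensemble_predict(member_probs):
    """Average the softmax outputs of M ensemble members.

    member_probs: shape (M, n_samples, n_classes), one slice per
    independently trained network (each slice may itself already be
    an average over Monte Carlo Dropout samples). Averaging across
    members mixes the distinct modes into a single predictive
    distribution per sample.
    """
    return np.mean(member_probs, axis=0)  # shape (n_samples, n_classes)
```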
Figure 2
Out-of-distribution overconfidence of a pointwise baseline Resnet model and three simple Bayesian models on ‘seen’ data. (a) Micro-F1 score (i.e., accuracy) of all models on the IID validation data (left) and on ‘seen’ OOD data (right). Accuracy on the (IID) validation data was controlled with early stopping. (b) Box plot of each model’s predictive uncertainty (Shannon’s entropy, H) for individual samples on IID data (left) and on ‘seen’ OOD data (right). The sample median is depicted by the horizontal line, and the sample mean by the grey star. Statistical significance (single-sided Wilcoxon rank-sum) between the baseline and each Bayesian model is denoted by *, **, and *** for p value < 0.05, < 0.01, and < 0.001, respectively. (c) Each model’s confidence vs accuracy for each ECE bin on ‘seen’ OOD data. The black diagonal lines illustrate perfect calibration, i.e., no overconfidence. The ECE value for each model is shown in parentheses. The residuals are colour-coded by the (left) colour scale and represent the difference between confidence and accuracy for each bin. (d) Box plot of each model’s absolute calibration error for individual samples on IID data (left) and ‘seen’ OOD data (right). Statistical significance (single-sided Wilcoxon rank-sum) between the baseline and each Bayesian model is denoted by *, **, and *** for p value < 0.05, < 0.01, and < 0.001, respectively.
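For reference, the expected calibration error (ECE) shown in panel (c) is commonly computed by binning predictions by confidence and averaging the confidence-accuracy gap, weighted by each bin's occupancy. The sketch below assumes equal-width bins; it is a generic illustration, not the paper's exact procedure.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then sum each bin's
    |confidence - accuracy| gap weighted by its fraction of samples.

    probs: (n_samples, n_classes) softmax outputs.
    labels: (n_samples,) integer class labels.
    """
    conf = probs.max(axis=1)            # top-class confidence
    pred = probs.argmax(axis=1)         # predicted class
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap    # weight by bin occupancy
    return ece
```

A perfectly calibrated model (confidence equals accuracy in every bin) yields an ECE of zero; overconfident models inflate it.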
Figure 3
Total uncertainties for out-of-distribution data with cancer types ‘seen’ and ‘unseen’ in training. (a) Box plot of each model’s predictive uncertainty (Shannon’s entropy, H) on OOD data with cancer types ‘seen’ (left) and ‘unseen’ (right) during training. Statistical significance (two-sided Wilcoxon rank-sum) between the baseline and each Bayesian model is denoted by *, **, and *** for p value < 0.05, < 0.01, and < 0.001, respectively. Stars denote the mean, horizontal centre lines denote the median, and notches span the 95% confidence interval of the median total uncertainty. (b) Total uncertainty values for the ‘unseen’ classes. The horizontal red lines denote median total uncertainty values.
Figure 4
Evaluation of model generalisability from development to production. (a) F1-retention curves and corresponding F1-AUC scores for the (baseline) Resnet model and the three approximate Bayesian models (MCD, Bilipschitz, Ensemble). As the retention fraction decreases, more of the most uncertain predictions are replaced with the ground truth; thus, steeper curves require a stronger correlation between uncertainty and the error rate. The F1-retention area under the curve (F1-AUC) for each model is detailed in the legend. The F1-AUC is a function of both predictive performance (micro-F1) and the correlation between uncertainty and error rate. (b) Development and production F1-uncertainty curves for each model. The figure illustrates the development F1(IID)-uncertainty curves (continuous lines) and the production F1(OOD)-uncertainty curves (dashed lines). Black lines illustrate the F1 decrease from a single development F1 score of F1dev = 98.5% for all models. The area between the development and production curve (ADP) is shown as the coloured region. (c) ADP bar plot with bootstrapped confidence intervals. ADP is the average F1 decrease calculated between F1dev = 97.5% and F1dev = 99.0% at intervals of 0.001%. Steps for calculating the ADP are detailed in the Methods.
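A minimal sketch of the F1-retention curve construction described in panel (a), assuming micro-F1 reduces to accuracy for single-label multiclass prediction; the function and argument names are hypothetical, not the authors' implementation.

```python
import numpy as np

def f1_retention_curve(preds, labels, uncertainty, fractions):
    """Compute the retention curve: at retention fraction r, the
    (1 - r) most uncertain predictions are replaced with the ground
    truth, mimicking referral of those samples to an expert.

    preds, labels: (n_samples,) integer class arrays.
    uncertainty: (n_samples,) per-sample uncertainty scores.
    fractions: iterable of retention fractions in (0, 1].
    """
    order = np.argsort(uncertainty)[::-1]  # most uncertain first
    n = len(preds)
    scores = []
    for r in fractions:
        fixed = preds.copy()
        n_replace = int(round((1 - r) * n))
        fixed[order[:n_replace]] = labels[order[:n_replace]]
        scores.append((fixed == labels).mean())  # micro-F1 == accuracy here
    return np.array(scores)
```

Integrating these scores over the retention fractions (e.g. with the trapezoidal rule) gives an F1-AUC-style summary: it improves both when the base model is more accurate and when its uncertainty ranks errors well.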

