Review

Semi-supervised learning in cancer diagnostics

Jan-Niklas Eckardt et al. Front Oncol. 2022 Jul 14;12:960984. doi: 10.3389/fonc.2022.960984. eCollection 2022.

Abstract

In cancer diagnostics, a considerable amount of data is acquired during routine work-up. Recently, machine learning has been used to build classifiers tasked with cancer detection that aid in clinical decision-making. Most of these classifiers are based on supervised learning (SL), which requires time- and cost-intensive manual labeling of samples by medical experts for model training. Semi-supervised learning (SSL), in contrast, works with only a fraction of labeled data by including unlabeled samples for information abstraction, and can thus exploit the vast discrepancy between the labeled data available in cancer diagnostics and the overall data available. In this review, we provide a comprehensive overview of the essential functionalities and assumptions of SSL and survey key studies in cancer care, differentiating between image-based and non-image-based applications. We highlight current state-of-the-art models in histopathology, radiology and radiotherapy, as well as genomics. Further, we discuss potential pitfalls in SSL study design, such as discrepancies in data distributions and comparisons to baseline SL models, and point out future directions for SSL in oncology. We believe that well-designed SSL models can contribute strongly to computer-guided diagnostics in malignant disease by overcoming the current hindrance of sparse labeled and abundant unlabeled data.
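
As an illustration of the workflow just described, the sketch below trains a classifier with only a small fraction of labels and a large unlabeled pool. It is a minimal toy example under assumed choices (scikit-learn, a synthetic two-moons data set, an SVM base learner, self-training), not a method taken from the reviewed studies:

    # Minimal SSL sketch; toy data and estimator choices are assumptions.
    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Hide 95% of the training labels (-1 marks "unlabeled" in scikit-learn),
    # mimicking the sparse-label regime typical of medical data sets.
    rng = np.random.default_rng(0)
    y_semi = y_train.copy()
    y_semi[rng.random(len(y_semi)) < 0.95] = -1

    # Self-training: the base classifier iteratively pseudo-labels confident
    # unlabeled samples and is retrained on the enlarged labeled set.
    model = SelfTrainingClassifier(SVC(probability=True), threshold=0.9)
    model.fit(X_train, y_semi)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))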

Keywords: artificial intelligence; cancer; diagnostics; machine learning; semi-supervised learning.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Inputs and outputs of supervised, unsupervised and semi-supervised learning. In supervised learning (A), all data is labeled. Labels are used to train a classifier that maps learned labels to previously unseen data. Unsupervised learning (B) does not use labels; data is clustered into groups based on inherent patterns. Semi-supervised learning (C) uses both labeled and unlabeled data. Labels are used to train a classifier, which is augmented with unlabeled data from the same distribution to derive additional information and boost performance.
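
To make the three panels concrete, here is a hedged side-by-side sketch in Python; the toy data, the chosen estimators, and scikit-learn's convention of marking unlabeled samples with -1 are our assumptions for illustration:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.linear_model import LogisticRegression
    from sklearn.semi_supervised import LabelSpreading

    X, y = make_blobs(n_samples=300, centers=2, random_state=0)

    # (A) Supervised: every sample carries a label.
    supervised = LogisticRegression().fit(X, y)

    # (B) Unsupervised: no labels; samples are grouped by inherent structure.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # (C) Semi-supervised: a handful of labels, the rest marked -1 (unlabeled);
    # label information propagates to nearby points from the same distribution.
    y_semi = np.full_like(y, -1)
    y_semi[:10] = y[:10]
    ssl = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_semi)
    print("labels inferred for unlabeled points:", ssl.transduction_[10:20])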
Figure 2
How does unlabeled data boost classification performance? Consider n features at the input level, corresponding to an n-dimensional feature space. In such an n-dimensional coordinate system, every input is located according to its feature vector given by its n features and can thus be sorted by similarity to and difference from other inputs, represented by the proximity or distance between points in the feature space. For clarity, we consider only two features (x, y) in a two-dimensional feature space. When labeled data is sparse (A), as is often the case in medical data sets, the decision boundary of a classifier is less constrained. This may lead to inaccuracies and poor generalization on external data. When many labels are given (B), the decision boundary is more constrained, yielding a more accurate classifier that can potentially generalize better. However, manually labeling such large data sets is often time- and cost-intensive. Unlabeled data, by contrast, is often available in abundance (C) and can constrain the decision boundary of a classifier much as a large labeled data set would, but without the need for extensive labeling; the decision boundary then lies in a low-density region. Nevertheless, as can be derived from (B) and (C), the performance gap between supervised and semi-supervised learning shrinks as the amount of labeled data grows if no further unlabeled samples are provided.
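
The shrinking gap noted at the end of the caption can be reproduced on toy data. The following sketch (our construction; the data set, estimator, and label budgets are assumptions) compares a purely supervised SVM with a self-trained one at a sparse and at a plentiful label budget:

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.metrics import accuracy_score
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=2000, noise=0.2, random_state=1)
    X_tr, y_tr, X_te, y_te = X[:1000], y[:1000], X[1000:], y[1000:]

    for n_labeled in (20, 500):  # sparse vs. plentiful labels
        # Supervised baseline: trained on the labeled subset only.
        sl = SVC(probability=True).fit(X_tr[:n_labeled], y_tr[:n_labeled])
        # Semi-supervised: same labels plus the remaining samples as unlabeled.
        y_semi = np.full(1000, -1)
        y_semi[:n_labeled] = y_tr[:n_labeled]
        ssl = SelfTrainingClassifier(SVC(probability=True)).fit(X_tr, y_semi)
        print(n_labeled, "labels | SL:",
              round(accuracy_score(y_te, sl.predict(X_te)), 3),
              "| SSL:", round(accuracy_score(y_te, ssl.predict(X_te)), 3))

With few labels the semi-supervised model typically benefits from the unlabeled pool, while at large label budgets the two scores tend to converge, mirroring panels (B) and (C).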

