Review

Semi-supervised learning in cancer diagnostics

Jan-Niklas Eckardt et al. Front Oncol. 2022 Jul 14;12:960984. doi: 10.3389/fonc.2022.960984. eCollection 2022.

Abstract

In cancer diagnostics, a considerable amount of data is acquired during routine work-up. Recently, machine learning has been used to build classifiers tasked with cancer detection that aid in clinical decision-making. Most of these classifiers are based on supervised learning (SL), which requires time- and cost-intensive manual labeling of samples by medical experts for model training. Semi-supervised learning (SSL), in contrast, works with only a fraction of labeled data by including unlabeled samples for information abstraction, and can thus exploit the vast discrepancy between the labeled data available in cancer diagnostics and the overall data available. In this review, we provide a comprehensive overview of the essential functionalities and assumptions of SSL and survey key studies in cancer care, differentiating between image-based and non-image-based applications. We highlight current state-of-the-art models in histopathology, radiology and radiotherapy, as well as genomics. Further, we discuss potential pitfalls in SSL study design, such as discrepancies in data distributions and comparisons to baseline SL models, and point out future directions for SSL in oncology. We believe that well-designed SSL models can contribute strongly to computer-guided diagnostics in malignant disease by overcoming the current hindrance of sparse labeled and abundant unlabeled data.
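
As an illustration of the workflow just described, the sketch below trains a classifier with only a small fraction of labels and a large unlabeled pool. It is a minimal toy example under assumed choices (scikit-learn, a synthetic two-moons data set, an SVM base learner, self-training), not a method taken from the reviewed studies:

    # Minimal SSL sketch; toy data and estimator choices are assumptions.
    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Hide 95% of the training labels (-1 marks "unlabeled" in scikit-learn),
    # mimicking the sparse-label regime typical of medical data sets.
    rng = np.random.default_rng(0)
    y_semi = y_train.copy()
    y_semi[rng.random(len(y_semi)) < 0.95] = -1

    # Self-training: the base classifier iteratively pseudo-labels confident
    # unlabeled samples and is retrained on the enlarged labeled set.
    model = SelfTrainingClassifier(SVC(probability=True), threshold=0.9)
    model.fit(X_train, y_semi)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))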

Keywords: artificial intelligence; cancer; diagnostics; machine learning; semi-supervised learning.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Inputs and outputs of supervised, unsupervised and semi-supervised learning. In supervised learning (A), all data is labeled. Labels are used to train a classifier that maps learned labels to previously unseen data. Unsupervised learning (B) does not use labels; data is clustered into groups based on inherent patterns. Semi-supervised learning (C) uses both labeled and unlabeled data. Labels are used to train a classifier, which is augmented with unlabeled data from the same distribution to derive additional information and boost performance.
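
To make the three panels concrete, here is a hedged side-by-side sketch in Python; the toy data, the chosen estimators, and scikit-learn's convention of marking unlabeled samples with -1 are our assumptions for illustration:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.linear_model import LogisticRegression
    from sklearn.semi_supervised import LabelSpreading

    X, y = make_blobs(n_samples=300, centers=2, random_state=0)

    # (A) Supervised: every sample carries a label.
    supervised = LogisticRegression().fit(X, y)

    # (B) Unsupervised: no labels; samples are grouped by inherent structure.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # (C) Semi-supervised: a handful of labels, the rest marked -1 (unlabeled);
    # label information propagates to nearby points from the same distribution.
    y_semi = np.full_like(y, -1)
    y_semi[:10] = y[:10]
    ssl = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_semi)
    print("labels inferred for unlabeled points:", ssl.transduction_[10:20])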
Figure 2
How does unlabeled data boost classification performance? Consider n features at the input level, corresponding to an n-dimensional feature space. In such an n-dimensional coordinate system, every input is located according to its feature vector given by its n features and can thus be sorted by similarity to and difference from other inputs, represented by the proximity or distance between points in the feature space. For clarity, we consider only two features (x, y) in a two-dimensional feature space. When labeled data is sparse (A), as is often the case in medical data sets, the decision boundary of a classifier is less constrained. This may lead to inaccuracies and poor generalization on external data. When many labels are given (B), the decision boundary is more constrained, yielding a more accurate classifier that can potentially generalize better. However, manually labeling such large data sets is often time- and cost-intensive. Unlabeled data, by contrast, is often available in abundance (C) and can constrain the decision boundary of a classifier much as a large labeled data set would, but without the need for extensive labeling; the decision boundary then lies in a low-density region. Nevertheless, as can be derived from (B) and (C), the performance gap between supervised and semi-supervised learning shrinks as the amount of labeled data grows if no further unlabeled samples are provided.
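
The shrinking gap noted at the end of the caption can be reproduced on toy data. The following sketch (our construction; the data set, estimator, and label budgets are assumptions) compares a purely supervised SVM with a self-trained one at a sparse and at a plentiful label budget:

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.metrics import accuracy_score
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=2000, noise=0.2, random_state=1)
    X_tr, y_tr, X_te, y_te = X[:1000], y[:1000], X[1000:], y[1000:]

    for n_labeled in (20, 500):  # sparse vs. plentiful labels
        # Supervised baseline: trained on the labeled subset only.
        sl = SVC(probability=True).fit(X_tr[:n_labeled], y_tr[:n_labeled])
        # Semi-supervised: same labels plus the remaining samples as unlabeled.
        y_semi = np.full(1000, -1)
        y_semi[:n_labeled] = y_tr[:n_labeled]
        ssl = SelfTrainingClassifier(SVC(probability=True)).fit(X_tr, y_semi)
        print(n_labeled, "labels | SL:",
              round(accuracy_score(y_te, sl.predict(X_te)), 3),
              "| SSL:", round(accuracy_score(y_te, ssl.predict(X_te)), 3))

With few labels the semi-supervised model typically benefits from the unlabeled pool, while at large label budgets the two scores tend to converge, mirroring panels (B) and (C).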

