Review

Entropy (Basel). 2022 Apr 14;24(4):551. doi: 10.3390/e24040551.

Survey on Self-Supervised Learning: Auxiliary Pretext Tasks and Contrastive Learning Methods in Imaging

Saleh Albelwi

Abstract

Although deep learning algorithms have achieved significant progress in a variety of domains, they require costly annotations on huge datasets. Self-supervised learning (SSL) using unlabeled data has emerged as an alternative, as it eliminates manual annotation. To do this, SSL constructs feature representations using pretext tasks that operate without manual annotation, which allows models trained on these tasks to extract useful latent representations that later improve downstream tasks such as object classification and detection. The early methods of SSL are based on auxiliary pretext tasks as a way to learn representations using pseudo-labels, or labels created automatically from the dataset's attributes. Contrastive learning has also performed well in learning representations via SSL; to do so, it pushes positive samples closer together, and negative ones further apart, in the latent space. This paper provides a comprehensive literature review of the top-performing SSL methods using auxiliary pretext and contrastive learning techniques. It details the motivation for this research, a general pipeline of SSL, and the terminology of the field, and provides an examination of pretext tasks and self-supervised methods. It also examines how self-supervised methods compare to supervised ones, and then discusses both further considerations and ongoing challenges faced by SSL.
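
To make the contrastive principle above concrete, the following minimal sketch (PyTorch; names, shapes, and the temperature value are illustrative assumptions, not the exact loss of any one surveyed method) pulls each sample toward its positive and pushes it away from the other samples in the batch.

# A minimal sketch of an InfoNCE-style contrastive objective.
import torch
import torch.nn.functional as F

def info_nce_loss(z_anchor: torch.Tensor, z_positive: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """Pull each anchor toward its positive; treat all other rows as negatives."""
    z_anchor = F.normalize(z_anchor, dim=1)
    z_positive = F.normalize(z_positive, dim=1)
    logits = z_anchor @ z_positive.t() / temperature        # cosine similarities, (N, N)
    targets = torch.arange(z_anchor.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)                 # diagonal entries are the positives

# Example with random embeddings standing in for encoder outputs.
loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))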

Keywords: auxiliary pretext tasks; contrastive learning; contrastive loss; data augmentation; downstream tasks; encoder; pretext tasks; self-supervised learning (SSL).

Conflict of interest statement

The author declares no conflict of interest.

Figures

Figure 1
The workflows of SSL and TL. The two workflows are similar, with only slight differences. The key difference is that TL pre-trains on labeled data, whereas SSL uses unlabeled data to learn features, as shown in the first step. In the second step, SSL and TL are the same: both are further trained on the target task, which requires only a small number of labeled examples.
Figure 2
Several examples of pretext tasks. Pretext tasks easily generate pseudo-labels from the data (images) themselves. Solving pretext tasks allows the model to extract useful latent representations that later improve downstream tasks.
Figure 3
Different methods of data augmentation. It is common practice to combine multiple types of data augmentation (e.g., cropping, resizing, and flipping) for higher-quality learning and better latent features.
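
As a concrete illustration of such a composition, the sketch below (torchvision; the particular transforms and parameter values are assumptions, not those prescribed by the survey) chains cropping with resizing, flipping, and color jitter; applying it twice to the same image yields the two correlated views used by contrastive methods.

# One common way to compose the augmentations mentioned above.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),           # random crop, then resize to 224x224
    transforms.RandomHorizontalFlip(p=0.5),      # random horizontal flip
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),  # brightness/contrast/saturation/hue
    transforms.ToTensor(),
])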
Figure 4
Different models of context prediction. (a) A pair of patches extracted randomly from each image trains the CNN to identify a neighboring patch’s position relative to the initial patch; the weights between both CNNs are shared. (b) Learning representations by solving jigsaw puzzles with 3 × 3 patches: the original image is shuffled into tiles using a pre-defined permutation, and the shuffled patches are fed into a CNN trained to recognize the permutation. (c) An illustration of SSL using the rotation of an input image; the model learns to predict the correct rotation from four possible angles (0, 90, 180, or 270 degrees). (d) Object counting as a pretext task for learning feature representations, training a CNN architecture to count.
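
As an example of how a pretext task such as the rotation prediction in panel (c) creates pseudo-labels automatically, the sketch below (PyTorch; the function name is a hypothetical helper) rotates every image by 0, 90, 180, and 270 degrees and uses the rotation index as the label for a standard classifier.

# Generating rotation pseudo-labels from an unlabeled image batch.
import torch

def make_rotation_batch(images: torch.Tensor):
    """images: (N, C, H, W). Returns rotated images and pseudo-labels in {0, 1, 2, 3}."""
    rotated, labels = [], []
    for k in range(4):                                        # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# A CNN trained with cross-entropy on (rotated, labels) learns features
# without any manual annotation.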
Figure 5
Colorization as a pretext task for learning representation. The CNN is trained to produce real-color images from a grayscale input image.
Figure 6
The architecture of a BiGAN. In this technique, z and E(x) have the same dimensions, as do G(z) and x. The concatenated pairs [G(z), z] and [x, E(x)] are the two inputs of the discriminator D. Both the generator G and the encoder E are optimized using the loss produced by the discriminator D.
Figure 7
The architecture of the context encoder. It is a simple encoder–decoder pipeline.
Figure 8
A split-brain autoencoder architecture. The model comprises two sub-networks, F1 and F2; each is trained to predict the portion of the data seen by the other, so that each network’s hypothesis complements its own. Combining both hypotheses predicts the full image.
Figure 9
An illustration of the DeepCluster pipeline for learning representations.
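
A condensed sketch of one DeepCluster iteration is given below: the current CNN features are clustered with k-means and the cluster indices become pseudo-labels for the next training pass. The scikit-learn backend and the cluster count are assumptions chosen for brevity; the original work uses a faiss-based k-means.

# Pseudo-labels from k-means over the dataset's current features.
import numpy as np
from sklearn.cluster import KMeans

def pseudo_labels_from_features(features: np.ndarray, n_clusters: int = 100):
    """features: (N, D) array of CNN features for the whole dataset."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    return kmeans.labels_                      # one pseudo-label per image

# These labels supervise a standard classification loss on the CNN, and the
# clustering is repeated after each epoch as the features improve.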
Figure 10
The structure of SimCLR. Data augmentation T(.) is applied to the input image x to generate two augmented images x1+ and x2+. A base encoder network f(.) and a projection head g(.) are trained to maximize the similarity between the augmented images using a contrastive loss. After training is complete, the representation h is used for downstream tasks.
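
The encoder/projection-head split described above can be sketched as follows (PyTorch; the ResNet-50 backbone matches common SimCLR setups, but the projection-head sizes here are assumptions): h is the representation kept for downstream tasks, while z is used only by the contrastive loss.

# A rough sketch of the base encoder f(.) and projection head g(.).
import torch.nn as nn
from torchvision.models import resnet50

class SimCLRModel(nn.Module):
    def __init__(self, proj_dim: int = 128):
        super().__init__()
        backbone = resnet50(weights=None)
        feat_dim = backbone.fc.in_features    # 2048 for ResNet-50
        backbone.fc = nn.Identity()           # keep the representation h, drop the classifier
        self.f = backbone                     # base encoder f(.)
        self.g = nn.Sequential(               # projection head g(.)
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.f(x)                         # reused for downstream tasks after training
        z = self.g(h)                         # fed to the contrastive loss during pretraining
        return h, z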
Figure 11
The structure of MoCo. MoCo uses two encoders: a query encoder and a momentum encoder.
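
The momentum encoder is kept as a slowly moving copy of the query encoder rather than being trained by gradients; a minimal sketch of that update is shown below (the tiny stand-in encoder is illustrative, and the momentum coefficient of 0.999 follows the MoCo paper).

# Momentum update of the key (momentum) encoder from the query encoder.
import copy
import torch
import torch.nn as nn

encoder_q = nn.Sequential(nn.Linear(32, 16))        # stand-in query encoder
encoder_k = copy.deepcopy(encoder_q)                # momentum (key) encoder
for p in encoder_k.parameters():
    p.requires_grad = False                         # keys receive no gradients

@torch.no_grad()
def momentum_update(m: float = 0.999):
    for q, k in zip(encoder_q.parameters(), encoder_k.parameters()):
        k.data.mul_(m).add_(q.data, alpha=1.0 - m)  # k = m * k + (1 - m) * q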
Figure 12
A BYOL architecture. BYOL minimizes a similarity loss between qθ(zθ) and sg(zξ′), where θ represents the trained weights, ξ represents an exponential moving average of θ, and sg denotes the stop-gradient. After training, everything but fθ is discarded, and yθ is used as the image representation.
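
The objective described in the caption can be written compactly: the online prediction is regressed onto the stop-gradient target after L2 normalization, which equals a mean squared error between unit vectors. The sketch below is a minimal illustration with assumed shapes, not the full BYOL training loop.

# BYOL-style regression loss with a stop-gradient on the target branch.
import torch
import torch.nn.functional as F

def byol_loss(online_pred: torch.Tensor, target_proj: torch.Tensor) -> torch.Tensor:
    p = F.normalize(online_pred, dim=1)
    z = F.normalize(target_proj.detach(), dim=1)     # detach acts as the stop-gradient sg(.)
    return (2 - 2 * (p * z).sum(dim=1)).mean()       # equals MSE between unit vectors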
Figure 13
The architecture of SimSiam. Two augmented images are passed through the same encoder, which consists of a backbone (ResNet) and a projection MLP. A prediction MLP (h) is applied on one side, and a stop-gradient is applied on the other to avoid collapse. The model aims to maximize the similarity between both views. SimSiam uses neither negative pairs nor a momentum encoder.
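
The prediction-MLP and stop-gradient arrangement corresponds to the symmetric loss sketched below (PyTorch; the encoder and predictor definitions are omitted, and names are illustrative): each view's prediction is compared to the detached projection of the other view with negative cosine similarity.

# SimSiam-style symmetric negative-cosine loss with stop-gradient.
import torch
import torch.nn.functional as F

def neg_cosine(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    return -F.cosine_similarity(p, z.detach(), dim=1).mean()   # detach = stop-gradient

def simsiam_loss(p1, z1, p2, z2):
    # p_i: prediction-MLP output for view i; z_i: projection for view i.
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)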
Figure 14
The structure of SwAV. It assigns a code to an augmented image and then predicts that code from a second augmentation of the same image. Unlike contrastive learning, SwAV does not compare features directly; instead, it enforces consistency between the codes computed from different augmentations.
Figure 15
ImageNet top-1 accuracy of linear classifiers trained on feature representations created via self-supervised techniques with different widths of ResNet-50. All models were pretrained on ImageNet.

