Review

Entropy (Basel). 2022 Apr 14;24(4):551. doi: 10.3390/e24040551.

Survey on Self-Supervised Learning: Auxiliary Pretext Tasks and Contrastive Learning Methods in Imaging

Saleh Albelwi

Abstract

Although deep learning algorithms have achieved significant progress in a variety of domains, they require costly annotations on huge datasets. Self-supervised learning (SSL) using unlabeled data has emerged as an alternative, as it eliminates manual annotation. To do this, SSL constructs feature representations using pretext tasks that operate without manual annotation, which allows models trained on these tasks to extract useful latent representations that later improve downstream tasks such as object classification and detection. The early methods of SSL are based on auxiliary pretext tasks as a way to learn representations using pseudo-labels, or labels created automatically from the dataset's attributes. Contrastive learning has also performed well in learning representations via SSL; to do so, it pushes positive samples closer together, and negative ones further apart, in the latent space. This paper provides a comprehensive literature review of the top-performing SSL methods using auxiliary pretext and contrastive learning techniques. It details the motivation for this research, a general pipeline of SSL, and the terminology of the field, and provides an examination of pretext tasks and self-supervised methods. It also examines how self-supervised methods compare to supervised ones, and then discusses both further considerations and ongoing challenges faced by SSL.
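
To make the contrastive principle above concrete, the following minimal sketch (PyTorch; names, shapes, and the temperature value are illustrative assumptions, not the exact loss of any one surveyed method) pulls each sample toward its positive and pushes it away from the other samples in the batch.

# A minimal sketch of an InfoNCE-style contrastive objective.
import torch
import torch.nn.functional as F

def info_nce_loss(z_anchor: torch.Tensor, z_positive: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """Pull each anchor toward its positive; treat all other rows as negatives."""
    z_anchor = F.normalize(z_anchor, dim=1)
    z_positive = F.normalize(z_positive, dim=1)
    logits = z_anchor @ z_positive.t() / temperature        # cosine similarities, (N, N)
    targets = torch.arange(z_anchor.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)                 # diagonal entries are the positives

# Example with random embeddings standing in for encoder outputs.
loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))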

Keywords: auxiliary pretext tasks; contrastive learning; contrastive loss; data augmentation; downstream tasks; encoder; pretext tasks; self-supervised learning (SSL).

Conflict of interest statement

The author declares no conflict of interest.

Figures

Figure 1
The workflows of SSL and TL. The two workflows are similar, with only slight differences. The key difference is that TL pre-trains on labeled data, whereas SSL uses unlabeled data to learn features, as shown in the first step. In the second step, SSL and TL are the same: both are further trained on the target task, which requires only a small number of labeled examples.
Figure 2
Several examples of pretext tasks. Pretext tasks easily generate pseudo-labels from the data (images) themselves. Solving pretext tasks allows the model to extract useful latent representations that later improve downstream tasks.
Figure 3
Different methods of data augmentation. It is common practice to combine multiple types of data augmentation (e.g., cropping, resizing, and flipping) for higher-quality learning and better latent features.
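
As a concrete illustration of such a composition, the sketch below (torchvision; the particular transforms and parameter values are assumptions, not those prescribed by the survey) chains cropping with resizing, flipping, and color jitter; applying it twice to the same image yields the two correlated views used by contrastive methods.

# One common way to compose the augmentations mentioned above.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),           # random crop, then resize to 224x224
    transforms.RandomHorizontalFlip(p=0.5),      # random horizontal flip
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),  # brightness/contrast/saturation/hue
    transforms.ToTensor(),
])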
Figure 4
Different models of context prediction. (a) A pair of patches extracted randomly from each image trains the CNN to identify a neighboring patch’s position relative to the initial patch; the weights between both CNNs are shared. (b) Learning representations by solving jigsaw puzzles with 3 × 3 patches: the original image is shuffled into tiles using a pre-defined permutation, and the shuffled patches are fed into a CNN trained to recognize the permutation. (c) An illustration of SSL using the rotation of an input image; the model learns to predict the correct rotation from four possible angles (0, 90, 180, or 270 degrees). (d) Object counting as a pretext task for learning feature representations, training a CNN architecture to count.
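
As an example of how a pretext task such as the rotation prediction in panel (c) creates pseudo-labels automatically, the sketch below (PyTorch; the function name is a hypothetical helper) rotates every image by 0, 90, 180, and 270 degrees and uses the rotation index as the label for a standard classifier.

# Generating rotation pseudo-labels from an unlabeled image batch.
import torch

def make_rotation_batch(images: torch.Tensor):
    """images: (N, C, H, W). Returns rotated images and pseudo-labels in {0, 1, 2, 3}."""
    rotated, labels = [], []
    for k in range(4):                                        # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# A CNN trained with cross-entropy on (rotated, labels) learns features
# without any manual annotation.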
Figure 5
Colorization as a pretext task for learning representation. The CNN is trained to produce real-color images from a grayscale input image.
Figure 6
The architecture of a BiGAN. In this technique, z and E(x) have the same dimensions, as do G(z) and x. The concatenated pairs [G(z), z] and [x, E(x)] are the two inputs of the discriminator D. Both the generator G and the encoder E are optimized using the loss produced by the discriminator D.
Figure 7
The architecture of the context encoder. It is a simple encoder–decoder pipeline.
Figure 8
A split-brain autoencoder architecture. The model comprises two sub-networks, F1 and F2; each is trained to predict the portion of the data seen by the other, so that each network’s hypothesis complements its own. Combining both hypotheses predicts the full image.
Figure 9
An illustration of the DeepCluster pipeline for learning representations.
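
A condensed sketch of one DeepCluster iteration is given below: the current CNN features are clustered with k-means and the cluster indices become pseudo-labels for the next training pass. The scikit-learn backend and the cluster count are assumptions chosen for brevity; the original work uses a faiss-based k-means.

# Pseudo-labels from k-means over the dataset's current features.
import numpy as np
from sklearn.cluster import KMeans

def pseudo_labels_from_features(features: np.ndarray, n_clusters: int = 100):
    """features: (N, D) array of CNN features for the whole dataset."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    return kmeans.labels_                      # one pseudo-label per image

# These labels supervise a standard classification loss on the CNN, and the
# clustering is repeated after each epoch as the features improve.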
Figure 10
The structure of SimCLR. Data augmentation T(.) is applied to the input image x to generate two augmented images x1+ and x2+. A base encoder network f(.) and a projection head g(.) are trained to maximize the similarity between the augmented images using a contrastive loss. After training is complete, the representation h is used for downstream tasks.
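
The encoder/projection-head split described above can be sketched as follows (PyTorch; the ResNet-50 backbone matches common SimCLR setups, but the projection-head sizes here are assumptions): h is the representation kept for downstream tasks, while z is used only by the contrastive loss.

# A rough sketch of the base encoder f(.) and projection head g(.).
import torch.nn as nn
from torchvision.models import resnet50

class SimCLRModel(nn.Module):
    def __init__(self, proj_dim: int = 128):
        super().__init__()
        backbone = resnet50(weights=None)
        feat_dim = backbone.fc.in_features    # 2048 for ResNet-50
        backbone.fc = nn.Identity()           # keep the representation h, drop the classifier
        self.f = backbone                     # base encoder f(.)
        self.g = nn.Sequential(               # projection head g(.)
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.f(x)                         # reused for downstream tasks after training
        z = self.g(h)                         # fed to the contrastive loss during pretraining
        return h, z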
Figure 11
The structure of MoCo. MoCo uses two encoders: a query encoder and a momentum encoder.
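
The momentum encoder is kept as a slowly moving copy of the query encoder rather than being trained by gradients; a minimal sketch of that update is shown below (the tiny stand-in encoder is illustrative, and the momentum coefficient of 0.999 follows the MoCo paper).

# Momentum update of the key (momentum) encoder from the query encoder.
import copy
import torch
import torch.nn as nn

encoder_q = nn.Sequential(nn.Linear(32, 16))        # stand-in query encoder
encoder_k = copy.deepcopy(encoder_q)                # momentum (key) encoder
for p in encoder_k.parameters():
    p.requires_grad = False                         # keys receive no gradients

@torch.no_grad()
def momentum_update(m: float = 0.999):
    for q, k in zip(encoder_q.parameters(), encoder_k.parameters()):
        k.data.mul_(m).add_(q.data, alpha=1.0 - m)  # k = m * k + (1 - m) * q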
Figure 12
A BYOL architecture. BYOL minimizes a similarity loss between qθ(zθ) and sg(zξ′), where θ represents the trained weights, ξ represents an exponential moving average of θ, and sg denotes the stop-gradient. After training, everything but fθ is discarded, and yθ is used as the image representation.
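
The objective described in the caption can be written compactly: the online prediction is regressed onto the stop-gradient target after L2 normalization, which equals a mean squared error between unit vectors. The sketch below is a minimal illustration with assumed shapes, not the full BYOL training loop.

# BYOL-style regression loss with a stop-gradient on the target branch.
import torch
import torch.nn.functional as F

def byol_loss(online_pred: torch.Tensor, target_proj: torch.Tensor) -> torch.Tensor:
    p = F.normalize(online_pred, dim=1)
    z = F.normalize(target_proj.detach(), dim=1)     # detach acts as the stop-gradient sg(.)
    return (2 - 2 * (p * z).sum(dim=1)).mean()       # equals MSE between unit vectors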
Figure 13
The architecture of SimSiam. Two augmented images are passed through the same encoder, which consists of a backbone (ResNet) and a projection MLP. A prediction MLP (h) is applied on one side, and a stop-gradient is applied on the other to avoid collapse. The model aims to maximize the similarity between both views. SimSiam uses neither negative pairs nor a momentum encoder.
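
The prediction-MLP and stop-gradient arrangement corresponds to the symmetric loss sketched below (PyTorch; the encoder and predictor definitions are omitted, and names are illustrative): each view's prediction is compared to the detached projection of the other view with negative cosine similarity.

# SimSiam-style symmetric negative-cosine loss with stop-gradient.
import torch
import torch.nn.functional as F

def neg_cosine(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    return -F.cosine_similarity(p, z.detach(), dim=1).mean()   # detach = stop-gradient

def simsiam_loss(p1, z1, p2, z2):
    # p_i: prediction-MLP output for view i; z_i: projection for view i.
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)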
Figure 14
The structure of SwAV. It assigns a code to an augmented image and then predicts that code from a second augmentation of the same image. Unlike contrastive learning, SwAV does not compare features directly; instead, it enforces consistency between the codes computed from different augmentations.
Figure 15
ImageNet top-1 accuracy of linear classifiers trained on feature representations created via self-supervised techniques with different widths of ResNet-50. All models were pretrained on ImageNet.

