Review

Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives

Jun Li et al. Med Image Anal. 2023 Apr;85:102762. doi: 10.1016/j.media.2023.102762. Epub 2023 Jan 31.

Abstract

The Transformer, one of the latest technological advances in deep learning, has gained prevalence in natural language processing and computer vision. Since medical imaging bears some resemblance to computer vision, it is natural to inquire about the status quo of Transformers in medical imaging and ask: can Transformer models transform medical imaging? In this paper, we attempt to answer this question. After a brief introduction to the fundamentals of Transformers, especially in comparison with convolutional neural networks (CNNs), and a discussion of the key defining properties that characterize Transformers, we offer a comprehensive review of state-of-the-art Transformer-based approaches for medical imaging and present the current research progress in medical image segmentation, recognition, detection, registration, reconstruction, enhancement, etc. What distinguishes our review is its organization around the Transformer's key defining properties, which are mostly derived from comparing Transformers and CNNs, and around architecture type, which specifies the manner in which the Transformer and CNN are combined, both of which help readers understand the rationale behind the reviewed approaches. We conclude with a discussion of future perspectives.

Keywords: Medical imaging; Survey; Transformer.


Conflict of interest statement

Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Fig. 1.
Details of a self-attention mechanism (left) and a multi-head self-attention (MSA) (right). Compared to self-attention, the MSA conducts several attention modules in parallel. The independent attention features are then concatenated and linearly transformed to the output.
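For readers less familiar with the mechanism in this caption, the following is a minimal PyTorch sketch of scaled dot-product self-attention extended to multiple heads, with the independent heads concatenated and linearly projected to the output. The dimensions and head count are illustrative assumptions, not settings from any reviewed model.

```python
# Minimal sketch of multi-head self-attention (MSA); sizes are illustrative.
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim=64, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)   # joint projection to queries, keys, values
        self.proj = nn.Linear(dim, dim)      # linear transform of the concatenated heads

    def forward(self, x):                    # x: (batch, tokens, dim)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale    # scaled dot-product scores
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)  # concatenate the heads
        return self.proj(out)

x = torch.randn(2, 16, 64)                   # 2 samples, 16 tokens, 64-dim embeddings
print(MultiHeadSelfAttention()(x).shape)     # torch.Size([2, 16, 64])
```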
Fig. 2.
Overview of the Vision Transformer (left) and illustration of the Transformer encoder (right). The image is partitioned into several fixed-size patches, which are then treated as a token sequence and processed with an efficient Transformer implementation from NLP.
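Below is a minimal sketch of this patch-embedding idea, assuming a 16x16 patch size and a small embedding dimension for illustration. The non-overlapping convolution plays the role of "flatten each patch and project it linearly", and the resulting token sequence is fed to a stock PyTorch Transformer encoder; none of these settings are the configurations of the reviewed models.

```python
# Minimal ViT-style patch embedding + Transformer encoder sketch.
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 128
# A non-overlapping convolution is equivalent to flattening each patch and projecting it.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=4,
)

img = torch.randn(1, 3, 224, 224)                      # one RGB image
tokens = patch_embed(img).flatten(2).transpose(1, 2)   # (1, 196, 128): 14x14 patches as tokens
pos = nn.Parameter(torch.zeros(1, tokens.shape[1], embed_dim))  # learnable position embedding
out = encoder(tokens + pos)                            # (1, 196, 128)
print(out.shape)
```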
Fig. 3.
Taxonomy of typical approaches in combining CNNs and Transformer.
Fig. 4.
Effective receptive fields (ERFs) (Luo et al., 2016) of the well-known CNN, U-Net (Ronneberger et al., 2015), versus the hybrid Transformer-CNN models, including UNETR (Hatamizadeh et al., 2019), Medical Transformer (Valanarasu et al., 2021), TransMorph (Chen et al., 2022b), and ReconFormer (Guo et al., 2022d). The ERFs are computed at the last layer of the model prior to the output. The γ correction of γ = 0.4 was applied to the ERFs for better visualization. Despite the fact that its theoretical receptive field encompasses the whole image, the pure CNN model, U-Net (Ronneberger et al., 2015), has a limited ERF, with gradient magnitude rapidly decreasing away from the center. On the other hand, all Transformer-based models have large ERFs that span over the entire image.
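The following sketch shows one way an ERF map can be estimated in the spirit of Luo et al. (2016): back-propagate a gradient from the centre output location and visualize the gradient magnitude over the input, applying the γ = 0.4 correction mentioned in the caption. The tiny CNN here is a placeholder assumption, not any of the reviewed models.

```python
# Estimate an effective receptive field via the input gradient of the centre output pixel.
import torch
import torch.nn as nn

model = nn.Sequential(*[nn.Conv2d(1, 1, 3, padding=1) for _ in range(5)])  # toy CNN

x = torch.randn(1, 1, 64, 64, requires_grad=True)
y = model(x)
y[0, 0, 32, 32].backward()                 # seed a gradient at the centre output location

erf = x.grad.abs()[0, 0]                   # gradient magnitude on the input
erf = (erf / erf.max()) ** 0.4             # gamma = 0.4 correction, as in Fig. 4
print(erf.shape)                           # (64, 64) map; the bright region is the ERF
```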
Fig. 5.
Loss landscapes for the models based on CNNs versus Transformers. The left and right panels depict, respectively, the loss landscapes for registration and segmentation models. The left panel shows loss landscapes generated based on normalized cross-correlation loss and a diffusion regularizer; the right panel shows loss landscapes created based on a combination of Dice and cross-entropy losses. Transformer-based models, such as (b) TransMorph (Chen et al., 2022b) and (d) UNETR (Hatamizadeh et al., 2022b), exhibit flatter loss landscapes than CNN-based models, such as (a) VoxelMorph (Balakrishnan et al., 2019) and (c) U-Net (Ronneberger et al., 2015).
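The caption does not spell out how these surfaces were generated; a common recipe, sketched below under that assumption, is to perturb the trained weights along two random directions and evaluate the loss on a grid. The model, data, and loss here are placeholders, not the registration or segmentation setups shown in the figure.

```python
# Sketch of a 2-D loss landscape: evaluate the loss on a grid of weight perturbations.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # stand-in "trained" model
data, target = torch.randn(128, 10), torch.randn(128, 1)
loss_fn = nn.MSELoss()

theta = [p.detach().clone() for p in model.parameters()]
d1 = [torch.randn_like(p) for p in theta]      # two random perturbation directions
d2 = [torch.randn_like(p) for p in theta]

grid = torch.linspace(-1, 1, 21)
landscape = torch.zeros(len(grid), len(grid))
for i, a in enumerate(grid):
    for j, b in enumerate(grid):
        with torch.no_grad():
            for p, t, u, v in zip(model.parameters(), theta, d1, d2):
                p.copy_(t + a * u + b * v)     # move the weights within the 2-D plane
            landscape[i, j] = loss_fn(model(data), target)
print(landscape.shape)                         # 21x21 loss surface for plotting
```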
Fig. 6.
(a) The number of papers accepted to the MICCAI conference from 2020 to 2022 whose titles included the word "Transformer". (b) Sources of all 114 selected papers.
Fig. 7.
An overview of Transformers applied to medical imaging tasks, including segmentation, recognition & classification, detection, registration, reconstruction, and enhancement.
Fig. 8.
Typical Transformer-based U-shaped segmentation model architectures. (a) The TransUNet (Chen et al., 2021d)-like structure uses a Transformer as an additional encoder that models the bottleneck features. (b) Swin UNETR (Tang et al., 2022) uses a Transformer as the main encoder and a CNN decoder to construct the hybrid network. (c) TransFuse (Zhang et al., 2021b) fuses CNN and Transformer encoders before the decoder. (d) The nnFormer (Zhou et al., 2021a)-like structure uses a pure Transformer for both the encoder and the decoder.
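As a concrete illustration of panel (a), the following is a minimal sketch of a TransUNet-like hybrid: a small CNN encoder, a Transformer applied to the flattened bottleneck features, and a CNN decoder with a skip connection. The class name HybridSegNet, channel counts, and depths are illustrative assumptions, not the published configurations.

```python
# Minimal hybrid CNN-Transformer U-shaped segmentation sketch (TransUNet-like).
import torch
import torch.nn as nn

class HybridSegNet(nn.Module):
    def __init__(self, in_ch=1, num_classes=2, dim=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU())
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.up = nn.ConvTranspose2d(dim, 32, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, num_classes, 1))

    def forward(self, x):
        s1 = self.enc1(x)                           # high-resolution CNN features
        b = self.down(s1)                           # bottleneck feature map
        B, C, H, W = b.shape
        tokens = b.flatten(2).transpose(1, 2)       # (B, H*W, C) token sequence
        tokens = self.transformer(tokens)           # global context via self-attention
        b = tokens.transpose(1, 2).reshape(B, C, H, W)
        u = self.up(b)                              # upsample back to the skip resolution
        return self.dec(torch.cat([u, s1], dim=1))  # fuse the skip connection, predict classes

print(HybridSegNet()(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 2, 64, 64])
```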
Fig. 9.
Visualization of CT/MRI segmentation and comparison on public datasets between Transformer-based and baseline models. Transformer-based models include Swin UNETR (Tang et al., 2022), T-AutoML (Yang et al., 2021), TransBTSV2 (Li et al., 2022c), AFTer-UNet (Yan et al., 2022), U-Transformer (Petit et al., 2021), UNesT (Yu et al., 2022c), BiTr-Unet (Jia and Shu, 2021), UTNet (Gao et al., 2021b), nnFormer (Zhou et al., 2021a), MOCOv3 (Chen et al., 2021f), DINO (Caron et al., 2021), and USST (Xie et al., 2021c). Baseline models include DiNTS (He et al., 2021b), ResUNet (Zhang et al., 2018), and AttentionUNet (Oktay et al., 2018).
Fig. 10.
Transformer-based segmentation applied to other medical image modalities such as endoscopy, microscopy, retinopathy, ultrasound, X-ray, and camera images. The compared methods include Pyramid Trans (Zhang et al., 2021d), MBT-Net (Zhang et al., 2021a), MCTrans (Ji et al., 2021), X-Net (Li et al., 2021d), TransAttUnet (Chen et al., 2021a), MedT (Valanarasu et al., 2021), Swin-UNet (Nguyen et al., 2021), SpecTr (Yun et al., 2021), RT-Net (Huang et al., 2022b), and ConvNet-based models (ResUNet (Zhang et al., 2018), UNet (Ronneberger et al., 2015), UNet++ (Zhou et al., 2018), and AttentionUNet (Oktay et al., 2018)).
Fig. 11.
Schematic illustration of Transformer-based image registration networks. (a) ViT-V-Net (Chen et al., 2021c). (b) TransMorph (Chen et al., 2022b). (c) PC-SwinMorph (Liu et al., 2022a). (d) DTN (Zhang et al., 2021c). These network architectures are based predominantly on the hybrid ConvNet-Transformer design.
Fig. 12.
Transformer-based networks: (a) ReconFormer (Guo et al., 2022d), (b) DuDoTrans (Wang et al., 2021a), (c) T2Net (Feng et al., 2021), and (d) TransCT (Zhang et al., 2021e). (a) and (b) are reconstruction models; (c) and (d) are enhancement models. These structures are based on the hybrid ConvNet-Transformer design.
Fig. 13.
Reconstructions of the Transformer-based DuDoTrans (Wang et al., 2021a) versus a ConvNet with 72 and 96 sparse views on the NIH-AAPM-Mayo (McCollough, 2016) dataset; the zoomed-in images are shown in the last row. With the included Property M2, the Transformer-based DuDoTrans obtains better overall performance, especially on bones, and alleviates the FBP artifacts, although the recovered soft tissues are not as sharp as in the ConvNet results.
