Review

Vis Comput Ind Biomed Art. 2023 Jul 10;6(1):14. doi: 10.1186/s42492-023-00140-9.

Vision transformer architecture and applications in digital health: a tutorial and survey

Khalid Al-Hammuri et al.

Abstract

The vision transformer (ViT) is a state-of-the-art architecture for image recognition tasks that plays an important role in digital health applications. Medical images account for 90% of the data in digital medicine applications. This article discusses the core foundations of the ViT architecture and its digital health applications. These applications include image segmentation, classification, detection, prediction, reconstruction, synthesis, and telehealth such as report generation and security. This article also presents a roadmap for implementing the ViT in digital health systems and discusses its limitations and challenges.

Keywords: Artificial intelligence; Digital health; Medical imaging; Telehealth; Vision transformer.

Conflict of interest statement

The authors declare no competing financial or non-financial interests.

Figures

Fig. 1
Transformer architecture [1]
Fig. 2
Encoder block in the transformer architecture [1]
Fig. 3
a Illustration of splitting ultrasound images into patches and flattening them in a linear sequence; b Image patch vectorization and linear projection; c Patch embedding in multidimensional space
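The patch pipeline this caption describes (split, flatten, linear projection) can be sketched in a few lines of NumPy. This is a minimal illustration with hypothetical patch and embedding sizes; in a real ViT the projection matrix is a learned parameter, not a random one:

```python
import numpy as np

def patch_embed(image, patch=4, dim=8, rng=np.random.default_rng(0)):
    """Split (H, W, C) image into non-overlapping patches, flatten each,
    and project linearly; returns a (num_patches, dim) token sequence."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    # Reshape into a (H/p, W/p) grid of p x p x C patches, then flatten each
    patches = (image.reshape(H // patch, patch, W // patch, patch, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * C))
    # Random stand-in for the learned linear projection
    W_proj = rng.standard_normal((patch * patch * C, dim))
    return patches @ W_proj

img = np.zeros((16, 16, 1))   # e.g. a tiny grayscale ultrasound crop
tokens = patch_embed(img)
print(tokens.shape)           # (16, 8): 16 patches, 8-dim embeddings
```

The resulting token sequence is what the transformer encoder consumes in place of the word embeddings used in language models.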
Fig. 4
Positional encoding for the feature representations. Top: Sinusoidal representation for the positional encoding (P0-P3) at different indices and dimensions. Bottom: Vector representation for the positional encoding and feature embedding; P is the position encoding and E is the embedding vector
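The sinusoidal positional encoding shown in the figure can be computed as follows. This is a minimal NumPy sketch; the position count (P0-P3) and dimension are illustrative:

```python
import numpy as np

def positional_encoding(num_pos, dim):
    """Sinusoidal positional encoding:
    PE[p, 2i]   = sin(p / 10000^(2i/dim))
    PE[p, 2i+1] = cos(p / 10000^(2i/dim))"""
    pos = np.arange(num_pos)[:, None]          # positions p
    i = np.arange(0, dim, 2)[None, :]          # even dimension indices 2i
    angles = pos / np.power(10000.0, i / dim)
    pe = np.zeros((num_pos, dim))
    pe[:, 0::2] = np.sin(angles)               # even dims get sine
    pe[:, 1::2] = np.cos(angles)               # odd dims get cosine
    return pe

P = positional_encoding(4, 8)  # P0-P3, as in the figure
print(P.shape)                 # (4, 8)
```

In the ViT input layer, each positional vector P is added elementwise to the corresponding patch embedding E, giving the encoder a notion of patch order.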
Fig. 5
Multihead self-attention (MSA) process. a MSA process with several attention layers in parallel; b Scaled dot product [8]. The diagram flows upward from the bottom, in the direction of the arrow
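The scaled dot product in panel b follows the standard formula Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch, with illustrative token count and dimension:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with a numerically stable softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize the exponent
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # rows sum to 1
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((3, 4))  # 3 tokens, d_k = 4
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                         # (3, 4)
```

An MSA layer runs several such attention heads in parallel on separately projected Q, K, and V, then concatenates and linearly reprojects their outputs.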
Fig. 6
Multilayer perceptron (MLP) block
Fig. 7
Decoder and mask multihead attention block to produce the final image
Fig. 8
Distribution of medical imaging applications of the ViT according to the survey [33]
Fig. 9
Comparison of TransUNet output with the ground truth (GT) using segmentation results for different organs: a GT (expert reference) and b TransUNet [10]
Fig. 10
Example of using the ViT for tumor classification in MRI images using TransMed [53]. The tumor is enclosed by the dashed circle indicated by the yellow arrow
Fig. 11
Examples of using ViT for surgical instruction prediction. Transformer prediction is based on the SIGT method [62]. GT is used as a reference for comparison and validation
Fig. 12
Top: Different reconstruction methods from T1 weighted acquisition of the fast MRI using different methods. ZF is a traditional Fourier method [70]. LORKAS [71, 72], GANsub [73], SSDU [74], GANprior [75], and SAGAN [76] are generative adversarial network (GAN) reconstruction-based methods. SLATER is a ViT-based method [69]. Bottom: Reconstruction error map [69]
Fig. 13
Schematic of the components of the ViT in a telehealth ecosystem
Fig. 14
Examples of report generation from the input image using the ViT. a Sample of results by the IFCC algorithm [89] for report completeness and consistency; b Example of report generation results by the RTMIC algorithm [88]
Fig. 15
Illustration of data poisoning by an adversarial attack that fools learning-based models trained on medical image datasets
Fig. 16
Roadmap for ViT implementation
Fig. 17
Comparison of ViT and ResNet (BiT) architecture accuracies for different training-data sizes. The y-axis is the size of the pretraining data from the ImageNet dataset; the x-axis is the top-1 accuracy under five-shot ImageNet evaluation. Results according to the study in ref. [1]
Fig. 18
Typical transformer architecture [8]
Fig. 19
Example of using Transformer architecture for image recognition [1]
Fig. 20
a Transformer layer diagram; b TransUNet architecture [10]
Fig. 21
Swin TransUNet architecture [11]
Fig. 22
Examples of global features used for mortality prediction, numbered from (112-139). The numbers in the table depict the rank score; each column represents a feature, with its importance score given by the different methods along the horizontal axis [109]. AutoInt [111], LSTM [112], TCN [113], Transformer [8], and IMVLSTM [114] are the compared machine learning methods

References

    1. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai XH, Unterthiner T et al (2021) An image is worth 16x16 words: transformers for image recognition at scale. In: Proceedings of the 9th international conference on learning representations, OpenReview.net, Vienna, 3-7 May 2021
    2. Zhang QM, Xu YF, Zhang J, Tao DC. ViTAEv2: vision transformer advanced by exploring inductive bias for image recognition and beyond. Int J Comput Vis. 2023;131(5):1141–1162. doi: 10.1007/s11263-022-01739-w.
    3. Han K, Wang YH, Chen HT, Chen XH, Guo JY, Liu ZH, et al. A survey on vision transformer. IEEE Trans Pattern Anal Mach Intell. 2023;45(1):87–110. doi: 10.1109/TPAMI.2022.3152247.
    4. Wang RS, Lei T, Cui RX, Zhang BT, Meng HY, Nandi AK. Medical image segmentation using deep learning: a survey. IET Image Process. 2022;16(5):1243–1267. doi: 10.1049/ipr2.12419.
    5. Bai WJ, Suzuki H, Qin C, Tarroni G, Oktay O, Matthews PM et al (2018) Recurrent neural networks for aortic image sequence segmentation with sparse annotations. In: Frangi AF, Schnabel JA, Davatzikos C, Alberola-López C, Fichtinger G (eds) Medical image computing and computer assisted intervention. 21st international conference, Granada, September 2018. Lecture notes in computer science (Image processing, computer vision, pattern recognition, and graphics), vol 11073. Springer, Cham, pp 586-594. doi: 10.1007/978-3-030-00937-3_67