Review

Vis Comput Ind Biomed Art. 2023 Jul 10;6(1):14. doi: 10.1186/s42492-023-00140-9.

Vision transformer architecture and applications in digital health: a tutorial and survey

Khalid Al-Hammuri et al.

Abstract

The vision transformer (ViT) is a state-of-the-art architecture for image recognition tasks that plays an important role in digital health applications. Medical images account for 90% of the data in digital medicine applications. This article discusses the core foundations of the ViT architecture and its digital health applications. These applications include image segmentation, classification, detection, prediction, reconstruction, synthesis, and telehealth such as report generation and security. This article also presents a roadmap for implementing the ViT in digital health systems and discusses its limitations and challenges.

Keywords: Artificial intelligence; Digital health; Medical imaging; Telehealth; Vision transformer.

Conflict of interest statement

The authors declare no competing financial or non-financial interests.

Figures

Fig. 1
Transformer architecture [1]
Fig. 2
Encoder block in the transformer architecture [1]
Fig. 3
a Illustration of splitting ultrasound images into patches and flattening them in a linear sequence; b Image patch vectorization and linear projection; c Patch embedding in multidimensional space
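The patch pipeline this caption describes (split, flatten, linear projection) can be sketched in a few lines of NumPy. This is a minimal illustration with hypothetical patch and embedding sizes; in a real ViT the projection matrix is a learned parameter, not a random one:

```python
import numpy as np

def patch_embed(image, patch=4, dim=8, rng=np.random.default_rng(0)):
    """Split (H, W, C) image into non-overlapping patches, flatten each,
    and project linearly; returns a (num_patches, dim) token sequence."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    # Reshape into a (H/p, W/p) grid of p x p x C patches, then flatten each
    patches = (image.reshape(H // patch, patch, W // patch, patch, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * C))
    # Random stand-in for the learned linear projection
    W_proj = rng.standard_normal((patch * patch * C, dim))
    return patches @ W_proj

img = np.zeros((16, 16, 1))   # e.g. a tiny grayscale ultrasound crop
tokens = patch_embed(img)
print(tokens.shape)           # (16, 8): 16 patches, 8-dim embeddings
```

The resulting token sequence is what the transformer encoder consumes in place of the word embeddings used in language models.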
Fig. 4
Positional encoding for the feature representations. Top: Sinusoidal representation for the positional encoding (P0-P3) at different indices and dimensions. Bottom: Vector representation for the positional encoding and feature embedding; P is the position encoding and E is the embedding vector
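The sinusoidal positional encoding shown in the figure can be computed as follows. This is a minimal NumPy sketch; the position count (P0-P3) and dimension are illustrative:

```python
import numpy as np

def positional_encoding(num_pos, dim):
    """Sinusoidal positional encoding:
    PE[p, 2i]   = sin(p / 10000^(2i/dim))
    PE[p, 2i+1] = cos(p / 10000^(2i/dim))"""
    pos = np.arange(num_pos)[:, None]          # positions p
    i = np.arange(0, dim, 2)[None, :]          # even dimension indices 2i
    angles = pos / np.power(10000.0, i / dim)
    pe = np.zeros((num_pos, dim))
    pe[:, 0::2] = np.sin(angles)               # even dims get sine
    pe[:, 1::2] = np.cos(angles)               # odd dims get cosine
    return pe

P = positional_encoding(4, 8)  # P0-P3, as in the figure
print(P.shape)                 # (4, 8)
```

In the ViT input layer, each positional vector P is added elementwise to the corresponding patch embedding E, giving the encoder a notion of patch order.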
Fig. 5
Multihead self-attention (MSA) process. a MSA process with several attention layers in parallel; b Scaled dot product [8]. The diagram flows upward from the bottom, in the direction of the arrow
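The scaled dot product in panel b follows the standard formula Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch, with illustrative token count and dimension:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with a numerically stable softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize the exponent
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # rows sum to 1
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((3, 4))  # 3 tokens, d_k = 4
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                         # (3, 4)
```

An MSA layer runs several such attention heads in parallel on separately projected Q, K, and V, then concatenates and linearly reprojects their outputs.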
Fig. 6
Multilayer perceptron (MLP) block
Fig. 7
Decoder and mask multihead attention block to produce the final image
Fig. 8
Distribution of medical imaging applications of the ViT according to the survey [33]
Fig. 9
Comparison of TransUNet output with the ground truth (GT) using segmentation results for different organs: a GT (expert reference) and b TransUNet [10]
Fig. 10
Example of using the ViT for tumor classification in MRI images using TransMed [53]. The tumor is enclosed by the dashed circle indicated by the yellow arrow
Fig. 11
Examples of using ViT for surgical instruction prediction. Transformer prediction is based on the SIGT method [62]. GT is used as a reference for comparison and validation
Fig. 12
Top: Different reconstruction methods from T1 weighted acquisition of the fast MRI using different methods. ZF is a traditional Fourier method [70]. LORKAS [71, 72], GANsub [73], SSDU [74], GANprior [75], and SAGAN [76] are generative adversarial network (GAN) reconstruction-based methods. SLATER is a ViT-based method [69]. Bottom: Reconstruction error map [69]
Fig. 13
Schematic of the components of the ViT in a telehealth ecosystem
Fig. 14
Examples of report generation from the input image using the ViT. a Sample of results by the IFCC algorithm [89] for report completeness and consistency; b Example of report generation results by the RTMIC algorithm [88]
Fig. 15
Illustration of data poisoning by an adversarial attack that fools learning-based models trained on medical image datasets
Fig. 16
Roadmap for ViT implementation
Fig. 17
Comparison of ViT and ResNet (BiT) architecture accuracies for different training-data sizes. The y-axis is the size of the pretraining data from the ImageNet dataset; the x-axis is the top-1 accuracy under five-shot ImageNet evaluation. Results according to the study in ref. [1]
Fig. 18
Typical transformer architecture [8]
Fig. 19
Example of using Transformer architecture for image recognition [1]
Fig. 20
a Transformer layer diagram; b TransUNet architecture [10]
Fig. 21
Swin TransUNet architecture [11]
Fig. 22
Examples of global features used for mortality prediction, numbered from (112-139). The numbers in the table depict the rank score; each column represents a feature, with its importance score given by the different methods along the horizontal axis [109]. AutoInt [111], LSTM [112], TCN [113], Transformer [8], and IMVLSTM [114] are the compared machine learning methods

References

    1. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai XH, Unterthiner T et al (2021) An image is worth 16x16 words: transformers for image recognition at scale. In: Proceedings of the 9th international conference on learning representations, OpenReview.net, Vienna, 3-7 May 2021
    2. Zhang QM, Xu YF, Zhang J, Tao DC. ViTAEv2: vision transformer advanced by exploring inductive bias for image recognition and beyond. Int J Comput Vis. 2023;131(5):1141–1162. doi: 10.1007/s11263-022-01739-w.
    3. Han K, Wang YH, Chen HT, Chen XH, Guo JY, Liu ZH, et al. A survey on vision transformer. IEEE Trans Pattern Anal Mach Intell. 2023;45(1):87–110. doi: 10.1109/TPAMI.2022.3152247.
    4. Wang RS, Lei T, Cui RX, Zhang BT, Meng HY, Nandi AK. Medical image segmentation using deep learning: a survey. IET Image Process. 2022;16(5):1243–1267. doi: 10.1049/ipr2.12419.
    5. Bai WJ, Suzuki H, Qin C, Tarroni G, Oktay O, Matthews PM et al (2018) Recurrent neural networks for aortic image sequence segmentation with sparse annotations. In: Frangi AF, Schnabel JA, Davatzikos C, Alberola-López C, Fichtinger G (eds) Medical image computing and computer assisted intervention. 21st international conference, Granada, September 2018. Lecture notes in computer science (Image processing, computer vision, pattern recognition, and graphics), vol 11073. Springer, Cham, pp 586-594. doi: 10.1007/978-3-030-00937-3_67