Dual Vision Transformer

Ting Yao et al. IEEE Trans Pattern Anal Mach Intell. 2023 Sep;45(9):10870-10882. doi: 10.1109/TPAMI.2023.3268446. Epub 2023 Aug 7.

Abstract

Recent advances have presented several strategies to mitigate the computational cost of the self-attention mechanism on high-resolution inputs. Many of these works decompose global self-attention over image patches into regional and local feature extraction procedures, each of which incurs a smaller computational complexity. Despite their efficiency, these approaches seldom explore the holistic interactions among all patches and thus struggle to fully capture global semantics. In this paper, we propose a novel Transformer architecture, the Dual Vision Transformer (Dual-ViT), that elegantly exploits global semantics for self-attention learning. The new architecture incorporates a semantic pathway that efficiently compresses token vectors into global semantics with a reduced order of complexity. These compressed global semantics then serve as prior information for learning finer local pixel-level details through a second, pixel pathway. The semantic and pixel pathways are integrated and jointly trained, spreading the enhanced self-attention information in parallel through both pathways. Dual-ViT is thereby able to capitalize on global semantics to boost self-attention learning without substantially increasing computational cost. We empirically demonstrate that Dual-ViT achieves superior accuracy over state-of-the-art Transformer architectures with comparable training complexity.
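
To make the two pathways described in the abstract more concrete, the following is a minimal, hypothetical PyTorch-style sketch. The module name DualPathwayBlock, the pooling-based token compression, and the hyperparameters (dim, num_heads, num_semantic_tokens) are illustrative assumptions, not the authors' implementation; consult the paper for the actual Dual-ViT design.

import torch
import torch.nn as nn


class DualPathwayBlock(nn.Module):
    """Hypothetical sketch of a dual-pathway attention block.

    A semantic pathway compresses the N pixel (patch) tokens into M << N
    global semantic tokens and runs cheap self-attention over them; a pixel
    pathway then lets every pixel token cross-attend to those semantics as
    prior information. All design details here are illustrative assumptions.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8, num_semantic_tokens: int = 16):
        super().__init__()
        # Compression via adaptive pooling (an assumption; any learned downsampling works).
        self.compress = nn.AdaptiveAvgPool1d(num_semantic_tokens)
        # Semantic pathway: self-attention over M tokens costs O(M^2) instead of O(N^2).
        self.sem_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Pixel pathway: cross-attention from N pixel tokens to M semantic tokens, O(N*M).
        self.pix_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_sem = nn.LayerNorm(dim)
        self.norm_pix = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) pixel/patch tokens.
        # Semantic pathway: compress tokens, then global self-attention over the small set.
        sem = self.compress(x.transpose(1, 2)).transpose(1, 2)      # (B, M, dim)
        sem_n = self.norm_sem(sem)
        sem = sem + self.sem_attn(sem_n, sem_n, sem_n)[0]
        # Pixel pathway: pixel tokens query the compressed global semantics as a prior.
        x_n = self.norm_pix(x)
        x = x + self.pix_attn(x_n, sem, sem)[0]
        x = x + self.mlp(self.norm_mlp(x))
        return x


if __name__ == "__main__":
    block = DualPathwayBlock(dim=256)
    tokens = torch.randn(2, 14 * 14, 256)    # a 14x14 grid of patch tokens
    print(block(tokens).shape)               # torch.Size([2, 196, 256])

In this sketch the per-block attention cost drops from O(N^2) to roughly O(M^2 + N*M), which mirrors the abstract's claim of a reduced order of complexity; the paper describes how the two pathways are actually integrated and jointly trained in parallel.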