TVNet: Multimodal medical image fusion by dual-branch network with vision transformer and one-shot aggregation
- PMID: 41185898
- PMCID: PMC12586861
- DOI: 10.1177/00368504251375188
Abstract
The task of medical image fusion is to synthesize complementary information from medical images of different modalities, which is of great significance for clinical diagnosis. Existing medical image fusion algorithms rely heavily on convolution operations and cannot establish long-range dependencies across the source images, which can lead to edge blurring and loss of detail in the fused images. Because the Transformer can effectively model long-range dependencies through self-attention, a novel and effective dual-branch feature enhancement network called TVNet is proposed to fuse multimodal medical images. This network combines a Vision Transformer and a Convolutional Neural Network to extract global context and local information, preserving detailed textures and highlighting the structural characteristics of the source images. Furthermore, to capture multiscale information, an enhancement module is used to obtain multiscale characterization, and the information from the two branches is efficiently aggregated at the same time. In addition, a hybrid loss function is designed to optimize the fusion results at the structure, feature, and gradient levels. Experimental results show that the proposed fusion network outperforms seven state-of-the-art methods in both subjective visual quality and objective metrics. Our code is available at https://github.com/sineagles/TVNet.
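To make the gradient term of the hybrid loss concrete, the sketch below shows one common formulation used in fusion work: penalize the L1 distance between the fused image's gradient magnitude and the element-wise maximum of the source images' gradient magnitudes, so the fused result is pushed to retain the strongest edge from either modality. This is an illustrative NumPy sketch, not the authors' exact loss; the kernel choice (Sobel) and the max-of-sources target are assumptions.

```python
import numpy as np

def sobel_gradients(img):
    """Approximate horizontal/vertical gradients with 3x3 Sobel kernels
    (zero padding at the borders). Illustrative, not optimized."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    pad = np.pad(img.astype(float), 1)
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return gx, gy

def grad_magnitude(img):
    """L1 gradient magnitude |gx| + |gy|."""
    gx, gy = sobel_gradients(img)
    return np.abs(gx) + np.abs(gy)

def gradient_loss(fused, src_a, src_b):
    """Assumed gradient-level loss: mean L1 distance between the fused
    image's gradient magnitude and the element-wise maximum of the two
    source gradient magnitudes (keep the sharpest edge from either source)."""
    target = np.maximum(grad_magnitude(src_a), grad_magnitude(src_b))
    return np.abs(grad_magnitude(fused) - target).mean()
```

In practice this term would be combined with a structural term (e.g. SSIM-based) and a feature term into the weighted hybrid loss the abstract mentions; the loss is zero only when the fused image reproduces the dominant gradients of both modalities everywhere.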
Keywords: Medical image fusion; convolution neural network; long-range dependencies; multiscale features; vision transformer.
Conflict of interest statement
Declaration of conflicting interests: The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.