PLoS One. 2025 Aug 22;20(8):e0330328. doi: 10.1371/journal.pone.0330328. eCollection 2025.

Progressive decomposition of infrared and visible image fusion network with joint transformer and Resnet


Fang Zhu et al. PLoS One. 2025.

Abstract

The objective of image fusion is to synthesize information from multiple source images into a single, information-rich, high-quality composite, thereby enhancing both human visual interpretation and machine perception. This process also establishes a robust foundation for downstream image-related tasks. Nevertheless, current deep learning-based networks frequently neglect the distinctive features inherent in the source images, making it difficult to balance basic and detailed features effectively. To address this limitation, we introduce a progressive decomposition network that integrates the Lite Transformer (LT) and a ResNet architecture for infrared and visible image fusion (IVIF). Our methodology unfolds in three principal stages. Initially, a foundational convolutional neural network (CNN) is deployed to extract coarse-scale features from the source images; the LT is then employed to bifurcate these coarse features into basic and detailed feature components. In the second stage, to augment the detail information extracted across layers, we replace the conventional ResNet preprocessing with a combination of the coarse features and the LT module. Cascaded LT operations are implemented following the first two ResNet blocks (ResB), enabling two-branch feature extraction from these reconfigured blocks. The final stage involves the design of specialized fusion sub-networks to process the basic and detail information blocks extracted from different layers. These processed feature blocks are then channeled through a semantic injection module (SIM) and Transformer decoders to generate the fused image. Complementing this architecture, we have developed a semantic information extraction module that aligns with the progressive inter-layer detail extraction framework.
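The abstract does not specify the internals of the LT split, but the basic/detail decomposition it describes can be illustrated with a toy stand-in: a moving-average low-pass filter plays the role of the "basic" branch and the residual plays the role of the "detail" branch (an assumption for illustration only, not the paper's Lite Transformer).

```python
def decompose(signal, k=3):
    """Toy base/detail split over a 1-D feature vector.

    A moving average of window k stands in for the 'basic' (low-frequency)
    component; the residual stands in for the 'detail' component. By
    construction, basic[i] + detail[i] reconstructs signal[i] exactly.
    """
    n = len(signal)
    basic = []
    for i in range(n):
        lo, hi = max(0, i - k // 2), min(n, i + k // 2 + 1)
        basic.append(sum(signal[lo:hi]) / (hi - lo))
    detail = [s - b for s, b in zip(signal, basic)]
    return basic, detail
```

Any low-pass/residual pair has this lossless-reconstruction property, which is what lets the fused image be rebuilt from separately processed basic and detail streams.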
The LT module is strategically embedded within the ResNet network architecture to optimize the extraction of both basic and detailed features across diverse layers. Moreover, we introduce a novel correlation loss function that operates on the basic and detail information between layers, facilitating the correlation of basic features while maintaining the independence of detail features across layers. Through comprehensive qualitative and quantitative analyses conducted on multiple infrared-visible datasets, we demonstrate the superior potential of our proposed network for advanced visual tasks. Our network exhibits remarkable performance in detail extraction, significantly outperforming existing deep learning methodologies in this domain.
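The exact formulation of the correlation loss is not given in the abstract. As a minimal pure-Python sketch, assuming Pearson correlation as the correlation measure, a loss that pulls inter-layer basic features toward correlation 1 while pushing inter-layer detail features toward correlation 0 might look like:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy + 1e-12)  # eps guards against zero variance

def correlation_loss(basic_a, basic_b, detail_a, detail_b):
    """Hypothetical inter-layer correlation loss (sketch, not the paper's).

    basic_term is minimized when basic features of the two layers are
    perfectly correlated; detail_term is minimized when detail features
    are uncorrelated (independent across layers).
    """
    basic_term = 1.0 - pearson(basic_a, basic_b)
    detail_term = abs(pearson(detail_a, detail_b))
    return basic_term + detail_term
```

With perfectly correlated basic vectors and uncorrelated detail vectors, both terms vanish, matching the stated goal of correlated basic features and independent detail features across layers.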


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Provides a visual comparison of different fusion methods.
Fig 2. A comparative analysis of streamlined fusion frameworks.
Fig 3. Overall framework diagram.
Fig 4. Annotated supplementary documentation for the comprehensive framework schematic.
Fig 5. A practical demonstration of key modules.
Fig 6. IR & VIS image encoding flowchart.
Fig 7. IR & VIS image decoding flowchart.
Fig 8. Qualitative comparison of ten fusion methods on scene 05005 from the RoadScene dataset.
Fig 9. Localized zoom effects for ten fusion results of the 05005 scene.
Fig 10. Cumulative distribution of the six metrics on the RoadScene dataset.
Fig 11. Qualitative comparison of ten fusion methods on scene 00754N from the MSRS dataset.
Fig 12. Cumulative distribution of the six metrics on the MSRS dataset.
Fig 13. Qualitative comparison of ten fusion methods on scene 01443 from the M3FD dataset.
Fig 14. Cumulative distribution of the six metrics on the M3FD dataset.
Fig 15. Segmentation results of ten fusion methods under BANet.
Fig 16. Visualization results of ablation experiments.

