Front Artif Intell. 2025 Aug 12;8:1527980.
doi: 10.3389/frai.2025.1527980. eCollection 2025.

MedAlmighty: enhancing disease diagnosis with large vision model distillation



Yajing Ren et al. Front Artif Intell.

Abstract

Introduction: Accurate disease diagnosis is critical in the medical field, yet it remains a challenging task due to the limited, heterogeneous, and complex nature of medical data. These challenges are particularly pronounced in multimodal tasks requiring the integration of diverse data sources. While lightweight models offer computational efficiency, they often lack the comprehensive understanding necessary for reliable clinical predictions. Conversely, large vision models, trained on extensive general-domain datasets, provide strong generalization but fall short in specialized medical applications due to domain mismatch and limited medical data availability.

Methods: To bridge the gap between general and specialized performance, we propose MedAlmighty, a knowledge distillation-based framework that synergizes the strengths of both large and small models. In this approach, we use DINOv2, a pre-trained large vision model, as a frozen teacher, and a lightweight convolutional neural network (CNN) as the trainable student. The student model is trained using both hard labels from the ground truth and soft targets generated by the teacher model. We adopt a hybrid loss function that combines cross-entropy loss (for classification accuracy) and Kullback-Leibler divergence (for distillation), enabling the student model to capture rich semantic features while remaining efficient and domain-aware.
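The hybrid teacher-student objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the exact weighting scheme, (1 - alpha) * CE + alpha * t^2 * KL, follows the standard Hinton-style distillation formulation, and the default values t = 2 and alpha = 0.2 are borrowed from the ablation settings shown in Figure 4; the paper only specifies that cross-entropy and KL divergence are combined.

```python
import numpy as np

def softmax(z, t=1.0):
    # Temperature-scaled softmax along the last axis (numerically stable).
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, t=2.0, alpha=0.2):
    """Hybrid loss: (1 - alpha) * CE(hard labels) + alpha * t^2 * KL(teacher || student).

    The weighting scheme and the t^2 rescaling (which keeps the KL gradient
    magnitude comparable to the CE term as temperature grows) are assumptions
    based on standard knowledge-distillation practice, not taken verbatim
    from the MedAlmighty paper.
    """
    n = len(labels)
    # Hard-label cross-entropy at temperature 1.
    p_student = softmax(student_logits)
    ce = -np.mean(np.log(p_student[np.arange(n), labels] + 1e-12))
    # Soft targets from the frozen teacher, both sides softened by t.
    p_teacher = softmax(teacher_logits, t)
    p_student_t = softmax(student_logits, t)
    kl = np.mean(np.sum(p_teacher * np.log((p_teacher + 1e-12) / (p_student_t + 1e-12)), axis=-1))
    return (1 - alpha) * ce + alpha * (t ** 2) * kl
```

In training, the teacher's logits would come from a frozen forward pass of DINOv2 (no gradient), while the student CNN's logits receive gradients from both terms of this loss.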

Results: Experimental evaluations reveal that MedAlmighty significantly improves disease diagnosis performance across datasets characterized by sparse and diverse medical data. The proposed model outperforms baselines by effectively integrating the generalizable representations of large models with the specialized knowledge from smaller models. The results confirm improved robustness and accuracy in complex diagnostic scenarios.

Discussion: The MedAlmighty framework demonstrates that incorporating general-domain representations via frozen large vision models, when guided by task-specific distillation strategies, can enhance the performance of lightweight medical models. This approach offers a promising solution to data scarcity and domain gap issues in medical imaging. Future work may explore extending this distillation strategy to other medical modalities and incorporating multimodal alignment for even richer representation learning.

Keywords: disease diagnosis; domain generalization; knowledge distillation; large vision model; model capacity.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Comparison of generalization and training efficiency between CNNs and DINOv2. (a) Generalization performance: CNNs struggle with robustness and accuracy on unseen data, while DINOv2 generalizes more strongly across diverse tasks thanks to self-supervised pre-training. (b) Training efficiency: DINOv2 requires significantly more computational resources and training time, limiting its practicality. (c) Synergy potential: combining the efficiency of CNNs with the generalization of DINOv2 motivates integrating both in a unified framework.
Figure 2
Comparing AUC values of DINOv2-ViTs14 with ResNet18, DINOv2-ViTb14 with ResNet50, and DINOv2-ViTl14 with ResNet50 on 12 MedMNIST datasets. Results are based on experiments using MedMNISTV2, where all models were evaluated on 224 × 224 images.
Figure 3
Comparing ACC values of DINOv2-ViTs14 with ResNet18, DINOv2-ViTb14 with ResNet50, and DINOv2-ViTl14 with ResNet50 on 12 MedMNIST datasets. Results are based on experiments using MedMNISTV2, where all models were evaluated on 224 × 224 images.
Figure 4
Performance evaluation on RetinaMNIST across (t, α) settings: AUC and ACC when varying α with temperature t = 2, and when varying t with α = 0.2.
Figure 5
t-SNE visualization of features from ResNet50, DINOv2-ViTb14, and MedAlmighty.
Figure 6
Input images (top) and heatmaps (bottom). Color intensity reflects the relative importance of image regions for the model's classification.
