[Preprint]. 2025 May 9:arXiv:2505.05736v1.

Multimodal Integrated Knowledge Transfer to Large Language Models through Preference Optimization with Biomedical Applications


Da Wu et al. ArXiv.

Abstract

The scarcity of high-quality multimodal biomedical data limits the ability to effectively fine-tune pretrained Large Language Models (LLMs) for specialized biomedical tasks. To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from high-quality multimodal biomedical data through preference optimization. While MINT supports different optimization techniques, we primarily implement it with the Odds Ratio Preference Optimization (ORPO) framework as its backbone. This strategy enables the aligned LLMs to perform predictive tasks using text-only or image-only inputs while retaining knowledge learnt from multimodal data. MINT leverages an upstream multimodal machine learning (MML) model trained on high-quality multimodal data to transfer domain-specific insights to downstream text-only or image-only LLMs. We demonstrate MINT's effectiveness through two key applications: (1) Rare genetic disease prediction from text, where MINT uses a multimodal encoder model, trained on facial photos and clinical notes, to generate a preference dataset for aligning a lightweight decoder-based text-only LLM (Llama 3.2-3B-Instruct). Despite relying on text input only, the MINT-derived model outperforms models trained with Supervised Fine-Tuning (SFT), Retrieval-Augmented Generation (RAG), or Direct Preference Optimization (DPO), and even outperforms a much larger foundation model (Llama 3.1-405B-Instruct). (2) Tissue type classification using cell nucleus images, where MINT uses a vision-language foundation model, which contains knowledge learnt from both text and histopathological images, as the preference generator to align downstream image-only models. The resulting MINT-derived model significantly improves the performance of Llama 3.2-Vision-11B-Instruct on tissue type classification. In summary, MINT provides an effective strategy to align unimodal LLMs with high-quality multimodal expertise through preference optimization. Our study also highlights a hybrid strategy that grafts the strength of encoder models in classification tasks onto large decoder models to enhance reasoning, improve predictive performance, and reduce hallucination in biomedical applications.
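To make the pipeline concrete, the step that converts the upstream multimodal model's ranked predictions into a preference dataset can be sketched in a few lines. This is a hypothetical illustration under our own assumptions, not the authors' implementation; the function name build_preference_records, the prompt wording, and the prompt/chosen/rejected record format (the convention used by common preference-optimization trainers) are assumed for illustration only:

# Hypothetical sketch: turning an upstream multimodal classifier's ranked
# predictions into chosen/rejected preference pairs for a text-only LLM.
# Names, prompt wording, and record format are illustrative assumptions.
import itertools
from typing import Dict, List, Sequence


def build_preference_records(
    clinical_note: str,
    ranked_diagnoses: Sequence[str],  # disorder labels sorted by the multimodal model, most likely first
    top_k: int = 3,                   # most-likely predictions become "chosen" responses
    bottom_q: int = 3,                # least-likely predictions become "rejected" responses
) -> List[Dict[str, str]]:
    """Pair preferred (top-k) and non-preferred (bottom-q) labels into preference records."""
    prompt = (
        "Based on the following clinical notes, name the most likely rare "
        f"genetic disorder.\n\nClinical notes: {clinical_note}"
    )
    chosen_labels = ranked_diagnoses[:top_k]
    rejected_labels = ranked_diagnoses[-bottom_q:]

    return [
        {
            "prompt": prompt,
            "chosen": f"The most likely diagnosis is {good}.",
            "rejected": f"The most likely diagnosis is {bad}.",
        }
        for good, bad in itertools.product(chosen_labels, rejected_labels)
    ]

If the Acceptance-over-Rejection (AoR) ratio examined in Figure 2b denotes, as the name suggests, the proportion of accepted to rejected responses retained per case, then varying top_k relative to bottom_q in a sketch like this would reproduce that sweep, with the balanced setting (AoR = 1) corresponding to top_k = bottom_q.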

Keywords: Direct Preference Optimization; Human Phenotype Ontology; Large Language Models; Odds Ratio Preference Optimization; Rare Genetic Disorders; Retrieval Augmented Generation; Supervised Fine-Tuning.


Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Figure 1. Overview of the MINT framework for transferring multimodal knowledge to Large Language Models.
The framework consists of several pipelines: (1) Upstream Pipeline: A multimodal classifier integrates text and image modality input data to generate the top-k most-likely and bottom-q least-likely predictions, which are organized into chosen (preferred) and rejected (non-preferred) responses in natural language to form a preference dataset. (2) Downstream Pipeline-SFT: Standard supervised fine-tuning of the base language or vision-language model. (3) Downstream Pipeline-MINT with DPO: Direct Preference Optimization approach that uses a frozen reference model (initialized from SFT) and a trainable policy model, optimizing with KL-divergence and maximum likelihood objectives. (4) Downstream Pipeline-MINT with ORPO (default): Our proposed unified framework combining negative log likelihood and odds ratio loss in a single step, directly optimizing the relative probabilities between chosen and rejected responses without requiring a separate reference model.
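For readers unfamiliar with the two preference objectives named in the downstream pipelines, they can be written out in their standard forms. These are the textbook DPO and ORPO formulations, not equations reproduced from this preprint: x denotes the prompt, y_w and y_l the chosen and rejected responses, \pi_\theta the trainable policy, \pi_{ref} the frozen reference model, P_\theta(y|x) the (length-normalized) likelihood the policy assigns to a response, \sigma the logistic sigmoid, and \beta, \lambda scalar hyperparameters.

% Standard DPO objective (requires the frozen reference model \pi_{ref}):
\mathcal{L}_{\mathrm{DPO}}
  = -\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)

% Standard ORPO objective (single step, no reference model): negative
% log-likelihood on the chosen response plus a weighted odds-ratio term.
\mathcal{L}_{\mathrm{ORPO}}
  = \mathcal{L}_{\mathrm{NLL}}(y_w \mid x)
  - \lambda \, \log \sigma\!\left(
      \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}
    \right),
\qquad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}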
Figure 2. Performance evaluation of rare disease prediction techniques using the Llama 3.2-3B-Instruct model.
(a) Comparison of model performance across four evaluation metrics: Hallucination-Free Accuracy (HFA), Top-10 accuracy, Top-1 accuracy, and Coverage-Avoidance Ratio (CAR), for five approaches: Base Model, RAG, SFT, MINT with DPO, and MINT with ORPO (colored from light to dark, respectively). (b) Effect of varying Acceptance-over-Rejection (AoR) ratios on MINT performance, showing optimal performance at a balanced ratio (AoR = 1). (c) Radar chart comparing performance on six language understanding benchmarks, demonstrating preserved general capabilities across all fine-tuning techniques.
Figure 3. Performance on tissue type classification using different fine-tuning techniques on the Llama 3.2-Vision-11B-Instruct foundation model.
(a) Bar chart showing performance metrics across four evaluation criteria: Hallucination-Free Accuracy (%), Top-5 Accuracy (%), Top-1 Accuracy (%), and Coverage-Avoidance Ratio (CAR). Four fine-tuning approaches are compared: Base Model, SFT, MINT with DPO, and MINT with ORPO (colored from light to dark, respectively). (b) Radar chart comparing performance across multiple general vision-language capabilities for different fine-tuning techniques.
Figure 4. Comparative analysis of tissue type classification performance between Base model, SFT, and MINT for similar-looking bile duct and colon tissues.
The figure demonstrates how MINT improves discrimination between histologically similar tissues by leveraging both positive and negative training examples. The top panel shows bile duct tissue classification: a representative training sample with corresponding chosen and rejected tissue types (left), and four testing samples with their respective ranks assigned by Base, SFT, and MINT (right). The bottom panel shows the same analysis for colon tissue classification. Green values represent rankings for the ground-truth tissue class, while red values indicate rankings for the visually similar confused class. Lower values represent higher confidence (rank 1 is the highest). Average ranks across all test samples are shown at the bottom of each panel. ‘MINT’ refers to our default implementation of the MINT framework using ORPO.


