Diffusion Instruction Tuning

Authors: Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Alexander Teare

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce Lavender, a simple supervised finetuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. Specifically, Lavender aligns the text-vision attention in the VLM transformer with the equivalent used by Stable Diffusion during SFT, instead of adapting separate encoders. This alignment enriches the model's visual understanding and significantly boosts performance across in- and out-of-distribution tasks. Lavender requires just 0.13 million training examples (2.5% of typical large-scale SFT datasets) and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves state-of-the-art open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and a 68% boost on challenging out-of-distribution medical QA tasks.
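The alignment objective described above can be sketched as follows. This is a minimal sketch, not the paper's exact implementation: the tensor shapes, the `weight` trade-off, and the assumption that the VLM and Stable Diffusion attention maps are already at a common per-word resolution are all illustrative.

```python
import torch
import torch.nn.functional as F

def lavender_loss(vlm_attn, sd_attn, logits, labels, weight=0.1):
    """Standard SFT cross-entropy plus an MSE term that pulls the VLM's
    text-vision attention maps toward the per-word attention targets
    extracted from Stable Diffusion.

    vlm_attn, sd_attn: (batch, words, h, w) attention maps (assumed aligned)
    logits: (batch, seq, vocab); labels: (batch, seq) token ids
    weight: illustrative trade-off between the SFT and alignment terms
    """
    sft_loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    align_loss = F.mse_loss(vlm_attn, sd_attn)
    return sft_loss + weight * align_loss
```

Because the alignment term is just an auxiliary MSE added to the usual token loss, it drops into an existing SFT loop without changing the data pipeline beyond supplying the precomputed diffusion attention targets.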
Researcher Affiliation | Industry | 1Centre for AI, AstraZeneca, Cambridge, UK; 2Google DeepMind, UK. Correspondence to: EMAIL.
Pseudocode | Yes | Figure 4. Sketch of Diffusion Instruction Tuning (left) and short pseudocode (right), whose full version is available in Appendix F. Algorithm 1: Diffusion Instruction Tuning. Algorithm 2: Diffusion Instruction Tuning (Full Version).
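A compressed reading of that pseudocode might look like the loop below. All names here (`vlm`, `aligner`, the `alpha` weighting, the tuple layout of `dataset`) are assumptions for illustration; the authoritative version is Algorithm 2 in the paper's appendix.

```python
def diffusion_instruction_tuning(vlm, aligner, dataset, optimizer, alpha=0.1):
    """One illustrative pass of Diffusion Instruction Tuning.

    Per example, the VLM is fine-tuned with its usual token loss plus an
    MSE term that aligns its text-vision attention (mapped through a
    small Aligner network) to Stable Diffusion attention targets that
    were precomputed offline.
    """
    losses = []
    for image, question, answer, sd_attn in dataset:
        # Assumed interface: the VLM returns its token loss and its
        # text-vision attention maps in one forward pass.
        token_loss, vlm_attn = vlm(image, question, answer)
        align_loss = ((aligner(vlm_attn) - sd_attn) ** 2).mean()
        loss = token_loss + alpha * align_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(float(loss))
    return losses
```

Note that the diffusion model only appears offline, when the `sd_attn` targets are generated; the fine-tuning loop itself never runs Stable Diffusion.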
Open Source Code | Yes | Code, training data, and models are available on the project page.
Open Datasets | Yes | For dataset preparation, we process images with Stable Diffusion to obtain per-word attention targets across Flickr30k, Laion50k, RLAIF-V 83k, and OCRVQA30k.
Dataset Splits | Yes | The evaluation metrics in this paper adhere to the default settings of the respective benchmarks.
Hardware Specification | Yes | fine-tunes on standard hardware (8 GPUs) in a single day. ... All experiments are conducted on NVIDIA GPUs (V100, A10G, or A100), using PyTorch as our deep learning framework. ... We limit the inversion steps to 5 and diffusion steps to 10 for efficiency, enabling us to process each image in roughly 20 seconds on a single V100 GPU.
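As a back-of-envelope check on the offline preprocessing cost implied by the figures quoted above (this assumes a flat ~20 s/image for all 0.13M examples; the single-day figure in the abstract refers to fine-tuning, not this offline attention-extraction step):

```python
# Rough offline-preprocessing cost from the quoted figures.
examples = 130_000        # 0.13 million training examples
secs_per_image = 20       # ~20 s per image on a single V100
gpu_hours = examples * secs_per_image / 3600
per_gpu_hours = gpu_hours / 8  # if spread across the 8-GPU setup
print(round(gpu_hours))        # ~722 single-GPU hours in total
print(round(per_gpu_hours, 1)) # ~90.3 hours per GPU across 8 GPUs
```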
Software Dependencies | No | All experiments are conducted on NVIDIA GPUs (V100, A10G, or A100), using PyTorch as our deep learning framework. For large-scale training and efficient memory usage, we employ DeepSpeed in the MiniCPM-Llama3-v2.5 experiments and Fully Sharded Data Parallel (FSDP) for the Llama-3.2-11B-Vision-Instruct model.
Experiment Setup | Yes | Further details on the DM inversion process, attention extraction, training hyperparameters, and computing environments are provided in Appendix L. Full details, including hyperparameters, dataset preprocessing, and design choices, are provided in Appendix M. ... Pretraining the Aligner Network... By carefully scaling the learning rate during this phase... Attention Aggregation and Normalization Choices. We also apply instance or batch normalization within the Aligner network... Configuring the Aligner Network... Short Training Schedules and PEFT. We limit training to a fraction of an epoch to minimise overfitting and catastrophic forgetting. Additionally, we incorporate Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, to constrain the number of parameters being updated. ... Sampling Strategies. During fine-tuning, the VLM predicts text for each image and question. We define two sampling strategies to determine which words from the predicted text are eligible for computing the MSE loss: root word match and exact word match.
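The two sampling strategies named at the end of the row above can be sketched as follows. The naive suffix-stripping `stem` below is an illustrative stand-in for a real stemmer or lemmatizer, which the quoted text does not specify; only the eligibility logic (which predicted words get an MSE loss) is taken from the description.

```python
def exact_word_match(predicted, reference):
    """Words eligible for the MSE loss under exact matching:
    predicted words that appear verbatim in the reference text."""
    ref_words = set(reference.lower().split())
    return [w for w in predicted.lower().split() if w in ref_words]

def root_word_match(predicted, reference, stem=lambda w: w.rstrip("s")):
    """Looser eligibility: compare crude word roots rather than surface
    forms, so inflectional variants (e.g. 'dog' vs 'dogs') still match.
    `stem` is an illustrative placeholder, not the paper's stemmer."""
    ref_roots = {stem(w) for w in reference.lower().split()}
    return [w for w in predicted.lower().split() if stem(w) in ref_roots]
```

Under this sketch, root word match is strictly more permissive than exact word match, so more predicted words contribute attention maps to the alignment loss.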