Diffusion Instruction Tuning

Authors: Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Alexander Teare

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce Lavender, a simple supervised finetuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. Specifically, Lavender aligns the text-vision attention in the VLM transformer with the equivalent used by Stable Diffusion during SFT, instead of adapting separate encoders. This alignment enriches the model's visual understanding and significantly boosts performance across in- and out-of-distribution tasks. Lavender requires just 0.13 million training examples (2.5% of typical large-scale SFT datasets) and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves state-of-the-art open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and a 68% boost on challenging out-of-distribution medical QA tasks.
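The alignment objective described above can be sketched as follows. This is a minimal sketch, not the paper's exact implementation: the tensor shapes, the `weight` trade-off, and the assumption that the VLM and Stable Diffusion attention maps are already at a common per-word resolution are all illustrative.

```python
import torch
import torch.nn.functional as F

def lavender_loss(vlm_attn, sd_attn, logits, labels, weight=0.1):
    """Standard SFT cross-entropy plus an MSE term that pulls the VLM's
    text-vision attention maps toward the per-word attention targets
    extracted from Stable Diffusion.

    vlm_attn, sd_attn: (batch, words, h, w) attention maps (assumed aligned)
    logits: (batch, seq, vocab); labels: (batch, seq) token ids
    weight: illustrative trade-off between the SFT and alignment terms
    """
    sft_loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    align_loss = F.mse_loss(vlm_attn, sd_attn)
    return sft_loss + weight * align_loss
```

Because the alignment term is just an auxiliary MSE added to the usual token loss, it drops into an existing SFT loop without changing the data pipeline beyond supplying the precomputed diffusion attention targets.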
Researcher Affiliation | Industry | 1Centre for AI, AstraZeneca, Cambridge, UK; 2Google DeepMind, UK. Correspondence to: EMAIL.
Pseudocode | Yes | Figure 4. Sketch of Diffusion Instruction Tuning (left) and short pseudocode (right), whose full version is available in Appendix F. Algorithm 1: Diffusion Instruction Tuning. Algorithm 2: Diffusion Instruction Tuning (Full Version).
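A compressed reading of that pseudocode might look like the loop below. All names here (`vlm`, `aligner`, the `alpha` weighting, the tuple layout of `dataset`) are assumptions for illustration; the authoritative version is Algorithm 2 in the paper's appendix.

```python
def diffusion_instruction_tuning(vlm, aligner, dataset, optimizer, alpha=0.1):
    """One illustrative pass of Diffusion Instruction Tuning.

    Per example, the VLM is fine-tuned with its usual token loss plus an
    MSE term that aligns its text-vision attention (mapped through a
    small Aligner network) to Stable Diffusion attention targets that
    were precomputed offline.
    """
    losses = []
    for image, question, answer, sd_attn in dataset:
        # Assumed interface: the VLM returns its token loss and its
        # text-vision attention maps in one forward pass.
        token_loss, vlm_attn = vlm(image, question, answer)
        align_loss = ((aligner(vlm_attn) - sd_attn) ** 2).mean()
        loss = token_loss + alpha * align_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(float(loss))
    return losses
```

Note that the diffusion model only appears offline, when the `sd_attn` targets are generated; the fine-tuning loop itself never runs Stable Diffusion.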
Open Source Code | Yes | Code, training data, and models are available on the project page.
Open Datasets | Yes | For dataset preparation, we process images with Stable Diffusion to obtain per-word attention targets across Flickr30k, Laion50k, RLAIF-V 83k, and OCRVQA30k.
Dataset Splits | Yes | The evaluation metrics in this paper adhere to the default settings of the respective benchmarks.
Hardware Specification | Yes | fine-tunes on standard hardware (8 GPUs) in a single day. ... All experiments are conducted on NVIDIA GPUs (V100, A10G, or A100), using PyTorch as our deep learning framework. ... We limit the inversion steps to 5 and diffusion steps to 10 for efficiency, enabling us to process each image in roughly 20 seconds on a single V100 GPU.
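As a back-of-envelope check on the offline preprocessing cost implied by the figures quoted above (this assumes a flat ~20 s/image for all 0.13M examples; the single-day figure in the abstract refers to fine-tuning, not this offline attention-extraction step):

```python
# Rough offline-preprocessing cost from the quoted figures.
examples = 130_000        # 0.13 million training examples
secs_per_image = 20       # ~20 s per image on a single V100
gpu_hours = examples * secs_per_image / 3600
per_gpu_hours = gpu_hours / 8  # if spread across the 8-GPU setup
print(round(gpu_hours))        # ~722 single-GPU hours in total
print(round(per_gpu_hours, 1)) # ~90.3 hours per GPU across 8 GPUs
```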
Software Dependencies | No | All experiments are conducted on NVIDIA GPUs (V100, A10G, or A100), using PyTorch as our deep learning framework. For large-scale training and efficient memory usage, we employ DeepSpeed in the MiniCPM-Llama3-v2.5 experiments and Fully Sharded Data Parallel (FSDP) for the Llama-3.2-11B-Vision-Instruct model.
Experiment Setup | Yes | Further details on the DM inversion process, attention extraction, training hyperparameters, and computing environments are provided in Appendix L. Full details, including hyperparameters, dataset preprocessing, and design choices, are provided in Appendix M. ... Pretraining the Aligner Network... By carefully scaling the learning rate during this phase... Attention Aggregation and Normalization Choices. We also apply instance or batch normalization within the Aligner network... Configuring the Aligner Network... Short Training Schedules and PEFT. We limit training to a fraction of an epoch to minimise overfitting and catastrophic forgetting. Additionally, we incorporate Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, to constrain the number of parameters being updated. ... Sampling Strategies. During fine-tuning, the VLM predicts text for each image and question. We define two sampling strategies to determine which words from the predicted text are eligible for computing the MSE loss: root word match and exact word match.
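The two sampling strategies named at the end of the row above can be sketched as follows. The naive suffix-stripping `stem` below is an illustrative stand-in for a real stemmer or lemmatizer, which the quoted text does not specify; only the eligibility logic (which predicted words get an MSE loss) is taken from the description.

```python
def exact_word_match(predicted, reference):
    """Words eligible for the MSE loss under exact matching:
    predicted words that appear verbatim in the reference text."""
    ref_words = set(reference.lower().split())
    return [w for w in predicted.lower().split() if w in ref_words]

def root_word_match(predicted, reference, stem=lambda w: w.rstrip("s")):
    """Looser eligibility: compare crude word roots rather than surface
    forms, so inflectional variants (e.g. 'dog' vs 'dogs') still match.
    `stem` is an illustrative placeholder, not the paper's stemmer."""
    ref_roots = {stem(w) for w in reference.lower().split()}
    return [w for w in predicted.lower().split() if stem(w) in ref_roots]
```

Under this sketch, root word match is strictly more permissive than exact word match, so more predicted words contribute attention maps to the alignment loss.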