REMEDY: Recipe Merging Dynamics in Large Vision-Language Models

Authors: Didi Zhu, Yibing Song, Tao Shen, Ziyu Zhao, Jinluan Yang, Min Zhang, Chao Wu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The paper states: "Experimental results demonstrate that our method consistently improves performance on both seen and unseen tasks, underscoring the effectiveness of REMEDY in diverse multi-modal scenarios." It includes a dedicated "4 EXPERIMENT" section and tables showing "Performance comparison across different fine-tuning strategies" (Table 1) and "Performance comparison of model fusion methods on seen and unseen tasks" (Table 2).
Researcher Affiliation | Collaboration | The author affiliations include "Zhejiang University" (academic), "DAMO Academy, Alibaba Group" (industry), "East China Normal University" (academic), "Academy of Social Governance, Zhejiang University" (academic), "Hupan Lab" (industry/research lab), and "Academy of Computer Science and Technology, Zhejiang University" (academic). This mix indicates a collaboration between academia and industry.
Pseudocode | No | The paper describes the methodology in narrative text and figures (e.g., Figure 3), but does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | The paper states: "A code example is provided as supplementary material, demonstrating the core components of our approach. Upon acceptance, we will release all of the data and the complete training and testing code to facilitate the full reproducibility of our results."
Open Datasets | Yes | The paper utilizes several well-known public datasets, citing their original publications: Flickr30k (Young et al., 2014), COCO (Lin et al., 2014), ScienceQA (Lu et al., 2022), TextVQA (Singh et al., 2019), MM-Vet (Yu et al., 2024), MMBench (Zhang et al., 2023a), MMBench-Chinese (Zhang et al., 2023a), VizWiz (Gurari et al., 2018), POPE (Li et al., 2023), and TextCaps (Sidorov et al., 2020).
Dataset Splits | No | The paper mentions data sizes for datasets in Table 6 and describes sampling 1000 data points from seen datasets for training the allocator, but it does not provide specific training, validation, and test splits (e.g., percentages, exact counts, or references to standard splits) for the main experiments.
Hardware Specification | No | The paper mentions "Training GPU Memory (GB)" and "Inference GPU Memory (GB)" in Table 7, but it does not specify the exact GPU models, CPU types, or other detailed hardware specifications used for the experiments.
Software Dependencies | No | The paper does not explicitly mention specific software dependencies or their version numbers (e.g., Python, PyTorch, CUDA versions) used in the implementation.
Experiment Setup | Yes | Table 6, titled "Recipe Finetuning Configurations", explicitly lists learning rates and training epochs for Flickr30K (2e-5, 1 epoch), COCO (2e-5, 1 epoch), SQA (2e-4, 5 epochs), and TextVQA (2e-5, 5 epochs). It also states that "The rank of LoRA is set to 128." and "The learning rate of the allocator is set to 2e-4." for training the modality-aware allocator.
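The hyperparameters reported above can be collected into a small lookup for anyone attempting a reproduction. This is a minimal sketch assuming only the values stated in Table 6; the dictionary, constant, and helper names are illustrative and do not come from the paper or its code:

```python
# Recipe finetuning configurations as reported in Table 6 of the paper.
# Keys are recipe datasets; values give the LoRA learning rate and epoch count.
RECIPE_CONFIGS = {
    "Flickr30K": {"lr": 2e-5, "epochs": 1},
    "COCO":      {"lr": 2e-5, "epochs": 1},
    "SQA":       {"lr": 2e-4, "epochs": 5},
    "TextVQA":   {"lr": 2e-5, "epochs": 5},
}

LORA_RANK = 128      # "The rank of LoRA is set to 128."
ALLOCATOR_LR = 2e-4  # "The learning rate of the allocator is set to 2e-4."


def config_for(dataset: str) -> dict:
    """Return the reported finetuning hyperparameters for a recipe dataset."""
    if dataset not in RECIPE_CONFIGS:
        raise KeyError(f"No reported configuration for dataset: {dataset}")
    return RECIPE_CONFIGS[dataset]


print(config_for("SQA"))  # {'lr': 0.0002, 'epochs': 5}
```

Keeping the reported values in one structure like this makes it easy to audit a reproduction run against the paper's stated setup.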