REMEDY: Recipe Merging Dynamics in Large Vision-Language Models
Authors: Didi Zhu, Yibing Song, Tao Shen, Ziyu Zhao, Jinluan Yang, Min Zhang, Chao Wu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper reports that "experimental results demonstrate that our method consistently improves performance on both seen and unseen tasks, underscoring the effectiveness of REMEDY in diverse multi-modal scenarios." It includes a dedicated "4 EXPERIMENT" section and tables showing "Performance comparison across different fine-tuning strategies" (Table 1) and "Performance comparison of model fusion methods on seen and unseen tasks" (Table 2). |
| Researcher Affiliation | Collaboration | The author affiliations include "Zhejiang University" (academic), "DAMO Academy, Alibaba Group" (industry), "East China Normal University" (academic), "Academy of Social Governance, Zhejiang University" (academic), "Hupan Lab" (industry/research lab), and "Academy of Computer Science and Technology, Zhejiang University" (academic). This mix indicates a collaboration between academia and industry. |
| Pseudocode | No | The paper describes the methodology in narrative text and figures (e.g., Figure 3), but does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | The paper states that "a code example is provided as supplementary material, demonstrating the core components of our approach," and that "upon acceptance, we will release all of the data and the complete training and testing code to facilitate the full reproducibility of our results." |
| Open Datasets | Yes | The paper utilizes several well-known public datasets, citing their original publications: Flickr30k (Young et al., 2014), COCO (Lin et al., 2014), ScienceQA (Lu et al., 2022), TextVQA (Singh et al., 2019), MM-Vet (Yu et al., 2024), MMBench (Zhang et al., 2023a), MMBench-Chinese (Zhang et al., 2023a), VizWiz (Gurari et al., 2018), POPE (Li et al., 2023), and TextCaps (Sidorov et al., 2020). |
| Dataset Splits | No | The paper mentions data sizes for datasets in Table 6 and describes sampling 1000 data points from seen datasets for training the allocator, but it does not provide specific training, validation, and test splits (e.g., percentages, exact counts, or references to standard splits) for the main experiments. |
| Hardware Specification | No | The paper mentions "Training GPU Memory (GB)" and "Inference GPU Memory (GB)" in Table 7, but it does not specify the exact GPU models, CPU types, or other detailed hardware specifications used for the experiments. |
| Software Dependencies | No | The paper does not explicitly mention specific software dependencies or their version numbers (e.g., Python, PyTorch, CUDA versions) used in the implementation. |
| Experiment Setup | Yes | Table 6, titled "Recipe Finetuning Configurations", explicitly lists learning rates and training epochs for Flickr30K (2e-5, 1 epoch), COCO (2e-5, 1 epoch), SQA (2e-4, 5 epochs), and TextVQA (2e-5, 5 epochs). It also states that "The rank of LoRA is set to 128." and "The learning rate of the allocator is set to 2e-4." for training the modality-aware allocator. |
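The hyperparameters the audit extracted from Table 6 can be collected into a minimal configuration sketch. This is not the authors' code; the dictionary structure and names (`RECIPE_CONFIGS`, `LORA_RANK`, `ALLOCATOR_LR`) are assumptions, and only the numeric values come from the paper as quoted above.

```python
# Sketch of the per-recipe fine-tuning settings reported in Table 6 of the paper.
# Structure and identifiers are hypothetical; values are taken from the audit row.
RECIPE_CONFIGS = {
    "Flickr30K": {"learning_rate": 2e-5, "epochs": 1},
    "COCO":      {"learning_rate": 2e-5, "epochs": 1},
    "SQA":       {"learning_rate": 2e-4, "epochs": 5},
    "TextVQA":   {"learning_rate": 2e-5, "epochs": 5},
}

LORA_RANK = 128      # "The rank of LoRA is set to 128."
ALLOCATOR_LR = 2e-4  # learning rate of the modality-aware allocator

def get_recipe_config(task: str) -> dict:
    """Look up the fine-tuning configuration for a seen task."""
    if task not in RECIPE_CONFIGS:
        raise KeyError(f"No recipe configuration recorded for task: {task}")
    return RECIPE_CONFIGS[task]
```

Keeping the settings in one mapping like this makes the reproducibility gap concrete: everything listed here is stated in the paper, while hardware and software versions (the two "No" rows above) have no corresponding entries to record.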