REMEDY: Recipe Merging Dynamics in Large Vision-Language Models

Authors: Didi Zhu, Yibing Song, Tao Shen, Ziyu Zhao, Jinluan Yang, Min Zhang, Chao Wu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The paper states: "Experimental results demonstrate that our method consistently improves performance on both seen and unseen tasks, underscoring the effectiveness of REMEDY in diverse multi-modal scenarios." It includes a dedicated "4 EXPERIMENT" section and tables showing "Performance comparison across different fine-tuning strategies" (Table 1) and "Performance comparison of model fusion methods on seen and unseen tasks" (Table 2).
Researcher Affiliation | Collaboration | The author affiliations include "Zhejiang University" (academic), "DAMO Academy, Alibaba Group" (industry), "East China Normal University" (academic), "Academy of Social Governance, Zhejiang University" (academic), "Hupan Lab" (industry/research lab), and "Academy of Computer Science and Technology, Zhejiang University" (academic). This mix indicates a collaboration between academia and industry.
Pseudocode | No | The paper describes the methodology in narrative text and figures (e.g., Figure 3), but does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | The paper states: "A code example is provided as supplementary material, demonstrating the core components of our approach. Upon acceptance, we will release all of the data and the complete training and testing code to facilitate the full reproducibility of our results."
Open Datasets | Yes | The paper utilizes several well-known public datasets, citing their original publications: Flickr30k (Young et al., 2014), COCO (Lin et al., 2014), ScienceQA (Lu et al., 2022), TextVQA (Singh et al., 2019), MM-Vet (Yu et al., 2024), MMBench (Zhang et al., 2023a), MMBench-Chinese (Zhang et al., 2023a), VizWiz (Gurari et al., 2018), POPE (Li et al., 2023), and TextCaps (Sidorov et al., 2020).
Dataset Splits | No | The paper mentions data sizes for datasets in Table 6 and describes sampling 1000 data points from seen datasets for training the allocator, but it does not provide specific training, validation, and test splits (e.g., percentages, exact counts, or references to standard splits) for the main experiments.
Hardware Specification | No | The paper mentions "Training GPU Memory (GB)" and "Inference GPU Memory (GB)" in Table 7, but it does not specify the exact GPU models, CPU types, or other detailed hardware specifications used for the experiments.
Software Dependencies | No | The paper does not explicitly mention specific software dependencies or their version numbers (e.g., Python, PyTorch, CUDA versions) used in the implementation.
Experiment Setup | Yes | Table 6, titled "Recipe Finetuning Configurations", explicitly lists learning rates and training epochs for Flickr30K (2e-5, 1 epoch), COCO (2e-5, 1 epoch), SQA (2e-4, 5 epochs), and TextVQA (2e-5, 5 epochs). It also states that "The rank of LoRA is set to 128." and "The learning rate of the allocator is set to 2e-4." for training the modality-aware allocator.
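The hyperparameters reported above can be collected into a small lookup for anyone attempting a reproduction. This is a minimal sketch assuming only the values stated in Table 6; the dictionary, constant, and helper names are illustrative and do not come from the paper or its code:

```python
# Recipe finetuning configurations as reported in Table 6 of the paper.
# Keys are recipe datasets; values give the LoRA learning rate and epoch count.
RECIPE_CONFIGS = {
    "Flickr30K": {"lr": 2e-5, "epochs": 1},
    "COCO":      {"lr": 2e-5, "epochs": 1},
    "SQA":       {"lr": 2e-4, "epochs": 5},
    "TextVQA":   {"lr": 2e-5, "epochs": 5},
}

LORA_RANK = 128      # "The rank of LoRA is set to 128."
ALLOCATOR_LR = 2e-4  # "The learning rate of the allocator is set to 2e-4."


def config_for(dataset: str) -> dict:
    """Return the reported finetuning hyperparameters for a recipe dataset."""
    if dataset not in RECIPE_CONFIGS:
        raise KeyError(f"No reported configuration for dataset: {dataset}")
    return RECIPE_CONFIGS[dataset]


print(config_for("SQA"))  # {'lr': 0.0002, 'epochs': 5}
```

Keeping the reported values in one structure like this makes it easy to audit a reproduction run against the paper's stated setup.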