Robust Multimodal Large Language Models Against Modality Conflict

Authors: Zongmeng Zhang, Wengang Zhou, Jie Zhao, Houqiang Li

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental — "Extensive experiments are conducted on the MMMC dataset to analyze the merits and demerits of these methods. Our results show that the reinforcement learning method achieves the best performance in mitigating hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance."
Researcher Affiliation: Collaboration — 1. School of Artificial Intelligence and Data Science, University of Science and Technology of China; 2. Department of Electronic Engineering and Information Science, University of Science and Technology of China; 3. Huawei Technologies Co., Ltd.
Pseudocode: No — The paper describes its methods (prompt engineering, supervised fine-tuning, and reinforcement learning) and their formulations but includes no explicit pseudocode blocks or algorithms. Figure 2 illustrates a pipeline but is not pseudocode.
Open Source Code: Yes — "The code and dataset are available at https://github.com/zmzhang2000/MMMC."
Open Datasets: Yes — "The code and dataset are available at https://github.com/zmzhang2000/MMMC. ... we collect images from the widely-used vision-language dataset Visual Genome (Krishna et al., 2017), and construct natural language questions conflicting with the image content and corresponding answers."
Dataset Splits: Yes — "Finally, we obtain 20K image-question-answer triples in the MMMC dataset and randomly split them into 18K training samples and 2K testing samples."
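The reported 18K/2K split can be reproduced by a simple seeded shuffle-and-slice. The sketch below is illustrative only: the paper does not state its seed or splitting code, so the function name, seed, and placeholder samples here are our own assumptions.

```python
import random

def split_dataset(triples, n_train=18_000, seed=0):
    # Shuffle a copy (leaving the original order intact), then slice
    # into a training set and a held-out testing set.
    rng = random.Random(seed)  # fixed seed for reproducibility (assumed)
    shuffled = list(triples)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder stand-ins for the 20K image-question-answer triples.
triples = [f"sample_{i}" for i in range(20_000)]
train, test = split_dataset(triples)
```

With 20K inputs this yields disjoint sets of 18K training and 2K testing samples, matching the split reported in the paper.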
Hardware Specification: No — "This work was supported by National Key R&D Program of China under Contract 2022ZD0119802, National Natural Science Foundation of China under Contract 623B2097, and the Youth Innovation Promotion Association CAS. It was also supported by the GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC." The paper mentions a "GPU cluster" but gives no specifics such as GPU models, CPU specifications, or memory.
Software Dependencies: No — "We implement all proposed methods using Hugging Face Transformers and the OpenRLHF library (Hu et al., 2024). ... We use Llama-3.3-70B-Instruct for the reward model." The paper names software libraries and models but gives no version numbers for Hugging Face Transformers or OpenRLHF.
Experiment Setup: Yes — "For the supervised fine-tuning, we use the Adam optimizer with a learning rate of 5 × 10^-6 and a batch size of 8. We train the model for 1 epoch on the MMMC dataset with 10,000 training samples except for the ablation study. For the reinforcement learning, we use the Adam optimizer with a learning rate of 9.65 × 10^-6 and a batch size of 8. We train the model on the MMMC dataset with only 1,000 training samples since longer reinforcement learning would cause model collapse. We set the KL coefficient to 0.01 and the max response length to 128. Both the supervised fine-tuning and reinforcement learning methods are trained with LoRA (Hu et al., 2021)."
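The hyperparameters quoted above can be collected into two configuration objects, which makes the SFT/RL differences easy to compare at a glance. This is a minimal sketch: the field names and dataclass structure are our own, and only the numeric values come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SFTConfig:
    """Supervised fine-tuning setup as reported in the paper."""
    optimizer: str = "adam"
    learning_rate: float = 5e-6
    batch_size: int = 8
    num_epochs: int = 1
    num_train_samples: int = 10_000  # reduced only for the ablation study
    use_lora: bool = True

@dataclass(frozen=True)
class RLConfig:
    """Reinforcement learning setup as reported in the paper."""
    optimizer: str = "adam"
    learning_rate: float = 9.65e-6
    batch_size: int = 8
    num_train_samples: int = 1_000  # longer RL training caused model collapse
    kl_coefficient: float = 0.01
    max_response_length: int = 128
    use_lora: bool = True

sft, rl = SFTConfig(), RLConfig()
```

Frozen dataclasses keep the reported values immutable, so a typo elsewhere in an experiment script cannot silently alter them.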