Robust Multimodal Large Language Models Against Modality Conflict

Authors: Zongmeng Zhang, Wengang Zhou, Jie Zhao, Houqiang Li

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental — "Extensive experiments are conducted on the MMMC dataset to analyze the merits and demerits of these methods. Our results show that the reinforcement learning method achieves the best performance in mitigating hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance."
Researcher Affiliation: Collaboration — 1. School of Artificial Intelligence and Data Science, University of Science and Technology of China; 2. Department of Electronic Engineering and Information Science, University of Science and Technology of China; 3. Huawei Technologies Co., Ltd.
Pseudocode: No — The paper describes its methods (prompt engineering, supervised fine-tuning, and reinforcement learning) and their formulations but includes no explicit pseudocode blocks or algorithms. Figure 2 illustrates a pipeline but is not pseudocode.
Open Source Code: Yes — "The code and dataset are available at https://github.com/zmzhang2000/MMMC."
Open Datasets: Yes — "The code and dataset are available at https://github.com/zmzhang2000/MMMC. ... we collect images from the widely-used vision-language dataset Visual Genome (Krishna et al., 2017), and construct natural language questions conflicting with the image content and corresponding answers."
Dataset Splits: Yes — "Finally, we obtain 20K image-question-answer triples in the MMMC dataset and randomly split them into 18K training samples and 2K testing samples."
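The reported 18K/2K split can be reproduced by a simple seeded shuffle-and-slice. The sketch below is illustrative only: the paper does not state its seed or splitting code, so the function name, seed, and placeholder samples here are our own assumptions.

```python
import random

def split_dataset(triples, n_train=18_000, seed=0):
    # Shuffle a copy (leaving the original order intact), then slice
    # into a training set and a held-out testing set.
    rng = random.Random(seed)  # fixed seed for reproducibility (assumed)
    shuffled = list(triples)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder stand-ins for the 20K image-question-answer triples.
triples = [f"sample_{i}" for i in range(20_000)]
train, test = split_dataset(triples)
```

With 20K inputs this yields disjoint sets of 18K training and 2K testing samples, matching the split reported in the paper.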
Hardware Specification: No — "This work was supported by National Key R&D Program of China under Contract 2022ZD0119802, National Natural Science Foundation of China under Contract 623B2097, and the Youth Innovation Promotion Association CAS. It was also supported by the GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC." The paper mentions a "GPU cluster" but gives no specifics such as GPU models, CPU specifications, or memory.
Software Dependencies: No — "We implement all proposed methods using Hugging Face Transformers and the OpenRLHF library (Hu et al., 2024). ... We use Llama-3.3-70B-Instruct for the reward model." The paper names software libraries and models but gives no version numbers for Hugging Face Transformers or OpenRLHF.
Experiment Setup: Yes — "For the supervised fine-tuning, we use the Adam optimizer with a learning rate of 5 × 10^-6 and a batch size of 8. We train the model for 1 epoch on the MMMC dataset with 10,000 training samples except for the ablation study. For the reinforcement learning, we use the Adam optimizer with a learning rate of 9.65 × 10^-6 and a batch size of 8. We train the model on the MMMC dataset with only 1,000 training samples since longer reinforcement learning would cause model collapse. We set the KL coefficient to 0.01 and the max response length to 128. Both the supervised fine-tuning and reinforcement learning methods are trained with LoRA (Hu et al., 2021)."
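The hyperparameters quoted above can be collected into two configuration objects, which makes the SFT/RL differences easy to compare at a glance. This is a minimal sketch: the field names and dataclass structure are our own, and only the numeric values come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SFTConfig:
    """Supervised fine-tuning setup as reported in the paper."""
    optimizer: str = "adam"
    learning_rate: float = 5e-6
    batch_size: int = 8
    num_epochs: int = 1
    num_train_samples: int = 10_000  # reduced only for the ablation study
    use_lora: bool = True

@dataclass(frozen=True)
class RLConfig:
    """Reinforcement learning setup as reported in the paper."""
    optimizer: str = "adam"
    learning_rate: float = 9.65e-6
    batch_size: int = 8
    num_train_samples: int = 1_000  # longer RL training caused model collapse
    kl_coefficient: float = 0.01
    max_response_length: int = 128
    use_lora: bool = True

sft, rl = SFTConfig(), RLConfig()
```

Frozen dataclasses keep the reported values immutable, so a typo elsewhere in an experiment script cannot silently alter them.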