MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs

Authors: Xuannan Liu, Zekun Li, Pei Li, Huaibo Huang, Shuhan Xia, Xing Cui, Linzhi Huang, Weihong Deng, Zhaofeng He

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We further conduct an extensive evaluation of 6 prevalent detection methods and 15 Large Vision-Language Models (LVLMs) on MMFakeBench under a zero-shot setting. The results indicate that current methods struggle under this challenging and realistic mixed-source MMD setting. Additionally, we propose MMD-Agent, a novel approach to integrate the reasoning, action, and tool-use capabilities of LVLM agents, significantly enhancing accuracy and generalization.
Researcher Affiliation | Academia | Beijing University of Posts and Telecommunications; University of California, Santa Barbara; Center for Research on Intelligent Perception and Computing, NLPR, CASIA
Pseudocode | No | The paper describes the MMD-Agent framework and its stages (textual veracity check, visual veracity check, and cross-modal consistency reasoning) and illustrates them with a diagram in Figure 4, but it does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | We open-source datasets and detection codes but do not release data generation codes for safety.
Open Datasets | Yes | We introduce MMFakeBench, the first comprehensive benchmark for evaluating mixed-source MMD. ... (Yue et al., 2024). ... FEVER (Thorne et al., 2018) ... Politifact (Shu et al., 2020) ... Gossipcop (Shu et al., 2020) ... Snopes (Hanselowski et al., 2019) ... MOCHEG (Yao et al., 2023) ... LLMFake (Chen & Shu, 2024) ... EMU (Da et al., 2021) ... Fakeddit (Nakamura et al., 2020) ... MAIM (Jaiswal et al., 2017) ... MEIR (Sabir et al., 2018) ... NewsCLIPpings (Luo et al., 2021) ... COSMOS (Aneja et al., 2023) ... DGM4 (Shao et al., 2023) ... MS-COCO (Lin et al., 2014) and Visual News datasets (Liu et al., 2021). ... COCO-Counterfactuals (Le et al., 2023). ... All datasets provided in this work are licensed under the Attribution Non-Commercial Share Alike 4.0 International (CC BY-NC-SA 4.0) license.
Dataset Splits | Yes | MMFakeBench consists of 11,000 image-text pairs, which are divided into a validation set and a test set following (Yue et al., 2024). The validation set, comprising 1,000 image-text pairs, is intended for hyperparameter selection, while the test set contains 10,000 pairs.
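The 1,000 / 10,000 partition described above can be sketched in a few lines. This is a hypothetical reconstruction, not the paper's released split: the shuffling, the fixed seed, and the use of integer ids are all assumptions for illustration.

```python
import random

# Hypothetical sketch: partition 11,000 image-text pair ids into a
# 1,000-pair validation set (hyperparameter selection) and a
# 10,000-pair test set, as the benchmark describes.
pair_ids = list(range(11_000))
rng = random.Random(0)  # fixed seed so the split is reproducible
rng.shuffle(pair_ids)

val_ids = pair_ids[:1_000]    # validation set
test_ids = pair_ids[1_000:]   # test set

assert len(val_ids) == 1_000 and len(test_ids) == 10_000
```

The actual benchmark ships a fixed split, so in practice one would load the released id lists rather than re-shuffle.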
Hardware Specification | Yes | All experiments are performed on eight NVIDIA GeForce 3090 GPUs with PyTorch.
Software Dependencies | Yes | All experiments are performed on eight NVIDIA GeForce 3090 GPUs with PyTorch. ... As for ChatGPT models, we use GPT-3.5 (gpt-3.5-turbo) or GPT-4 (gpt-4-vision-preview) as generators or detectors. As for text-to-image models, we use DALL-E (DALL-E 3), Stable Diffusion (Stable Diffusion XL), and Midjourney (Midjourney V6).
Experiment Setup | Yes | To ensure a fair evaluation, we set the sampling hyperparameters of the off-the-shelf LVLMs (do_sample = False or temperature = 0) to guarantee consistency in the predicted outputs. We adopt the default settings for other hyperparameters, such as max_new_tokens = 512.
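The deterministic decoding the setup relies on (do_sample = False, equivalently temperature → 0) reduces to greedy decoding: each step takes the argmax over the logits, so repeated runs yield identical outputs. A minimal self-contained illustration follows; the toy logits and the sample_token helper are hypothetical, not from the paper or any specific LVLM library.

```python
import math
import random

def sample_token(logits, do_sample=False, temperature=1.0, rng=None):
    """Pick a next-token id from raw logits.

    With do_sample=False (or temperature == 0) this is greedy
    decoding: it always returns the argmax, so the output is
    deterministic across runs -- the behavior the evaluation
    setup requires for consistent predictions.
    """
    if not do_sample or temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise, temperature-scaled softmax sampling.
    probs = [math.exp(l / temperature) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    rng = rng or random.Random()
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Toy logits for four candidate tokens (illustration only).
logits = [0.1, 2.5, 0.3, 1.9]

# Greedy decoding: identical result on every call.
greedy = [sample_token(logits, do_sample=False) for _ in range(5)]
print(greedy)  # [1, 1, 1, 1, 1]
```

In practice, frameworks such as Hugging Face transformers expose the same switch via `model.generate(..., do_sample=False, max_new_tokens=512)`.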