DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts

Authors: Tobias Braun, Mark Rothermel, Marcus Rohrbach, Anna Rohrbach

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluation on the popular benchmarks VERITE, AVERITEC, and MOCHEG shows that DEFAME surpasses all previous methods, establishing itself as the new general state-of-the-art fact-checking system for uni- and multimodal fact-checking. Moreover, we introduce a new multimodal benchmark, CLAIMREVIEW2024+, featuring claims dated after the knowledge cutoff of GPT-4O, avoiding data leakage. Here, DEFAME drastically outperforms the GPT-4O baselines, showing temporal generalizability and the potential for real-time fact-checking.
Researcher Affiliation | Academia | Technical University of Darmstadt & hessian.AI, Germany. Correspondence to: Mark Rothermel <EMAIL>.
Pseudocode | Yes | F. Formal Representation of DEFAME. We include a formalization of the DEFAME pipeline to clarify the role of each stage and its iterative structure. Let T and I denote the spaces of text and images, respectively. Define M := (T ∪ I)* as the space of multimodal sequences, and let Y denote the space of verdict labels. Then, DEFAME is a function F : M → M × Y, F(c) = (R_out, y_out), where, given a claim c ∈ M, the output consists of a report R_out containing the full fact-check and a predicted verdict y_out. DEFAME proceeds iteratively for up to N steps. We denote each iteration by F_iter : M → M × Y, so that (R^(i+1), y^(i+1)) := F_iter(R^(i)), i.e., an (incomplete) report R^(i) ∈ M is extended with new actions, evidence, and elaboration, resulting in report R^(i+1) and intermediate verdict y^(i+1). We can decompose F_iter into the five individual pipeline stages: F_iter := S5 ∘ S4 ∘ S3 ∘ S2 ∘ S1.
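The iterative structure above can be sketched in a few lines of Python. This is only an illustration of the formalization, not the paper's implementation: the stage functions, the `"NEI"`-based early-stopping check, and the list-based report representation are all assumptions made for the sketch.

```python
from typing import Callable, List, Optional, Tuple

Report = List  # a multimodal report: a sequence of text/image items
Verdict = Optional[str]
Stage = Callable[[Report], Tuple[Report, Verdict]]

def make_defame(stages: List[Stage], n_max: int):
    """Compose the five stage functions S1..S5 into F_iter, then iterate up to N times."""
    def f_iter(report: Report) -> Tuple[Report, Verdict]:
        verdict: Verdict = None
        for stage in stages:             # F_iter := S5 ∘ S4 ∘ S3 ∘ S2 ∘ S1
            report, verdict = stage(report)
        return report, verdict

    def defame(claim: Report) -> Tuple[Report, Verdict]:
        report, verdict = claim, None
        for _ in range(n_max):           # up to N iterations
            report, verdict = f_iter(report)
            if verdict is not None and verdict != "NEI":
                break                    # assumed stopping criterion for the sketch
        return report, verdict

    return defame
```

Each stage takes the (incomplete) report and returns the extended report plus an intermediate verdict, matching the F_iter signature in the formalization.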
Open Source Code | Yes | We released the code and benchmark dataset publicly at: https://github.com/multimodal-ai-lab/DEFAME/tree/icml
Open Datasets | Yes | Evaluation on the popular benchmarks VERITE, AVERITEC, and MOCHEG shows that DEFAME surpasses all previous methods, establishing itself as the new general state-of-the-art fact-checking system for uni- and multimodal fact-checking. Moreover, we introduce a new multimodal benchmark, CLAIMREVIEW2024+, featuring claims dated after the knowledge cutoff of GPT-4O, avoiding data leakage. Here, DEFAME drastically outperforms the GPT-4O baselines, showing temporal generalizability and the potential for real-time fact-checking.
Dataset Splits | Yes | AVERITEC (Schlichtkrull et al., 2024b) is a popular text-only, real-world-based benchmark. The development set consists of 500 claims: 305 Refuted, 122 Supported, 35 NEI (Not Enough Information), and 38 claims with the C/CP (Conflicting/Cherrypicking) label, which designates claims with conflicting evidence or claims that are technically true but lack context. We retrieve evidence from the benchmark-complementary Knowledge Base (KB), which contains the necessary evidence along with approximately 1,000 unrelated resources to simulate open web search. Thus, for AVERITEC, the Web Search Tool does not use the Serper API but rather a semantic search, yielding 5 results for each query. Each query to the KB is encoded using gte-base-en-v1.5 (Alibaba-NLP, 2024); the documents closest to the search query are retrieved via k-nearest-neighbor search. We report accuracy over all 4 classes.

MOCHEG (Yao et al., 2023) features textual claims paired with text-image evidence. Its multimodal nature qualifies it as a benchmark for DEFAME. Out of the 2,001 unique claims in the test set, we choose the 1,689 claims that have a final ruling, useful for assessing the quality of generated justifications (Appendix M). That subset includes 667 Refuted, 522 NEI, and 500 Supported claims. We evaluate performance using accuracy, equivalent to micro-F1 (Appendix J).

VERITE (Papadopoulos et al., 2024b) is an image-text verification benchmark focused on Out-Of-Context (OOC) scenarios. After removing 13 incomplete instances, VERITE comprises 1,001 samples, sourced partly from fact-checking platforms and partly generated by swapping images or altering captions. The dataset includes 338 True, 325 OOC, and 338 Miscaptioned claims (OOC and Miscaptioned claims differ in construction, but both involve out-of-context imagery). Following Papadopoulos et al. (2024a), we report accuracy for True vs. OOC and True vs. Miscaptioned, as well as a merged True vs. False setup. [...]
The dataset consists of 160 unimodal (text-only) and 140 multimodal (text-image) claims (cf. Fig. 3). [...] The final label distribution is: 129 Refuted, 89 Supported, 61 Misleading, and 21 NEI.
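The KB lookup described for AVERITEC (encode the query with gte-base-en-v1.5, return the 5 nearest documents) amounts to a cosine-similarity k-nearest-neighbor search over precomputed embeddings. A minimal NumPy sketch, assuming queries and documents have already been embedded (in the paper, gte-base-en-v1.5 performs that encoding step):

```python
import numpy as np

def knn_search(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k documents nearest to the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)                      # normalize query
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True) # normalize docs row-wise
    sims = d @ q                                                   # cosine similarities
    return np.argsort(-sims)[:k]                                   # top-k, most similar first
```

With k=5 this mirrors the "5 results per search" setting; at the KB's scale (~1,000 documents plus evidence) a brute-force scan like this is entirely adequate.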
Hardware Specification | Yes | On our end, we employed four NVIDIA A100-80GB GPUs to run LLAVA-1V and GEOCLIP. All other processing was performed on 32 AMD EPYC 7313 16-core CPUs.
Software Dependencies | No | The paper mentions software components such as GPT-4O, GPT-4O MINI, LLAVA-ONEVISION (1V), LLAMA 4 SCOUT, the Serper API, the Google Vision API, Firecrawl, and gte-base-en-v1.5. While some versions are implied for the MLLMs (e.g., (7B) for LLAVA-1V, -v1.5 for gte-base-en), it does not provide specific version numbers for several key software components or libraries required to replicate the environment, such as Python or common ML frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | We chose GPT-4O and GPT-4O MINI as the backbone of DEFAME since they are the current state-of-the-art MLLMs. To account for open-source MLLMs, we also test LLAVA-ONEVISION (1V) (7B) (Li et al., 2024a) and LLAMA 4 SCOUT (Meta AI, 2025). DEFAME uses the MLLM without any fine-tuning, with temperature set to 0.01 and top-p to 0.9 to control response diversity. We limit the number of images per scraped web page to a maximum of 32 to avoid an excessive flood of images. DEFAME processes interleaved text-image inputs, preserving the original position of images within the text context, but any input exceeding the MLLM's maximum context window is truncated accordingly.
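The image cap and context-window truncation described above can be sketched as a pre-processing step over an interleaved text-image sequence. This is an illustration only: the character-count token counter and the fixed per-image token cost are stand-ins, not the actual MLLM tokenization.

```python
from typing import Callable, List, Tuple

Item = Tuple[str, object]  # ("text", str) or ("image", obj)

def prepare_input(items: List[Item],
                  max_images: int = 32,
                  max_tokens: int = 128_000,
                  count_tokens: Callable[[str], int] = len) -> List[Item]:
    """Cap the number of images and truncate interleaved input to the context window,
    preserving the original position of the surviving images within the text."""
    out: List[Item] = []
    images, used = 0, 0
    for kind, payload in items:
        if kind == "image":
            if images >= max_images:
                continue                 # drop images beyond the per-page cap
            images += 1
            cost = 1                     # assumed fixed token cost per image (stand-in)
        else:
            cost = count_tokens(payload)
        if used + cost > max_tokens:
            break                        # truncate once the context window is full
        out.append((kind, payload))
        used += cost
    return out
```

Everything after the truncation point is dropped, while earlier items keep their original interleaved order, matching the "truncated accordingly" behavior described in the setup.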