Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

LiveXiv - A Multi-Modal live benchmark based on Arxiv papers content

Authors: Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh, Wei Lin, Muhammad Jehanzeb Mirza, Leshem Choshen, Mikhail Yurochkin, Yuekai Sun, Assaf Arbelle, Leonid Karlinsky, Raja Giryes

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities, avoiding contamination. Lastly, in our commitment to high quality, we have collected and evaluated a manually verified subset.
Researcher Affiliation Collaboration 1 Faculty of Engineering, Tel-Aviv University; 2 IBM Research; 3 Department of Statistics, University of Michigan, USA; 4 JKU Linz, Austria; 5 MIT CSAIL; 6 MIT-IBM
Pseudocode No The paper describes methods and processes in narrative form within sections like '3.1 DATA ACQUISITION AND VQA GENERATION' and '3.2 FILTERING PHASE', but does not present any structured pseudocode or algorithm blocks.
Open Source Code Yes Our dataset is available online on Hugging Face and our code is available on GitHub.
Open Datasets Yes Our dataset is available online on Hugging Face and our code is available on GitHub.
Dataset Splits No The paper describes the creation and characteristics of the LiveXiv benchmark, which is used for evaluating existing LMMs. It does not define training/test/validation splits for this benchmark itself, as it functions as an evaluation set rather than a dataset for training a new model within the scope of this paper.
Hardware Specification No The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments or generating the benchmark data.
Software Dependencies No The paper mentions tools and models like 'Deep Search toolkit (Team, 2022)', 'GPT-4o', 'Claude-Sonnet', 'LLama-3.1-70B (Meta, 2024)', and 'Huggingface API' but does not specify software library dependencies with version numbers.
Experiment Setup Yes The VQA process involves two steps using GPT-4o. First, we input the figure and its caption to GPT-4o to generate a detailed description of the figure, employing a Chain-of-Thought (CoT) approach (Wei et al., 2022). Next, the detailed description and figure are fed back into GPT-4o, with prompts adapted from ConMe (Huang et al., 2024) to suit our scientific use case, enabling the generation of relevant VQA questions. For questions from the tables, we utilize the table's content directly, presenting both the image of the table and its data in markdown format to GPT-4o to produce questions that require common-sense reasoning and data manipulation. The automated nature of this process ensures a robust and comprehensive evaluation framework for LMMs, tailored to scientific literature specifics. Detailed prompt templates can be found in Appendix A.6.
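The two-step flow quoted above can be sketched as a small pipeline. This is a minimal illustration, not the paper's implementation: the prompt wording is placeholder text (the real templates are in the paper's Appendix A.6), and `ask` is a hypothetical callable standing in for a GPT-4o API call, injected so the flow can be exercised without network access.

```python
from typing import Callable, Dict

# `ask(prompt, image)` is an assumed stand-in for a multimodal GPT-4o call.
AskFn = Callable[[str, bytes], str]

def generate_figure_vqa(figure_image: bytes, caption: str,
                        ask: AskFn) -> Dict[str, str]:
    """Two-step figure VQA generation, per the described setup."""
    # Step 1: chain-of-thought description from the figure and its caption.
    description = ask(
        "Describe this figure step by step. Caption: " + caption,
        figure_image,
    )
    # Step 2: feed the description (with the figure) back in to generate
    # the actual VQA questions.
    questions = ask(
        "Given this description, write multiple-choice VQA questions:\n"
        + description,
        figure_image,
    )
    return {"description": description, "questions": questions}

def generate_table_vqa(table_image: bytes, table_markdown: str,
                       ask: AskFn) -> str:
    """Tables: present both the rendered image and the markdown content."""
    return ask(
        "Write questions requiring common-sense reasoning and data "
        "manipulation over this table:\n" + table_markdown,
        table_image,
    )
```

The key design point is the injected `ask` function: the same skeleton works against any multimodal model backend, and the two calls for figures make the intermediate CoT description inspectable for filtering.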