OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities
Authors: Lichang Chen, Hexiang Hu, Mingda Zhang, Yiwen Chen, Zifeng Wang, Yandong Li, Pranav Shyam, Tianyi Zhou, Heng Huang, Ming-Hsuan Yang, Boqing Gong
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer. Further analysis highlights differences in reasoning behavior and underscores the challenges of omni-modal AI alignment. |
| Researcher Affiliation | Collaboration | Google DeepMind; University of Maryland, College Park |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Methods are described in prose without code-like formatting or explicit labels such as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of the methodology described. Footnote 4 links only to examples: 'https://anonymous.4open.science/r/OmnixR-Examples-7961/'. |
| Open Datasets | No | The paper introduces OmnixR-SYNTH, derived from MMLU-Pro (Wang et al., 2024), and OmnixR-REAL, a manually collected and annotated dataset from YouTube. However, it does not explicitly state that the OmnixR datasets (SYNTH or REAL) themselves are publicly available, nor does it provide specific links or access information for them. Footnote 4 provides examples but not the full dataset. |
| Dataset Splits | Yes | We randomly sample 100 questions from each of the 14 categories in MMLU-Pro to construct OmnixR-SYNTH. From these, 100 videos are carefully selected to construct a high-quality set, OmnixR-REAL. Figure 2 also shows '#Test Samples: 400 #Each Modality: 100' for OmnixR-REAL, and '100 examples in each category, 1400 examples in each modality' for OmnixR-SYNTH. |
| Hardware Specification | No | The paper discusses the capabilities and limitations of different model APIs (Gemini, GPT-4o, Claude) regarding modalities and input limits, but it does not specify the hardware used by the authors to conduct their experiments or make API calls. |
| Software Dependencies | Yes | The versions of the Gemini models used in the paper are Gemini-1.5-Pro-001 and Gemini-1.5-Flash-001. The versions of the OpenAI models are gpt-4o-2024-05-13 and gpt-4o-mini-2024-07-18. The versions of the Claude models are claude-3-sonnet@20240229, claude-3-opus@20240229, and claude-3-haiku@20240307. |
| Experiment Setup | Yes | Following them, we use CoT with 0-shot as our standard setting, i.e., the prompt used for evaluation is 'Think step by step then output the answer in the format of "The answer is (X)" at the end.' For response generation, we follow the commonly used settings, temperature=0.7, top_p=0.9, and output length=1024, for all the models, i.e., the Gemini, Claude, and GPT models. |
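The dataset-splits row describes drawing 100 questions from each of the 14 MMLU-Pro categories (1,400 examples per modality). A minimal sketch of that per-category sampling, with seeding for reproducibility; `sample_per_category` and the field names are illustrative assumptions, since the paper's construction code is not released:

```python
import random

def sample_per_category(questions, categories, per_category=100, seed=0):
    """Randomly draw a fixed number of questions from each category.

    `questions` is a list of dicts with a 'category' key; the function
    name and schema are hypothetical, not from the paper's code.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    sampled = []
    for cat in categories:
        pool = [q for q in questions if q["category"] == cat]
        sampled.extend(rng.sample(pool, per_category))
    return sampled

# Toy demonstration: 14 categories, 100 draws each -> 1400 examples.
cats = [f"cat{i}" for i in range(14)]
bank = [{"category": c, "id": f"{c}-{j}"} for c in cats for j in range(150)]
subset = sample_per_category(bank, cats, per_category=100)
print(len(subset))  # 1400
```

Fixing the seed matters for a benchmark: it makes the sampled subset identical across runs, so different models are evaluated on the same 1,400 questions.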
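The experiment-setup row quotes the 0-shot CoT prompt and the decoding settings. A minimal sketch of how that setting could be wired up: the prompt string and the temperature/top_p/output-length values come from the paper, but the `GEN_CONFIG` key names and the `extract_answer` regex are assumptions, since no official evaluation harness is released:

```python
import re

# 0-shot CoT instruction quoted in the paper.
COT_PROMPT = ("Think step by step then output the answer in the format of "
              '"The answer is (X)" at the end.')

# Decoding settings reported for all models (key names are illustrative).
GEN_CONFIG = {"temperature": 0.7, "top_p": 0.9, "max_output_tokens": 1024}

def extract_answer(response: str):
    """Pull the final multiple-choice letter out of a model response.

    Takes the last match so intermediate reasoning that mentions
    candidate answers is ignored. MMLU-Pro options run A-J.
    """
    matches = re.findall(r"[Tt]he answer is \(?([A-J])\)?", response)
    return matches[-1] if matches else None

print(extract_answer("Let's reason... The answer is (C)."))  # C
```

Taking the last match is the usual convention for CoT evaluation, since the required answer format is appended at the end of the response.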