OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities

Authors: Lichang Chen, Hexiang Hu, Mingda Zhang, Yiwen Chen, Zifeng Wang, Yandong Li, Pranav Shyam, Tianyi Zhou, Heng Huang, Ming-Hsuan Yang, Boqing Gong

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer. Further analysis highlights differences in reasoning behavior across modalities and underscores the challenges of omni-modal AI alignment.
Researcher Affiliation | Collaboration | Google DeepMind¹; University of Maryland, College Park²
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Methods are described in prose without code-like formatting or explicit labels such as 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper does not provide an explicit statement or link for open-source code of the methodology described. Footnote 4 links only to examples: 'https://anonymous.4open.science/r/OmnixR-Examples-7961/'.
Open Datasets | No | The paper introduces OmnixR-SYNTH, derived from MMLU-Pro (Wang et al., 2024), and OmnixR-REAL, a manually collected and annotated dataset from YouTube. However, it does not explicitly state that the OmnixR datasets (SYNTH or REAL) themselves are publicly available, nor does it provide specific links or access information for them. Footnote 4 provides examples but not the full dataset.
Dataset Splits | Yes | We randomly sample 100 questions from each of the 14 categories in MMLU-Pro to construct OmnixR-SYNTH. From these, 100 videos are carefully selected to construct a high-quality set, OmnixR-REAL. Figure 2 also shows '#Test Samples: 400 #Each Modality: 100' for OmnixR-REAL, and '100 examples in each category, 1400 examples in each modality' for OmnixR-SYNTH.
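The sampling step described in this row (100 questions per category across 14 MMLU-Pro categories, yielding 1,400 examples) can be sketched as a simple stratified draw. This is an illustrative reconstruction, not the authors' code; the `category` key and the toy pool below are assumptions for the demo.

```python
import random
from collections import defaultdict

def sample_per_category(questions, per_category=100, seed=0):
    """Stratified sample: draw `per_category` questions from each category.

    `questions` is a list of dicts with a 'category' key (assumed schema).
    A fixed seed keeps the draw reproducible.
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for q in questions:
        by_cat[q["category"]].append(q)
    sampled = []
    for _, items in sorted(by_cat.items()):
        sampled.extend(rng.sample(items, per_category))
    return sampled

# Toy pool: 14 categories with 200 questions each (hypothetical data).
pool = [{"category": f"cat{i}", "id": j} for i in range(14) for j in range(200)]
subset = sample_per_category(pool, per_category=100)
print(len(subset))  # 1400, matching the reported OmnixR-SYNTH size per modality
```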
Hardware Specification | No | The paper discusses the capabilities and limitations of different model APIs (Gemini, GPT-4o, Claude) regarding modalities and input limits, but it does not specify the hardware used by the authors to conduct their experiments or make API calls.
Software Dependencies | Yes | The versions of the Gemini models used in this paper are Gemini-1.5-Pro-001 and Gemini-1.5-Flash-001. The versions of the OpenAI models used are gpt-4o-2024-05-13 and gpt-4o-mini-2024-07-18. The versions of the Claude models used are claude-3-sonnet@20240229, claude-3-opus@20240229, and claude-3-haiku@20240307.
Experiment Setup | Yes | Following them, we use CoT with 0-shot as our standard setting, i.e., the prompt used for evaluation is 'Think step by step then output the answer in the format of "The answer is (X)" at the end.' For response generation, we follow the commonly used settings, temperature=0.7, top_p=0.9, and output length=1024, for all the models, i.e., Gemini, Claude, and GPT models.
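The evaluation settings reported above (0-shot CoT prompt, temperature=0.7, top_p=0.9, 1024-token output) can be packaged as a request payload. This is a minimal sketch, not the authors' harness: the field names follow the OpenAI-style chat-completion schema, which is an assumption (the Gemini and Claude SDKs use slightly different parameter names), and `build_request` is a hypothetical helper.

```python
# Prompt suffix quoted from the paper's reported 0-shot CoT setting.
COT_SUFFIX = ('Think step by step then output the answer in the format of '
              '"The answer is (X)" at the end.')

# Decoding parameters reported in the paper, shared across all models.
GEN_CONFIG = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 1024}

def build_request(question: str, model: str = "gpt-4o-2024-05-13") -> dict:
    """Assemble a chat-completion request payload (OpenAI-style schema assumed)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": f"{question}\n{COT_SUFFIX}"}],
        **GEN_CONFIG,
    }

req = build_request("What is 2 + 2?")
print(req["temperature"], req["max_tokens"])  # 0.7 1024
```

The same payload dictionary could be passed to any of the listed model endpoints after renaming fields to match each provider's SDK.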