OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities
Authors: Lichang Chen, Hexiang Hu, Mingda Zhang, Yiwen Chen, Zifeng Wang, Yandong Li, Pranav Shyam, Tianyi Zhou, Heng Huang, Ming-Hsuan Yang, Boqing Gong
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer. Further analysis highlights differences in reasoning behavior and underscores the challenges of omni-modal AI alignment. |
| Researcher Affiliation | Collaboration | Google DeepMind; University of Maryland, College Park |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Methods are described in prose without code-like formatting or explicit labels such as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of the methodology described. Footnote 4 links only to examples: 'https://anonymous.4open.science/r/OmnixR-Examples-7961/'. |
| Open Datasets | No | The paper introduces OmnixR-SYNTH, derived from MMLU-Pro (Wang et al., 2024), and OmnixR-REAL, a manually collected and annotated dataset from YouTube. However, it does not explicitly state that the OmnixR datasets (SYNTH or REAL) themselves are publicly available, nor does it provide specific links or access information for them. Footnote 4 provides examples but not the full dataset. |
| Dataset Splits | Yes | We randomly sample 100 questions from each of the 14 categories in MMLU-Pro to construct OmnixR-SYNTH. From these, 100 videos are carefully selected to construct a high-quality set, OmnixR-REAL. Figure 2 also shows '#Test Samples: 400 #Each Modality: 100' for OmnixR-REAL, and '100 examples in each category, 1400 examples in each modality' for OmnixR-SYNTH. |
| Hardware Specification | No | The paper discusses the capabilities and limitations of different model APIs (Gemini, GPT-4o, Claude) regarding modalities and input limits, but it does not specify the hardware used by the authors to conduct their experiments or make API calls. |
| Software Dependencies | Yes | The versions of the Gemini models used in the paper are Gemini-1.5-Pro-001 and Gemini-1.5-Flash-001. The versions of the OpenAI models are gpt-4o-2024-05-13 and gpt-4o-mini-2024-07-18. The versions of the Claude models are claude-3-sonnet@20240229, claude-3-opus@20240229, and claude-3-haiku@20240307. |
| Experiment Setup | Yes | Following them, we use CoT with 0-shot as our standard setting, i.e., the prompt used for evaluation is 'Think step by step then output the answer in the format of "The answer is (X)" at the end.' For response generation, we follow the commonly used settings, temperature=0.7, top_p=0.9, and output length=1024, for all the models, i.e., the Gemini, Claude, and GPT models. |
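The dataset-splits row describes drawing 100 questions from each of the 14 MMLU-Pro categories (1,400 examples per modality). A minimal sketch of that per-category sampling, with seeding for reproducibility; `sample_per_category` and the field names are illustrative assumptions, since the paper's construction code is not released:

```python
import random

def sample_per_category(questions, categories, per_category=100, seed=0):
    """Randomly draw a fixed number of questions from each category.

    `questions` is a list of dicts with a 'category' key; the function
    name and schema are hypothetical, not from the paper's code.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    sampled = []
    for cat in categories:
        pool = [q for q in questions if q["category"] == cat]
        sampled.extend(rng.sample(pool, per_category))
    return sampled

# Toy demonstration: 14 categories, 100 draws each -> 1400 examples.
cats = [f"cat{i}" for i in range(14)]
bank = [{"category": c, "id": f"{c}-{j}"} for c in cats for j in range(150)]
subset = sample_per_category(bank, cats, per_category=100)
print(len(subset))  # 1400
```

Fixing the seed matters for a benchmark: it makes the sampled subset identical across runs, so different models are evaluated on the same 1,400 questions.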
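The experiment-setup row quotes the 0-shot CoT prompt and the decoding settings. A minimal sketch of how that setting could be wired up: the prompt string and the temperature/top_p/output-length values come from the paper, but the `GEN_CONFIG` key names and the `extract_answer` regex are assumptions, since no official evaluation harness is released:

```python
import re

# 0-shot CoT instruction quoted in the paper.
COT_PROMPT = ("Think step by step then output the answer in the format of "
              '"The answer is (X)" at the end.')

# Decoding settings reported for all models (key names are illustrative).
GEN_CONFIG = {"temperature": 0.7, "top_p": 0.9, "max_output_tokens": 1024}

def extract_answer(response: str):
    """Pull the final multiple-choice letter out of a model response.

    Takes the last match so intermediate reasoning that mentions
    candidate answers is ignored. MMLU-Pro options run A-J.
    """
    matches = re.findall(r"[Tt]he answer is \(?([A-J])\)?", response)
    return matches[-1] if matches else None

print(extract_answer("Let's reason... The answer is (C)."))  # C
```

Taking the last match is the usual convention for CoT evaluation, since the required answer format is appended at the end of the response.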