MixEval-X: Any-to-any Evaluations from Real-world Data Mixture
Authors: Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Qizhe Shieh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present the evaluation results, and the settings are detailed in Section C. ... The evaluation results of prominent models on MixEval-X Image2Text, Image2Text-Hard, and their subsets. ... The overall Elo scores of MMG models on the MixEval-X MMG subsets... The evaluation results of prominent models on Text2Action. |
| Researcher Affiliation | Academia | Jinjie Ni^a, Yifan Song^c, Deepanway Ghosal^f, Bo Li^b, David Junhao Zhang^a, Xiang Yue^d, Fuzhao Xue^a, Zian Zheng^e, Kaichen Zhang^b, Mahir Shah^a, Kabir Jain^a, Yang You^a, Michael Qizhe Shieh^a — ^a National University of Singapore, ^b Nanyang Technological University, ^c Peking University, ^d Carnegie Mellon University, ^e University of Waterloo, ^f Independent Researcher |
| Pseudocode | No | The paper describes the methodology using prose and diagrams (e.g., Figure 2: the overall pipeline for creating MixEval-X, and Section E: Adaptation-Rectification Prompts, which shows LLM prompts). It does not contain any clearly labeled pseudocode or algorithm blocks with structured, code-like steps. |
| Open Source Code | Yes | https://mixeval-x.github.io/ |
| Open Datasets | Yes | For MMU tasks, we construct large-scale multi-modal benchmark pools from existing community benchmarks... The benchmark pool composition is detailed in Section G. ... Section G BENCHMARK POOL DETAILS Image2Text: MMMU (Yue et al., 2024), MMBench (Liu et al., 2023b), SEED-Bench (Li et al., 2023b), SEED-Bench-2 (Li et al., 2024b), ChartQA (Masry et al., 2022), A-OKVQA (Schwenk et al., 2022), HallusionBench (Guan et al., 2024), MathVista (Lu et al., 2023), GQA (Hudson & Manning, 2019), MM-Vet (Yu et al., 2023b), ScienceQA (Saikh et al., 2022), DocVQA (Mathew et al., 2021), POPE (Li et al., 2023e), InfographicVQA (Mathew et al., 2022), Q-Bench (Wu et al., 2023), VizWiz (Gurari et al., 2018), and TextVQA (Singh et al., 2019) |
| Dataset Splits | Yes | Table 1 presents the statistics for the MixEval-X benchmarks. We regulate task count and input lengths for efficiency... Image2Text MMU: 2,000 tasks; Image2Text-Hard MMU: 1,000 tasks... To enhance model differentiation, we applied rejection sampling (Ni et al., 2024) to select more challenging MMU tasks while preserving real-world distribution alignment. The effectiveness of this strategy is demonstrated later by the low scores on the hard split in Section 3.1 |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. Section C, 'EVALUATION SETTINGS', states, 'We follow official settings for all open-source models to ensure fairness. For proprietary models, we use their official APIs,' but does not elaborate on the hardware used for these evaluations. |
| Software Dependencies | No | The paper mentions using NLTK tokenizer (Loper & Bird, 2002) and GPT-4 for certain tasks, but does not specify version numbers for these or any other ancillary software dependencies, which is required for reproducibility. |
| Experiment Setup | No | The paper describes the methodology for creating the Mix Eval-X benchmark and the evaluation process for models. Section C, 'EVALUATION SETTINGS,' states, 'We follow official settings for all open-source models to ensure fairness. For proprietary models, we use their official APIs.' However, it does not provide specific experimental setup details such as hyperparameters, learning rates, batch sizes, or optimizer settings for any models or processes described, as its focus is on evaluation rather than training or fine-tuning models. |
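The rejection-sampling step quoted in the Dataset Splits row (selecting more challenging tasks while preserving the source distribution) can be sketched as follows. This is an illustrative sketch only, not the paper's actual procedure: the function name `rejection_sample_hard` and the per-task `difficulty` scores are hypothetical stand-ins for whatever hardness signal the authors used.

```python
import random


def rejection_sample_hard(tasks, difficulty, keep_n, seed=0):
    """Toy rejection sampling: retain harder tasks more often.

    tasks      -- list of task identifiers
    difficulty -- dict mapping task -> score in (0, 1], e.g. the
                  fraction of reference models that fail the task
                  (a hypothetical hardness signal for this sketch)
    keep_n     -- number of tasks to retain; assumes enough tasks
                  with difficulty > 0 exist to reach this count
    """
    rng = random.Random(seed)
    pool = list(tasks)
    selected = []
    while pool and len(selected) < keep_n:
        candidate = rng.choice(pool)
        # Accept with probability equal to the difficulty score, so the
        # retained set is biased toward tasks that models tend to fail,
        # while sampling uniformly from the pool keeps the selection
        # anchored to the original task distribution.
        if rng.random() < difficulty[candidate]:
            selected.append(candidate)
            pool.remove(candidate)
    return selected
```

Because rejected candidates stay in the pool, every task keeps a nonzero chance of selection (proportional to its difficulty), which is what lets the hard split skew difficult without discarding the real-world task mix outright.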