MixEval-X: Any-to-any Evaluations from Real-world Data Mixture

Authors: Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Qizhe Shieh

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present the evaluation results, and the settings are detailed in Section C. ... The evaluation results of prominent models on MixEval-X Image2Text, Image2Text-Hard, and their subsets. ... The overall Elo scores of MMG models on the MixEval-X MMG subsets... The evaluation results of prominent models on Text2Action.
Researcher Affiliation | Academia | Jinjie Ni (a), Yifan Song (c), Deepanway Ghosal (f), Bo Li (b), David Junhao Zhang (a), Xiang Yue (d), Fuzhao Xue (a), Zian Zheng (e), Kaichen Zhang (b), Mahir Shah (a), Kabir Jain (a), Yang You (a), Michael Qizhe Shieh (a); (a) National University of Singapore, (b) Nanyang Technological University, (c) Peking University, (d) Carnegie Mellon University, (e) University of Waterloo, (f) Independent Researcher
Pseudocode | No | The paper describes its methodology in prose and diagrams (e.g., Figure 2, the overall pipeline for creating MixEval-X, and Section E, Adaptation-Rectification Prompts, which shows LLM prompts). It does not contain any clearly labeled pseudocode or algorithm blocks with structured, code-like steps.
Open Source Code | Yes | https://mixeval-x.github.io/
Open Datasets | Yes | For MMU tasks, we construct large-scale multi-modal benchmark pools from existing community benchmarks... The benchmark pool composition is detailed in Section G. ... Section G BENCHMARK POOL DETAILS. Image2Text: MMMU (Yue et al., 2024), MMBench (Liu et al., 2023b), SEED-Bench (Li et al., 2023b), SEED-Bench-2 (Li et al., 2024b), ChartQA (Masry et al., 2022), A-OKVQA (Schwenk et al., 2022), HallusionBench (Guan et al., 2024), MathVista (Lu et al., 2023), GQA (Hudson & Manning, 2019), MM-Vet (Yu et al., 2023b), ScienceQA (Saikh et al., 2022), DocVQA (Mathew et al., 2021), POPE (Li et al., 2023e), InfographicVQA (Mathew et al., 2022), Q-Bench (Wu et al., 2023), VizWiz (Gurari et al., 2018), and TextVQA (Singh et al., 2019)
Dataset Splits | Yes | Table 1 presents the statistics for the MixEval-X benchmarks. We regulate task count and input lengths for efficiency... Image2Text MMU 2,000, Image2Text-Hard MMU 1,000... To enhance model differentiation, we applied rejection sampling (Ni et al., 2024) to select more challenging MMU tasks while preserving real-world distribution alignment. The effectiveness of this strategy is demonstrated later by the low scores on the hard split in Section 3.1
Hardware Specification | No | The paper does not specify the hardware (GPU/CPU models, processor speeds, or memory amounts) used to run its experiments. Section C, 'EVALUATION SETTINGS', states, 'We follow official settings for all open-source models to ensure fairness. For proprietary models, we use their official APIs,' but does not describe the hardware behind these evaluations.
Software Dependencies | No | The paper mentions the NLTK tokenizer (Loper & Bird, 2002) and GPT-4 for certain tasks, but does not specify version numbers for these or any other software dependencies, which reproducibility would require.
Experiment Setup | No | The paper describes how the MixEval-X benchmark was created and how models were evaluated. Section C, 'EVALUATION SETTINGS', states, 'We follow official settings for all open-source models to ensure fairness. For proprietary models, we use their official APIs.' It does not report hyperparameters, learning rates, batch sizes, or optimizer settings for any model or process, as its focus is evaluation rather than training or fine-tuning.
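
The rejection-sampling step cited in the Dataset Splits row (selecting harder tasks while preserving the real-world distribution) can be sketched as below. This is a minimal illustration, not the paper's implementation: the function name, the `category`/`id` fields, and the 0.5 accuracy threshold are all hypothetical, assuming only that per-task model accuracies are available.

```python
import random

def rejection_sample_hard_tasks(tasks, accuracy, max_acc=0.5, n_keep=1000, seed=0):
    """Hypothetical sketch: keep tasks that models answer poorly,
    while preserving the source pool's category distribution.

    tasks    -- list of dicts, each with 'id' and 'category' keys
    accuracy -- dict mapping task id -> mean model accuracy in [0, 1]
    max_acc  -- reject tasks solved more often than this threshold
    """
    rng = random.Random(seed)

    # Category mix of the full pool: the distribution to preserve.
    target = {}
    for t in tasks:
        target[t["category"]] = target.get(t["category"], 0) + 1
    total = len(tasks)

    # Rejection step: drop tasks that models find too easy.
    hard = [t for t in tasks if accuracy[t["id"]] <= max_acc]

    # Fill each category's quota in proportion to the original mix.
    selected = []
    for cat, count in target.items():
        quota = round(n_keep * count / total)
        pool = [t for t in hard if t["category"] == cat]
        selected.extend(rng.sample(pool, min(quota, len(pool))))
    return selected
```

Sampling per category (rather than globally over the hard pool) is what keeps the selected hard split aligned with the original task distribution, matching the "preserving real-world distribution alignment" claim quoted above.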