MixEval-X: Any-to-any Evaluations from Real-world Data Mixture

Authors: Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Qizhe Shieh

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present the evaluation results, and the settings are detailed in Section C. ... The evaluation results of prominent models on MixEval-X Image2Text, Image2Text-Hard, and their subsets. ... The overall Elo scores of MMG models on the MixEval-X MMG subsets... The evaluation results of prominent models on Text2Action.
Researcher Affiliation | Academia | Jinjie Ni (a), Yifan Song (c), Deepanway Ghosal (f), Bo Li (b), David Junhao Zhang (a), Xiang Yue (d), Fuzhao Xue (a), Zian Zheng (e), Kaichen Zhang (b), Mahir Shah (a), Kabir Jain (a), Yang You (a), Michael Qizhe Shieh (a); (a) National University of Singapore, (b) Nanyang Technological University, (c) Peking University, (d) Carnegie Mellon University, (e) University of Waterloo, (f) Independent Researcher
Pseudocode | No | The paper describes its methodology in prose and diagrams (e.g., Figure 2, the overall pipeline for creating MixEval-X, and Section E, Adaptation-Rectification Prompts, which shows LLM prompts). It does not contain any clearly labeled pseudocode or algorithm blocks with structured, code-like steps.
Open Source Code | Yes | https://mixeval-x.github.io/
Open Datasets | Yes | For MMU tasks, we construct large-scale multi-modal benchmark pools from existing community benchmarks... The benchmark pool composition is detailed in Section G. ... Section G BENCHMARK POOL DETAILS. Image2Text: MMMU (Yue et al., 2024), MMBench (Liu et al., 2023b), SEED-Bench (Li et al., 2023b), SEED-Bench-2 (Li et al., 2024b), ChartQA (Masry et al., 2022), A-OKVQA (Schwenk et al., 2022), HallusionBench (Guan et al., 2024), MathVista (Lu et al., 2023), GQA (Hudson & Manning, 2019), MM-Vet (Yu et al., 2023b), ScienceQA (Saikh et al., 2022), DocVQA (Mathew et al., 2021), POPE (Li et al., 2023e), InfographicVQA (Mathew et al., 2022), Q-Bench (Wu et al., 2023), VizWiz (Gurari et al., 2018), and TextVQA (Singh et al., 2019)
Dataset Splits | Yes | Table 1 presents the statistics for the MixEval-X benchmarks. We regulate task count and input lengths for efficiency... Image2Text MMU 2,000, Image2Text-Hard MMU 1,000... To enhance model differentiation, we applied rejection sampling (Ni et al., 2024) to select more challenging MMU tasks while preserving real-world distribution alignment. The effectiveness of this strategy is demonstrated later by the low scores on the hard split in Section 3.1
Hardware Specification | No | The paper does not specify the hardware (GPU/CPU models, processor speeds, or memory amounts) used to run its experiments. Section C, 'EVALUATION SETTINGS', states, 'We follow official settings for all open-source models to ensure fairness. For proprietary models, we use their official APIs,' but does not describe the hardware behind these evaluations.
Software Dependencies | No | The paper mentions the NLTK tokenizer (Loper & Bird, 2002) and GPT-4 for certain tasks, but does not specify version numbers for these or any other software dependencies, which reproducibility would require.
Experiment Setup | No | The paper describes how the MixEval-X benchmark was created and how models were evaluated. Section C, 'EVALUATION SETTINGS', states, 'We follow official settings for all open-source models to ensure fairness. For proprietary models, we use their official APIs.' It does not report hyperparameters, learning rates, batch sizes, or optimizer settings for any model or process, as its focus is evaluation rather than training or fine-tuning.
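
The rejection-sampling step cited in the Dataset Splits row (selecting harder tasks while preserving the real-world distribution) can be sketched as below. This is a minimal illustration, not the paper's implementation: the function name, the `category`/`id` fields, and the 0.5 accuracy threshold are all hypothetical, assuming only that per-task model accuracies are available.

```python
import random

def rejection_sample_hard_tasks(tasks, accuracy, max_acc=0.5, n_keep=1000, seed=0):
    """Hypothetical sketch: keep tasks that models answer poorly,
    while preserving the source pool's category distribution.

    tasks    -- list of dicts, each with 'id' and 'category' keys
    accuracy -- dict mapping task id -> mean model accuracy in [0, 1]
    max_acc  -- reject tasks solved more often than this threshold
    """
    rng = random.Random(seed)

    # Category mix of the full pool: the distribution to preserve.
    target = {}
    for t in tasks:
        target[t["category"]] = target.get(t["category"], 0) + 1
    total = len(tasks)

    # Rejection step: drop tasks that models find too easy.
    hard = [t for t in tasks if accuracy[t["id"]] <= max_acc]

    # Fill each category's quota in proportion to the original mix.
    selected = []
    for cat, count in target.items():
        quota = round(n_keep * count / total)
        pool = [t for t in hard if t["category"] == cat]
        selected.extend(rng.sample(pool, min(quota, len(pool))))
    return selected
```

Sampling per category (rather than globally over the hard pool) is what keeps the selected hard split aligned with the original task distribution, matching the "preserving real-world distribution alignment" claim quoted above.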