RBench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation

Authors: Meng-Hao Guo, Jiajun Xu, Yi Zhang, Jiaxi Song, Haoyang Peng, Yi-Xuan Deng, Xinzhi Dong, Kiyohiro Nakayama, Zhengyang Geng, Chen Wang, Bolin Ni, Guo-Wei Yang, Yongming Rao, Houwen Peng, Han Hu, Gordon Wetzstein, Shi-Min Hu

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate many models, such as o1, GPT-4o, DeepSeek-R1, etc. Experimental results indicate that advanced models perform poorly on complex reasoning, especially multimodal reasoning.
Researcher Affiliation | Collaboration | Tsinghua University, Stanford University, CMU, University of Pennsylvania, Tencent Hunyuan X, Fitten.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes a pipeline visually in Fig. 2 and provides an example of a CoT prompt in the appendix, but neither constitutes a general algorithm block.
Open Source Code | Yes | Data and code are made publicly available at here. ... Later, we will make the data and code available, hoping to provide guidance and insight for the development of foundation models. ... All data and code are publicly released to foster transparency and reproducibility, making R-Bench a valuable asset for the community.
Open Datasets | Yes | Data and code are made publicly available at here. ... After completing the aforementioned steps, we develop RBench, a graduate-level, multi-discipline, multilingual benchmark... ... Later, we will make the data and code available... ... All data and code are publicly released to foster transparency and reproducibility, making R-Bench a valuable asset for the community.
Dataset Splits | No | The paper defines RBench-T (1,094 questions across 108 subjects for language models) and RBench-M (665 questions across 83 subjects for multimodal models) as evaluation benchmarks. However, it does not specify any training/validation/test splits for models *using* this benchmark, as it is primarily for evaluation rather than training.
Hardware Specification | No | The paper mentions utilizing API calls and deploying open-source models locally using vLLM and VLMEvalKit, but it does not specify any hardware details such as GPU models, CPU types, or memory specifications used for these deployments or API calls.
Software Dependencies | No | The paper mentions using tools like GPT-4o, Mathpix, vLLM (Kwon et al., 2023), OpenCompass (Contributors, 2023), and VLMEvalKit (Duan et al., 2024). However, it does not provide specific version numbers for any of these software components or libraries; the years in the citations for vLLM, OpenCompass, and VLMEvalKit are publication years, not software versions.
Experiment Setup | Yes | For API calls, we utilize the official interfaces with default hyperparameters. For open-source models, we deploy their weights locally using vLLM (Kwon et al., 2023), setting the temperature to 0 while keeping all other parameters at their default values. The evaluation was conducted using the tools provided by OpenCompass (Contributors, 2023). In all tests, the CoT prompt is used by default. For details on the specific prompts, please refer to our appendix.
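The quoted setup delegates scoring to OpenCompass; as a rough illustration of what such an evaluation step involves (not the paper's actual code, and with hypothetical helper names), the sketch below extracts the final multiple-choice letter from a CoT-style response and computes accuracy against gold labels.

```python
import re

def extract_choice(response):
    # Take the last standalone option letter (A-D) in the response,
    # assuming the final answer follows the chain-of-thought reasoning.
    matches = re.findall(r"\b([A-D])\b", response)
    return matches[-1] if matches else None

def accuracy(responses, gold_answers):
    # Fraction of responses whose extracted choice matches the gold label.
    correct = sum(extract_choice(r) == g
                  for r, g in zip(responses, gold_answers))
    return correct / len(gold_answers)
```

In practice OpenCompass uses more robust, prompt-specific answer extraction, but the structure (parse a final answer, then score exact match) is the same.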