RBench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation

Authors: Meng-Hao Guo, Jiajun Xu, Yi Zhang, Jiaxi Song, Haoyang Peng, Yi-Xuan Deng, Xinzhi Dong, Kiyohiro Nakayama, Zhengyang Geng, Chen Wang, Bolin Ni, Guo-Wei Yang, Yongming Rao, Houwen Peng, Han Hu, Gordon Wetzstein, Shi-Min Hu

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate many models, such as o1, GPT-4o, DeepSeek-R1, etc. Experimental results indicate that advanced models perform poorly on complex reasoning, especially multimodal reasoning.
Researcher Affiliation | Collaboration | Tsinghua University, Stanford University, CMU, University of Pennsylvania, Tencent Hunyuan X, Fitten.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes a pipeline visually in Fig. 2 and provides an example of a CoT prompt in the appendix, but neither constitutes a general algorithm block.
Open Source Code | Yes | Data and code are made publicly available at here. ... Later, we will make the data and code available, hoping to provide guidance and insight for the development of foundation models. ... All data and code are publicly released to foster transparency and reproducibility, making R-Bench a valuable asset for the community.
Open Datasets | Yes | Data and code are made publicly available at here. ... After completing the aforementioned steps, we develop RBench, a graduate-level, multi-discipline, multilingual benchmark... ... Later, we will make the data and code available... ... All data and code are publicly released to foster transparency and reproducibility, making R-Bench a valuable asset for the community.
Dataset Splits | No | The paper defines RBench-T (1,094 questions across 108 subjects for language models) and RBench-M (665 questions across 83 subjects for multimodal models) as evaluation benchmarks. However, it does not specify any training/validation/test splits for models *using* this benchmark, as it is primarily for evaluation rather than training.
Hardware Specification | No | The paper mentions utilizing API calls and deploying open-source models locally using vLLM and VLMEvalKit, but it does not specify any hardware details such as GPU models, CPU types, or memory specifications used for these deployments or API calls.
Software Dependencies | No | The paper mentions using tools like GPT-4o, Mathpix, vLLM (Kwon et al., 2023), OpenCompass (Contributors, 2023), and VLMEvalKit (Duan et al., 2024). However, it does not provide specific version numbers for any of these software components or libraries; the years in the citations for vLLM, OpenCompass, and VLMEvalKit are publication years, not software versions.
Experiment Setup | Yes | For API calls, we utilize the official interfaces with default hyperparameters. For open-source models, we deploy their weights locally using vLLM (Kwon et al., 2023), setting the temperature to 0 while keeping all other parameters at their default values. The evaluation was conducted using the tools provided by OpenCompass (Contributors, 2023). In all tests, the CoT prompt is used by default. For details on the specific prompts, please refer to our appendix.
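The quoted setup delegates scoring to OpenCompass; as a rough illustration of what such an evaluation step involves (not the paper's actual code, and with hypothetical helper names), the sketch below extracts the final multiple-choice letter from a CoT-style response and computes accuracy against gold labels.

```python
import re

def extract_choice(response):
    # Take the last standalone option letter (A-D) in the response,
    # assuming the final answer follows the chain-of-thought reasoning.
    matches = re.findall(r"\b([A-D])\b", response)
    return matches[-1] if matches else None

def accuracy(responses, gold_answers):
    # Fraction of responses whose extracted choice matches the gold label.
    correct = sum(extract_choice(r) == g
                  for r, g in zip(responses, gold_answers))
    return correct / len(gold_answers)
```

In practice OpenCompass uses more robust, prompt-specific answer extraction, but the structure (parse a final answer, then score exact match) is the same.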