MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation

Authors: Zhongshen Zeng, Pengguang Chen, Shu Liu, Haiyun Jiang, Jiaya Jia

ICLR 2025

Reproducibility Checklist (variable, result, and supporting excerpt from the paper)
Research Type: Experimental
"Our extensive analysis includes several state-of-the-art models from both open-source and commercial domains, uncovering fundamental deficiencies in their training and evaluation methodologies. Specifically, we found that the OpenAI o1 models, which possess characteristics of 'system-2' thinking, outperform the other SOTA models by more than 20 absolute points on our benchmark, supporting our deficiency hypothesis."
Researcher Affiliation: Collaboration
Zhongshen Zeng, Chinese University of Hong Kong (EMAIL); Pengguang Chen, Chinese University of Hong Kong (EMAIL); Shu Liu, SmartMore Co., Ltd. (EMAIL); Haiyun Jiang, Fudan University (EMAIL); Jiaya Jia, Chinese University of Hong Kong (EMAIL)
Pseudocode: No
The paper describes methods and processes such as data construction and evaluation in narrative form and through figures showing prompts, but it does not contain any formal pseudocode or algorithm blocks.
Open Source Code: Yes
"We introduce a novel evaluation principle, the accompanying open-source benchmark MR-GSM8K, and the metric MR-Score. (...) The same problems can be found in the MR-GSM8K.json file in our open-sourced repository."
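The MR-Score metric mentioned above combines performance on the benchmark's three diagnostic sub-tasks into a single number. A minimal sketch follows; the weights (0.2/0.3/0.5), the use of Matthews correlation for Task 1, and the clipping of negative MCC to zero are assumptions about the paper's definition, so the released repository should be treated as authoritative:

```python
def mr_score(t1_mcc: float, t2_acc: float, t3_acc: float) -> float:
    """Combine three sub-task scores into an overall MR-Score.

    t1_mcc: Matthews correlation on Task 1 (judging whether a solution is correct)
    t2_acc: accuracy on Task 2 (locating the first error step)
    t3_acc: accuracy on Task 3 (explaining the error reason)

    Weights below (0.2, 0.3, 0.5) are illustrative assumptions, not
    confirmed values from this report.
    """
    # Clip negative MCC to 0 so a worse-than-random judge contributes nothing.
    return 0.2 * max(t1_mcc, 0.0) + 0.3 * t2_acc + 0.5 * t3_acc
```

With this weighting, a model must do well on the harder diagnostic sub-tasks (error localization and explanation) to achieve a high overall score, which matches the benchmark's emphasis on meta-reasoning over answer accuracy alone.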
Open Datasets: Yes
"To prove our point, we applied our paradigm to the GSM8K dataset and developed the MR-GSM8K benchmark. Our benchmark, characterized by instances manually labeled by experts and rigorously reviewed, serves as a robust tool for both qualitative and quantitative assessments of language models."
Dataset Splits: Yes
"Table 1 presents the statistics of MR-GSM8K, illustrating the distribution of correct and incorrect solutions across the three different types of questions. (...) We merged the GSM8K training set with the GPT-4-generated diagnostic data, consisting of 5k incorrect solutions and 4k correct solutions."
Hardware Specification: No
The paper mentions evaluating models of various sizes ("from a few billion parameters, such as Qwen-v1.5-1.8B (...) to 70 billion parameters like Llama3-70B (...), and up to 236 billion parameters as seen in Deepseek-v2-236B") and setting inference temperatures, but it does not specify the hardware (GPU models, CPU models, or memory) used to run these experiments.
Software Dependencies: No
"For fine-tuning, we employed the QLoRA method (Dettmers et al., 2023), maintaining the same hyperparameters as used for MetaMath-70B." The paper mentions the QLoRA method but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup: Yes
"Each model was evaluated under a zero-shot setting to assess their ability to follow instructions and their mathematical reasoning capabilities. (...) To ensure reproducibility and minimize variance, the inference temperature was set to zero for all models except the o1 series, whose temperature is hardcoded to 1 in the API. (...) For fine-tuning, we employed the QLoRA method (Dettmers et al., 2023), maintaining the same hyperparameters as used for MetaMath-70B."
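The decoding rule described above (greedy decoding everywhere, except the o1 series whose API fixes temperature to 1) can be captured in a small helper. The prefix-based model-name check is an illustrative assumption, not something the paper specifies:

```python
def inference_temperature(model_name: str) -> float:
    """Return the sampling temperature used for evaluating a given model.

    Per the setup quoted above: temperature 0 for reproducibility and
    minimal variance, except o1-series models, whose API hardcodes
    temperature to 1.
    """
    # Assumed naming convention: o1-series models start with "o1"
    # (e.g. "o1-mini", "o1-preview").
    if model_name.startswith("o1"):
        return 1.0
    return 0.0
```

An evaluation harness would then pass `temperature=inference_temperature(model)` alongside the zero-shot prompt for each model under test.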