MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation

Authors: Zhongshen Zeng, Pengguang Chen, Shu Liu, Haiyun Jiang, Jiaya Jia

ICLR 2025

Reproducibility Checklist (variable, result, and supporting excerpt from the paper)
Research Type: Experimental
"Our extensive analysis includes several state-of-the-art models from both open-source and commercial domains, uncovering fundamental deficiencies in their training and evaluation methodologies. Specifically, we found that the OpenAI o1 models, which possess characteristics of 'system-2' thinking, outperform the other SOTA models by more than 20 absolute points on our benchmark, supporting our deficiency hypothesis."
Researcher Affiliation: Collaboration
Zhongshen Zeng, Chinese University of Hong Kong (EMAIL); Pengguang Chen, Chinese University of Hong Kong (EMAIL); Shu Liu, SmartMore Co., Ltd. (EMAIL); Haiyun Jiang, Fudan University (EMAIL); Jiaya Jia, Chinese University of Hong Kong (EMAIL)
Pseudocode: No
The paper describes methods and processes such as data construction and evaluation in narrative form and through figures showing prompts, but it does not contain any formal pseudocode or algorithm blocks.
Open Source Code: Yes
"We introduce a novel evaluation principle, the accompanying open-source benchmark MR-GSM8K, and the metric MR-Score. (...) The same problems can be found in the MR-GSM8K.json file in our open-sourced repository."
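The MR-Score metric mentioned above combines performance on the benchmark's three diagnostic sub-tasks into a single number. A minimal sketch follows; the weights (0.2/0.3/0.5), the use of Matthews correlation for Task 1, and the clipping of negative MCC to zero are assumptions about the paper's definition, so the released repository should be treated as authoritative:

```python
def mr_score(t1_mcc: float, t2_acc: float, t3_acc: float) -> float:
    """Combine three sub-task scores into an overall MR-Score.

    t1_mcc: Matthews correlation on Task 1 (judging whether a solution is correct)
    t2_acc: accuracy on Task 2 (locating the first error step)
    t3_acc: accuracy on Task 3 (explaining the error reason)

    Weights below (0.2, 0.3, 0.5) are illustrative assumptions, not
    confirmed values from this report.
    """
    # Clip negative MCC to 0 so a worse-than-random judge contributes nothing.
    return 0.2 * max(t1_mcc, 0.0) + 0.3 * t2_acc + 0.5 * t3_acc
```

With this weighting, a model must do well on the harder diagnostic sub-tasks (error localization and explanation) to achieve a high overall score, which matches the benchmark's emphasis on meta-reasoning over answer accuracy alone.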
Open Datasets: Yes
"To prove our point, we applied our paradigm to the GSM8K dataset and developed the MR-GSM8K benchmark. Our benchmark, characterized by instances manually labeled by experts and rigorously reviewed, serves as a robust tool for both qualitative and quantitative assessments of language models."
Dataset Splits: Yes
"Table 1 presents the statistics of MR-GSM8K, illustrating the distribution of correct and incorrect solutions across the three different types of questions. (...) We merged the GSM8K training set with the GPT-4-generated diagnostic data, consisting of 5k incorrect solutions and 4k correct solutions."
Hardware Specification: No
The paper mentions evaluating models of various sizes ("from a few billion parameters, such as Qwen-v1.5-1.8B (...) to 70 billion parameters like Llama3-70B (...), and up to 236 billion parameters as seen in Deepseek-v2-236B") and setting inference temperatures, but it does not specify the hardware (GPU models, CPU models, or memory) used to run these experiments.
Software Dependencies: No
"For fine-tuning, we employed the QLoRA method (Dettmers et al., 2023), maintaining the same hyperparameters as used for MetaMath-70B." The paper mentions the QLoRA method but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup: Yes
"Each model was evaluated under a zero-shot setting to assess their ability to follow instructions and their mathematical reasoning capabilities. (...) To ensure reproducibility and minimize variance, the inference temperature was set to zero for all models except the o1 series, whose temperature is hardcoded to 1 in the API. (...) For fine-tuning, we employed the QLoRA method (Dettmers et al., 2023), maintaining the same hyperparameters as used for MetaMath-70B."
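The decoding rule described above (greedy decoding everywhere, except the o1 series whose API fixes temperature to 1) can be captured in a small helper. The prefix-based model-name check is an illustrative assumption, not something the paper specifies:

```python
def inference_temperature(model_name: str) -> float:
    """Return the sampling temperature used for evaluating a given model.

    Per the setup quoted above: temperature 0 for reproducibility and
    minimal variance, except o1-series models, whose API hardcodes
    temperature to 1.
    """
    # Assumed naming convention: o1-series models start with "o1"
    # (e.g. "o1-mini", "o1-preview").
    if model_name.startswith("o1"):
        return 1.0
    return 0.0
```

An evaluation harness would then pass `temperature=inference_temperature(model)` alongside the zero-shot prompt for each model under test.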