Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MATH-Perturb: Benchmarking LLMs’ Math Reasoning Abilities against Hard Perturbations
Authors: Kaixuan Huang, Jiacheng Guo, Zihao Li, Xiang Ji, Jiawei Ge, Wenzhe Li, Yingqing Guo, Tianle Cai, Hui Yuan, Runzhe Wang, Yue Wu, Ming Yin, Shange Tang, Yangsibo Huang, Chi Jin, Xinyun Chen, Chiyuan Zhang, Mengdi Wang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We observe significant performance drops on MATH-P-Hard across various models, including o1-mini (−16.49%) and gemini-2.0-flash-thinking (−12.9%). We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills without assessing their applicability to modified contexts. This issue is amplified when using original problems for in-context learning. We call for research efforts to address this challenge, which is critical for developing more robust and reliable reasoning models. The project is available here. |
| Researcher Affiliation | Collaboration | Kaixuan Huang 1 Jiacheng Guo 1 Zihao Li 1 Xiang Ji 1 Jiawei Ge 1 Wenzhe Li 1 Yingqing Guo 1 Tianle Cai 1 Hui Yuan 1 Runzhe Wang 1 Yue Wu 1 Ming Yin 1 Shange Tang 1 Yangsibo Huang 2 Chi Jin 1 Xinyun Chen 2 Chiyuan Zhang 2 Mengdi Wang 1 1Princeton University 2Google. |
| Pseudocode | No | The paper describes methods and problem-solving steps but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The project is available here. |
| Open Datasets | Yes | We choose the popular MATH benchmark (Hendrycks et al., 2021), which contains challenging mathematical reasoning problems sourced from American high school mathematics competitions such as the AMC 10, AMC 12, and AIME. ... We design and construct MATH-P-Simple (simple perturbation) and MATH-P-Hard (hard perturbation), each consisting of 279 perturbed math problems that originate from the level-5 (hardest) problems of the MATH dataset (Hendrycks et al., 2021). ... The project is available here. |
| Dataset Splits | Yes | We use level-5 problems from both the train split and the test split as the seed problems, so we are able to investigate whether language models behave differently on the two splits. ... After removing several annotations that failed the quality checks, we obtained 279 pairs of modifications, where 164 examples are from train split and 115 examples are from test split. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments, only general statements about evaluating LLMs. |
| Software Dependencies | No | The paper mentions using the 'sympy package' for checking equivalence, but does not specify its version number. While it lists models and their versions in Appendix A, these are the subjects of evaluation, not ancillary software dependencies in the sense of development tools or libraries. |
| Experiment Setup | No | We adopt zero-shot chain-of-thought (CoT) (Wei et al., 2022; Kojima et al., 2022) as the standard evaluation method on our benchmarks. For comparison, we also evaluate the models on the set of the original 279 problems, referred to as Original in the following subsections. We do not allow any tool usage including access to a code interpreter, as we find that many problems can be trivially solved by writing a brute-force search program. ... This describes the evaluation method and general constraints, but does not provide specific hyperparameters like learning rates, batch sizes, or optimizer settings for training or inference. |
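The Software Dependencies row notes that the paper uses the sympy package to check answer equivalence. A minimal sketch of how such a check might look is below; the function name `answers_equivalent` and the string-comparison fallback are assumptions for illustration, not the authors' actual implementation:

```python
from sympy import simplify, sympify


def answers_equivalent(a: str, b: str) -> bool:
    """Return True if two answer strings are symbolically equivalent.

    Parses both strings with sympy and checks whether their
    difference simplifies to zero, so e.g. "1/2" matches "0.5"
    and "x + x" matches "2*x".
    """
    try:
        return simplify(sympify(a) - sympify(b)) == 0
    except Exception:
        # Hypothetical fallback: if sympy cannot parse an answer,
        # fall back to an exact string comparison.
        return a.strip() == b.strip()
```

A grader along these lines accepts mathematically equal answers written in different forms, which matters for free-form benchmarks like MATH-P-Simple and MATH-P-Hard where models may emit fractions, decimals, or unsimplified expressions.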