Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MATH-Perturb: Benchmarking LLMs’ Math Reasoning Abilities against Hard Perturbations
Authors: Kaixuan Huang, Jiacheng Guo, Zihao Li, Xiang Ji, Jiawei Ge, Wenzhe Li, Yingqing Guo, Tianle Cai, Hui Yuan, Runzhe Wang, Yue Wu, Ming Yin, Shange Tang, Yangsibo Huang, Chi Jin, Xinyun Chen, Chiyuan Zhang, Mengdi Wang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We observe significant performance drops on MATH-P-Hard across various models, including o1-mini (−16.49%) and gemini-2.0-flash-thinking (−12.9%). We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills without assessing their applicability to modified contexts. This issue is amplified when using original problems for in-context learning. We call for research efforts to address this challenge, which is critical for developing more robust and reliable reasoning models. The project is available here. |
| Researcher Affiliation | Collaboration | Kaixuan Huang 1 Jiacheng Guo 1 Zihao Li 1 Xiang Ji 1 Jiawei Ge 1 Wenzhe Li 1 Yingqing Guo 1 Tianle Cai 1 Hui Yuan 1 Runzhe Wang 1 Yue Wu 1 Ming Yin 1 Shange Tang 1 Yangsibo Huang 2 Chi Jin 1 Xinyun Chen 2 Chiyuan Zhang 2 Mengdi Wang 1 1Princeton University 2Google. |
| Pseudocode | No | The paper describes methods and problem-solving steps but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The project is available here. |
| Open Datasets | Yes | We choose the popular MATH benchmark (Hendrycks et al., 2021), which contains challenging mathematical reasoning problems sourced from American high school mathematics competitions such as the AMC 10, AMC 12, and AIME. ... We design and construct MATH-P-Simple (simple perturbation) and MATH-P-Hard (hard perturbation), each consisting of 279 perturbed math problems that originate from the level-5 (hardest) problems of the MATH dataset (Hendrycks et al., 2021). ... The project is available here. |
| Dataset Splits | Yes | We use level-5 problems from both the train split and the test split as the seed problems, so we are able to investigate whether language models behave differently on the two splits. ... After removing several annotations that failed the quality checks, we obtained 279 pairs of modifications, where 164 examples are from train split and 115 examples are from test split. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments, only general statements about evaluating LLMs. |
| Software Dependencies | No | The paper mentions using the 'sympy package' for checking equivalence, but does not specify its version number. While it lists models and their versions in Appendix A, these are the subjects of evaluation, not ancillary software dependencies in the sense of development tools or libraries. |
| Experiment Setup | No | We adopt zero-shot chain-of-thought (CoT) (Wei et al., 2022; Kojima et al., 2022) as the standard evaluation method on our benchmarks. For comparison, we also evaluate the models on the set of the original 279 problems, referred to as Original in the following subsections. We do not allow any tool usage including access to a code interpreter, as we find that many problems can be trivially solved by writing a brute-force search program. ... This describes the evaluation method and general constraints, but does not provide specific hyperparameters like learning rates, batch sizes, or optimizer settings for training or inference. |
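The Software Dependencies row notes that the paper uses the sympy package to check answer equivalence. A minimal sketch of how such a check might look is below; the function name `answers_equivalent` and the string-comparison fallback are assumptions for illustration, not the authors' actual implementation:

```python
from sympy import simplify, sympify


def answers_equivalent(a: str, b: str) -> bool:
    """Return True if two answer strings are symbolically equivalent.

    Parses both strings with sympy and checks whether their
    difference simplifies to zero, so e.g. "1/2" matches "0.5"
    and "x + x" matches "2*x".
    """
    try:
        return simplify(sympify(a) - sympify(b)) == 0
    except Exception:
        # Hypothetical fallback: if sympy cannot parse an answer,
        # fall back to an exact string comparison.
        return a.strip() == b.strip()
```

A grader along these lines accepts mathematically equal answers written in different forms, which matters for free-form benchmarks like MATH-P-Simple and MATH-P-Hard where models may emit fractions, decimals, or unsimplified expressions.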