One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs

Authors: Yinghui Li, Jiayi Kuang, Haojing Huang, Zhikun Xu, Xinnian Liang, Yi Yu, Wenlian Lu, Yangning Li, Xiaoyu Tan, Chao Qu, Ying Shen, Hai-Tao Zheng, Philip S. Yu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments and detailed analyses demonstrate that COUNTERMATH is challenging, indicating that LLMs, such as OpenAI o1, have insufficient counterexample-driven proof capabilities. Moreover, our exploration into model training reveals that strengthening LLMs' counterexample-driven conceptual reasoning abilities is crucial for improving their overall mathematical capabilities."
Researcher Affiliation | Collaboration | 1 Tsinghua University (E-mail: EMAIL); 2 Sun Yat-sen University; 3 School of Mathematical Science, Fudan University; 4 ARC Lab, Arizona State University; 5 Bytedance Inc.; 6 Peng Cheng Laboratory; 7 INFLY TECH (Shanghai) Co., Ltd.; 8 Guangdong Provincial Key Laboratory of Fire Science and Intelligent Emergency Technology; 9 University of Illinois Chicago
Pseudocode | No | The paper describes methods such as data engineering and refinement and outlines evaluation metrics, but it does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "we propose COUNTERMATH, a counterexample-based mathematical reasoning benchmark." (https://github.com/THUKElab/COUNTERMATH)
Open Datasets | Yes | "we manually create a high-quality, university-level mathematical benchmark, COUNTERMATH, which requires LLMs to prove mathematical statements by providing counterexamples, thereby assessing their grasp of mathematical concepts." (https://github.com/THUKElab/COUNTERMATH)
Dataset Splits | No | The paper states that COUNTERMATH consists of 1,216 data samples and that 1,025 samples were obtained for training, but it does not specify explicit training, validation, or test splits.
Hardware Specification | Yes | "we fine-tune Qwen2.5-Math-Instruct-7B using supervised LoRA training on 2 L20 48GB GPUs, with a learning rate of 1.0e-5."
Software Dependencies | No | The paper names specific models (Qwen2.5-Math-Instruct-7B) and LoRA training, but does not provide version numbers for underlying software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "For model training, we select Qwen-2.5-Math-7B-Instruct, an open-source model known for its strong mathematical reasoning capabilities and general applicability. We fine-tune Qwen2.5-Math-Instruct-7B using supervised LoRA training on 2 L20 48GB GPUs, with a learning rate of 1.0e-5."
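The LoRA fine-tuning described in the experiment setup keeps the pretrained weights frozen and learns only a low-rank additive update. A minimal, dependency-free sketch of that core update rule follows; the matrices, rank, and scaling values here are toy illustrations, not the paper's actual configuration:

```python
# LoRA in one equation: the effective weight is W + (alpha / r) * B @ A,
# where W (d_out x d_in) stays frozen and only the low-rank factors
# B (d_out x r) and A (r x d_in) are trained. Toy values throughout.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha, r):
    """Apply the LoRA-adapted weight W + (alpha / r) * B @ A to input x."""
    scale = alpha / r
    delta = matmul(B, A)                      # low-rank update B @ A
    W_eff = [[w + scale * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta)]
    return matmul(W_eff, x)

# Toy 2x2 frozen weight with a rank-1 adapter (r = 1), alpha = 2.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (identity)
A = [[1.0, 1.0]]               # 1 x 2 down-projection
B = [[0.5], [0.5]]             # 2 x 1 up-projection
x = [[1.0], [1.0]]             # input as a column vector

y = lora_forward(x, W, A, B, alpha=2.0, r=1)
print(y)  # [[3.0], [3.0]]
```

Because only A and B are trained (r * (d_in + d_out) parameters instead of d_in * d_out), this is why the paper can fine-tune a 7B model on just two 48GB GPUs.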