One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs

Authors: Yinghui Li, Jiayi Kuang, Haojing Huang, Zhikun Xu, Xinnian Liang, Yi Yu, Wenlian Lu, Yangning Li, Xiaoyu Tan, Chao Qu, Ying Shen, Hai-Tao Zheng, Philip S. Yu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments and detailed analyses demonstrate that COUNTERMATH is challenging, indicating that LLMs, such as OpenAI o1, have insufficient counterexample-driven proof capabilities. Moreover, our exploration into model training reveals that strengthening LLMs' counterexample-driven conceptual reasoning abilities is crucial for improving their overall mathematical capabilities."
Researcher Affiliation | Collaboration | 1 Tsinghua University (E-mail: EMAIL); 2 Sun Yat-sen University; 3 School of Mathematical Science, Fudan University; 4 ARC Lab, Arizona State University; 5 Bytedance Inc.; 6 Peng Cheng Laboratory; 7 INFLY TECH (Shanghai) Co., Ltd.; 8 Guangdong Provincial Key Laboratory of Fire Science and Intelligent Emergency Technology; 9 University of Illinois Chicago
Pseudocode | No | The paper describes methods such as data engineering and refinement and outlines evaluation metrics, but it does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "we propose COUNTERMATH, a counterexample-based mathematical reasoning benchmark." (https://github.com/THUKElab/COUNTERMATH)
Open Datasets | Yes | "we manually create a high-quality, university-level mathematical benchmark, COUNTERMATH, which requires LLMs to prove mathematical statements by providing counterexamples, thereby assessing their grasp of mathematical concepts." (https://github.com/THUKElab/COUNTERMATH)
Dataset Splits | No | The paper states that COUNTERMATH consists of 1,216 data samples and that 1,025 samples were obtained for training, but it does not specify explicit training, validation, or test splits.
Hardware Specification | Yes | "we fine-tune Qwen2.5-Math-Instruct-7B using supervised LoRA training on 2 L20 48GB GPUs, with a learning rate of 1.0e-5."
Software Dependencies | No | The paper names specific models (Qwen2.5-Math-Instruct-7B) and LoRA training, but does not provide version numbers for underlying software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "For model training, we select Qwen-2.5-Math-7B-Instruct, an open-source model known for its strong mathematical reasoning capabilities and general applicability. We fine-tune Qwen2.5-Math-Instruct-7B using supervised LoRA training on 2 L20 48GB GPUs, with a learning rate of 1.0e-5."
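The LoRA fine-tuning described in the experiment setup keeps the pretrained weights frozen and learns only a low-rank additive update. A minimal, dependency-free sketch of that core update rule follows; the matrices, rank, and scaling values here are toy illustrations, not the paper's actual configuration:

```python
# LoRA in one equation: the effective weight is W + (alpha / r) * B @ A,
# where W (d_out x d_in) stays frozen and only the low-rank factors
# B (d_out x r) and A (r x d_in) are trained. Toy values throughout.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha, r):
    """Apply the LoRA-adapted weight W + (alpha / r) * B @ A to input x."""
    scale = alpha / r
    delta = matmul(B, A)                      # low-rank update B @ A
    W_eff = [[w + scale * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta)]
    return matmul(W_eff, x)

# Toy 2x2 frozen weight with a rank-1 adapter (r = 1), alpha = 2.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (identity)
A = [[1.0, 1.0]]               # 1 x 2 down-projection
B = [[0.5], [0.5]]             # 2 x 1 up-projection
x = [[1.0], [1.0]]             # input as a column vector

y = lora_forward(x, W, A, B, alpha=2.0, r=1)
print(y)  # [[3.0], [3.0]]
```

Because only A and B are trained (r * (d_in + d_out) parameters instead of d_in * d_out), this is why the paper can fine-tune a 7B model on just two 48GB GPUs.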