One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs
Authors: Yinghui Li, Jiayi Kuang, Haojing Huang, Zhikun Xu, Xinnian Liang, Yi Yu, Wenlian Lu, Yangning Li, Xiaoyu Tan, Chao Qu, Ying Shen, Hai-Tao Zheng, Philip S. Yu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments and detailed analyses demonstrate that COUNTERMATH is challenging, indicating that LLMs, such as OpenAI o1, have insufficient counterexample-driven proof capabilities. Moreover, our exploration into model training reveals that strengthening LLMs' counterexample-driven conceptual reasoning abilities is crucial for improving their overall mathematical capabilities. |
| Researcher Affiliation | Collaboration | 1 Tsinghua University (E-mail: EMAIL); 2 Sun Yat-sen University; 3 School of Mathematical Science, Fudan University; 4 ARC Lab, Arizona State University; 5 Bytedance Inc.; 6 Peng Cheng Laboratory; 7 INFLY TECH (Shanghai) Co., Ltd.; 8 Guangdong Provincial Key Laboratory of Fire Science and Intelligent Emergency Technology; 9 University of Illinois Chicago. |
| Pseudocode | No | The paper describes methods like data engineering and refinement, and outlines evaluation metrics, but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | we propose COUNTERMATH, a counterexample-based mathematical reasoning benchmark. https://github.com/THUKElab/COUNTERMATH |
| Open Datasets | Yes | we manually create a high-quality, university-level mathematical benchmark, COUNTERMATH, which requires LLMs to prove mathematical statements by providing counterexamples, thereby assessing their grasp of mathematical concepts. [...] https://github.com/THUKElab/COUNTERMATH |
| Dataset Splits | No | The paper states that COUNTERMATH consists of 1,216 data samples and that 1,025 samples were obtained for training, but it does not specify explicit training, validation, or test splits in its methodology. |
| Hardware Specification | Yes | For model training, we select Qwen-2.5-Math-7B-Instruct, an open-source model known for its strong mathematical reasoning capabilities and general applicability. [...] we fine-tune Qwen2.5-Math-Instruct-7B using supervised LoRA training on 2 L20 48GB GPUs, with a learning rate of 1.0e-5. |
| Software Dependencies | No | The paper mentions using specific models like Qwen2.5-Math-Instruct-7B and LoRA training, but does not provide specific version numbers for underlying software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For model training, we select Qwen-2.5-Math-7B-Instruct, an open-source model known for its strong mathematical reasoning capabilities and general applicability. [...] we fine-tune Qwen2.5-Math-Instruct-7B using supervised LoRA training on 2 L20 48GB GPUs, with a learning rate of 1.0e-5. |
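The supervised LoRA fine-tuning quoted above freezes the base model weights and trains only small low-rank adapter matrices. The following is a minimal NumPy sketch of the LoRA weight update for a single linear layer, not the authors' training code; the dimensions, rank `r`, and scaling factor `alpha` are illustrative assumptions, not values from the paper.

```python
import numpy as np

# LoRA replaces a frozen weight W with W_eff = W + (alpha / r) * B @ A,
# where A (r x d_in) and B (d_out x r) are the only trainable parameters.
# All shapes and hyperparameters here are illustrative, not from the paper.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 16, 32, 4, 8

W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = 0.01 * rng.standard_normal((r, d_in))   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized

W_eff = W + (alpha / r) * B @ A

# With B zero-initialized, the adapted layer starts identical to the base model,
# so fine-tuning begins from the pretrained behavior.
assert np.allclose(W_eff, W)
print(W_eff.shape)
```

Only `A` and `B` (r·d_in + d_out·r parameters) are updated during fine-tuning, which is why LoRA training of a 7B model fits on the two 48GB GPUs the paper reports.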