System-2 Mathematical Reasoning via Enriched Instruction Tuning

Authors: Huanqia Cai, Yijun Yang, Zhifeng Li

TMLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In experiments, EIT achieves an accuracy of 84.1% on GSM8K and 32.5% on MATH, surpassing state-of-the-art fine-tuning and prompting methods, and even matching the performance of tool-augmented methods. We evaluate EIT and compare it with finetuning-based, prompting-based, and tool-augmented methods on two widely used mathematical benchmarks, MATH and GSM8K. We conduct the ablation study on the MATH and GSM8K datasets based on LLaMA-2-70B in Table 3. |
| Researcher Affiliation | Collaboration | Huanqia Cai (EMAIL), Tencent; Yijun Yang (EMAIL), Tencent and University of Technology Sydney; Zhifeng Li (EMAIL), XIntelligence Technology Co., Ltd. |
| Pseudocode | No | The paper describes the Enriched Instruction Tuning (EIT) method, including Enriching with Reasoning Plan (ERP) and Enriching with Reasoning Step (ERS), through textual explanations and examples in Figure 2 and Examples 3.1, 3.2, and 3.3. However, it does not present these methods or any other procedures in a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper states: 'These datasets are then used to fine-tune open-source LLMs, thereby enhancing their own ability to execute system-2 mathematical reasoning without any usage of external tools.' This refers to the use of *existing* open-source LLMs (like LLaMA-2) but does not provide an explicit statement or link to the source code for the authors' specific EIT methodology. |
| Open Datasets | Yes | Given human-annotated mathematical instruction datasets such as GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021), the responses provided by human annotators are typically accurate yet sparse, omitting many of the implicit reasoning steps that humans adopt when solving complex tasks. |
| Dataset Splits | Yes | The MATH dataset collects a total of 12,500 competition-level mathematics problems, which are partitioned into 7,500 for training and 5,000 for testing. Each of them is accompanied by a step-by-step solution and concludes with a distinct final answer, which is formatted for a straightforward comparison with model-generated solutions. Following the same setting as prior works, we use 7,473 and 1,319 problems for training and testing, respectively. |
| Hardware Specification | Yes | 32 A100 GPUs are used to fine-tune the above models. |
| Software Dependencies | No | We use the open-source LLM LLaMA-2 as the base model for fine-tuning. GPT-4-1106-preview is used to generate enriched responses for constructing EITMath. Following the prior work (Yu et al., 2023; Xu et al., 2023), we adopt the AdamW optimizer to train the model with 3 epochs, and the learning rate is set to 2e-5. The batch size is 32 for the 70B model and 128 for the 13B model. |
| Experiment Setup | Yes | Following the prior work (Yu et al., 2023; Xu et al., 2023), we adopt the AdamW optimizer to train the model with 3 epochs, and the learning rate is set to 2e-5. The batch size is 32 for the 70B model and 128 for the 13B model. |
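The split counts reported under Dataset Splits can be sanity-checked arithmetically. This is a minimal sketch, not code from the paper; the `REPORTED_SPLITS` mapping and `total_problems` helper are illustrative names:

```python
# Reported dataset splits from the paper's Dataset Splits entry:
#   MATH:  12,500 problems = 7,500 train + 5,000 test
#   GSM8K: 7,473 train + 1,319 test (the prior-work setting)
REPORTED_SPLITS = {
    "MATH": {"train": 7500, "test": 5000},
    "GSM8K": {"train": 7473, "test": 1319},
}


def total_problems(dataset: str) -> int:
    """Sum the train and test counts for a dataset."""
    return sum(REPORTED_SPLITS[dataset].values())


# The MATH counts are internally consistent with the stated 12,500 total.
assert total_problems("MATH") == 12500
```

The check confirms the two MATH numbers add up to the stated total; the GSM8K total is not stated in the quoted passage, so it is left unverified here.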
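The hyperparameters reported under Experiment Setup can be collected into one configuration sketch. This is library-agnostic and hypothetical; `eit_training_config` is an illustrative helper, not code released by the authors:

```python
def eit_training_config(model_size: str) -> dict:
    """Return the reported EIT fine-tuning hyperparameters for a LLaMA-2 size.

    The paper reports the AdamW optimizer, 3 epochs, a learning rate of
    2e-5, and a batch size of 32 for the 70B model and 128 for the 13B model.
    """
    batch_sizes = {"70B": 32, "13B": 128}
    if model_size not in batch_sizes:
        raise ValueError(f"no reported batch size for model size {model_size!r}")
    return {
        "optimizer": "AdamW",
        "num_epochs": 3,
        "learning_rate": 2e-5,
        "batch_size": batch_sizes[model_size],
    }


# Example: the configuration used for the 70B model.
config_70b = eit_training_config("70B")
```

Note the inverse scaling of batch size with model size (32 for 70B vs. 128 for 13B), consistent with fitting the larger model across the 32 A100 GPUs.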