System-2 Mathematical Reasoning via Enriched Instruction Tuning

Authors: Huanqia Cai, Yijun Yang, Zhifeng Li

TMLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In experiments, EIT achieves an accuracy of 84.1% on GSM8K and 32.5% on MATH, surpassing state-of-the-art fine-tuning and prompting methods, and even matching the performance of tool-augmented methods. We evaluate EIT and compare it with finetuning-based, prompting-based, and tool-augmented methods on two widely used mathematical benchmarks, MATH and GSM8K. We conduct the ablation study on the MATH and GSM8K datasets based on LLaMA-2-70B in Table 3. |
| Researcher Affiliation | Collaboration | Huanqia Cai (EMAIL), Tencent; Yijun Yang (EMAIL), Tencent and University of Technology Sydney; Zhifeng Li (EMAIL), XIntelligence Technology Co., Ltd. |
| Pseudocode | No | The paper describes the Enriched Instruction Tuning (EIT) method, including Enriching with Reasoning Plan (ERP) and Enriching with Reasoning Step (ERS), through textual explanations and examples in Figure 2 and Examples 3.1, 3.2, and 3.3. However, it does not present these methods or any other procedures in a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper states: 'These datasets are then used to fine-tune open-source LLMs, thereby enhancing their own ability to execute system-2 mathematical reasoning without any usage of external tools.' This refers to the use of *existing* open-source LLMs (like LLaMA-2) but does not provide an explicit statement or link to the source code for the authors' specific EIT methodology. |
| Open Datasets | Yes | Given human-annotated mathematical instruction datasets such as GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021), the responses provided by human annotators are typically accurate yet sparse, omitting many of the implicit reasoning steps that humans adopt when solving complex tasks. |
| Dataset Splits | Yes | The MATH dataset collects a total of 12,500 competition-level mathematics problems, which are partitioned into 7,500 for training and 5,000 for testing. Each of them is accompanied by a step-by-step solution and concludes with a distinct final answer, which is formatted for a straightforward comparison with model-generated solutions. Following the same setting as prior works, we use 7,473 and 1,319 problems for training and testing, respectively. |
| Hardware Specification | Yes | 32 A100 GPUs are used to fine-tune the above models. |
| Software Dependencies | No | We use the open-source LLM LLaMA-2 as the base model for fine-tuning. GPT-4-1106-preview is used to generate enriched responses for constructing EITMath. Following the prior work (Yu et al., 2023; Xu et al., 2023), we adopt the AdamW optimizer to train the model with 3 epochs, and the learning rate is set to 2e-5. The batch size is 32 for the 70B model and 128 for the 13B model. |
| Experiment Setup | Yes | Following the prior work (Yu et al., 2023; Xu et al., 2023), we adopt the AdamW optimizer to train the model with 3 epochs, and the learning rate is set to 2e-5. The batch size is 32 for the 70B model and 128 for the 13B model. |
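The split counts reported under Dataset Splits can be sanity-checked arithmetically. This is a minimal sketch, not code from the paper; the `REPORTED_SPLITS` mapping and `total_problems` helper are illustrative names:

```python
# Reported dataset splits from the paper's Dataset Splits entry:
#   MATH:  12,500 problems = 7,500 train + 5,000 test
#   GSM8K: 7,473 train + 1,319 test (the prior-work setting)
REPORTED_SPLITS = {
    "MATH": {"train": 7500, "test": 5000},
    "GSM8K": {"train": 7473, "test": 1319},
}


def total_problems(dataset: str) -> int:
    """Sum the train and test counts for a dataset."""
    return sum(REPORTED_SPLITS[dataset].values())


# The MATH counts are internally consistent with the stated 12,500 total.
assert total_problems("MATH") == 12500
```

The check confirms the two MATH numbers add up to the stated total; the GSM8K total is not stated in the quoted passage, so it is left unverified here.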
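The hyperparameters reported under Experiment Setup can be collected into one configuration sketch. This is library-agnostic and hypothetical; `eit_training_config` is an illustrative helper, not code released by the authors:

```python
def eit_training_config(model_size: str) -> dict:
    """Return the reported EIT fine-tuning hyperparameters for a LLaMA-2 size.

    The paper reports the AdamW optimizer, 3 epochs, a learning rate of
    2e-5, and a batch size of 32 for the 70B model and 128 for the 13B model.
    """
    batch_sizes = {"70B": 32, "13B": 128}
    if model_size not in batch_sizes:
        raise ValueError(f"no reported batch size for model size {model_size!r}")
    return {
        "optimizer": "AdamW",
        "num_epochs": 3,
        "learning_rate": 2e-5,
        "batch_size": batch_sizes[model_size],
    }


# Example: the configuration used for the 70B model.
config_70b = eit_training_config("70B")
```

Note the inverse scaling of batch size with model size (32 for 70B vs. 128 for 13B), consistent with fitting the larger model across the 32 A100 GPUs.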