Key-Point-Driven Data Synthesis with Its Enhancement on Mathematical Reasoning

Authors: Yiming Huang, Xiao Liu, Yeyun Gong, Zhibin Gou, Yelong Shen, Nan Duan, Weizhu Chen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that this dataset can enhance the mathematical reasoning performance of models across various architectures and sizes. Table 1 presents the results on six widely-used mathematical benchmarks, highlighting several key observations:
Researcher Affiliation | Industry | Microsoft EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: TCPM Calculation
Open Source Code | No | The paper uses various open-source models and repositories, such as LLaMA-Factory, for fine-tuning and evaluation, but it neither states that the code for its Key-Point-Driven Data Synthesis (KPDDS) methodology or the generation of the KPMath dataset is publicly available nor provides a link to the authors' own codebase.
Open Datasets | Yes | Our training corpus was further enriched by integrating a series of mathematical reasoning datasets, leading to the creation of a comprehensive training dataset, KPMath-Plus. [...] The collection encompasses the complete datasets of MetaMath (Yu et al. 2023), MMIQC (Liu et al. 2024), and Open-Platypus (Lee, Hunter, and Ruiz 2023), in addition to the training sets of GSM8K (Cobbe et al. 2021), MATH (Hendrycks et al. 2021), TAL-SCQ5K-EN (https://github.com/math-eval/TAL-SCQ5K), and MathInstruct (Yue et al. 2024).
Dataset Splits | Yes | We evaluate our fine-tuned models on GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021), along with 4 out-of-distribution datasets... Data Contamination Test: To mitigate data contamination risk in our benchmark, we used the method by Azerbayev et al. (2023) to scrutinize n-gram overlaps between our dataset and the MATH and GSM8K test sets.
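The cited contamination check can be sketched as a word-level n-gram overlap scan between training and test texts. This is a hedged illustration only: the default n-gram size and the helper names (`ngrams`, `is_contaminated`) are assumptions for this sketch, not details taken from the paper or from the Azerbayev et al. (2023) procedure.

```python
def ngrams(text, n):
    """Set of word-level n-grams in `text` (empty if shorter than n words)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_example, test_examples, n=30):
    """Flag a training example that shares any n-gram with a test example."""
    train_grams = ngrams(train_example, n)
    if not train_grams:
        # Training text shorter than n words: nothing to match against.
        return False
    return any(train_grams & ngrams(t, n) for t in test_examples)

# Toy demonstration: a training string that verbatim repeats a test item
# is flagged, while a short unrelated string is not.
train = "Solve for x: 2x + 3 = 7. " * 10
test = ["Solve for x: 2x + 3 = 7. " * 10]
```

In practice such scans are run over tokenized corpora at scale; the set-intersection form above is the simplest faithful rendering of the overlap idea.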
Hardware Specification | No | No specific hardware details such as GPU models, CPU types, or memory amounts are mentioned in the paper.
Software Dependencies | No | The paper mentions using LLaMA-Factory, DeepSpeed ZeRO Stage 3, FlashAttention-2, and the SymPy package. However, specific version numbers for these software dependencies are not provided in the text.
Experiment Setup | Yes | In our supervised fine-tuning (SFT) experiments, we employed chat message templates to transform question-answer pairs into the format: User: {question}\n Enclose the final answer using \boxed{}.\n\n Assistant: {answer}. We utilized the LLaMA-Factory repository (Zheng et al. 2024) to fine-tune the models for 3 epochs across all experiments. We adopted a linear learning rate schedule with a 3% warm-up ratio. The maximum learning rate is 1e-5, except for DeepSeekMath, which is 5e-5. We trained all models with BFloat16 numerical format... For evaluation, we adopted the same template in SFT to prompt all questions. We employed greedy decoding with a maximum sequence length of 2,048 tokens.
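The quoted template and learning-rate schedule can be sketched as follows. Only the User/Assistant format with the \boxed{} instruction, the 3% warm-up ratio, and the 1e-5 peak rate come from the quoted setup; the function names and the assumption that the rate decays linearly to zero after warm-up (a common convention the quote does not specify) are illustrative.

```python
def format_example(question, answer):
    """Render a question-answer pair in the quoted SFT chat template."""
    return (
        f"User: {question}\n"
        " Enclose the final answer using \\boxed{}.\n\n"
        f" Assistant: {answer}"
    )

def linear_lr(step, total_steps, max_lr=1e-5, warmup_ratio=0.03):
    """Linear warm-up to max_lr, then (assumed) linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * (total_steps - step) / (total_steps - warmup_steps)

print(format_example("What is 2 + 2?", "2 + 2 = \\boxed{4}"))
```

The same template string is reused verbatim at evaluation time to prompt the model, per the quoted setup.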