Key-Point-Driven Data Synthesis with Its Enhancement on Mathematical Reasoning

Authors: Yiming Huang, Xiao Liu, Yeyun Gong, Zhibin Gou, Yelong Shen, Nan Duan, Weizhu Chen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that this dataset can enhance the mathematical reasoning performance of models across various architectures and sizes. Table 1 presents the results on six widely-used mathematical benchmarks, highlighting several key observations:
Researcher Affiliation | Industry | Microsoft EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: TCPM Calculation
Open Source Code | No | The paper uses various open-source models and repositories, such as LLaMA-Factory, for fine-tuning and evaluation, but it neither states that the code for its Key-Point-Driven Data Synthesis (KPDDS) methodology or the generation of the KPMath dataset is publicly available nor provides a link to the authors' own codebase.
Open Datasets | Yes | Our training corpus was further enriched by integrating a series of mathematical reasoning datasets, leading to the creation of a comprehensive training dataset, KPMath-Plus. [...] The collection encompasses the complete datasets of MetaMath (Yu et al. 2023), MMIQC (Liu et al. 2024), and Open-Platypus (Lee, Hunter, and Ruiz 2023), in addition to the training sets of GSM8K (Cobbe et al. 2021), MATH (Hendrycks et al. 2021), TAL-SCQ5K-EN (https://github.com/math-eval/TAL-SCQ5K), and MathInstruct (Yue et al. 2024).
Dataset Splits | Yes | We evaluate our fine-tuned models on GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021), along with 4 out-of-distribution datasets... Data Contamination Test: To mitigate data contamination risk in our benchmark, we used the method by Azerbayev et al. (2023) to scrutinize n-gram overlaps between our dataset and the MATH and GSM8K test sets.
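The cited contamination check can be sketched as a word-level n-gram overlap scan between training and test texts. This is a hedged illustration only: the default n-gram size and the helper names (`ngrams`, `is_contaminated`) are assumptions for this sketch, not details taken from the paper or from the Azerbayev et al. (2023) procedure.

```python
def ngrams(text, n):
    """Set of word-level n-grams in `text` (empty if shorter than n words)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_example, test_examples, n=30):
    """Flag a training example that shares any n-gram with a test example."""
    train_grams = ngrams(train_example, n)
    if not train_grams:
        # Training text shorter than n words: nothing to match against.
        return False
    return any(train_grams & ngrams(t, n) for t in test_examples)

# Toy demonstration: a training string that verbatim repeats a test item
# is flagged, while a short unrelated string is not.
train = "Solve for x: 2x + 3 = 7. " * 10
test = ["Solve for x: 2x + 3 = 7. " * 10]
```

In practice such scans are run over tokenized corpora at scale; the set-intersection form above is the simplest faithful rendering of the overlap idea.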
Hardware Specification | No | No specific hardware details such as GPU models, CPU types, or memory amounts are mentioned in the paper.
Software Dependencies | No | The paper mentions using LLaMA-Factory, DeepSpeed ZeRO Stage 3, FlashAttention-2, and the SymPy package. However, specific version numbers for these software dependencies are not provided in the text.
Experiment Setup | Yes | In our supervised fine-tuning (SFT) experiments, we employed chat message templates to transform question-answer pairs into the format: User: {question}\n Enclose the final answer using \boxed{}.\n\n Assistant: {answer}. We utilized the LLaMA-Factory repository (Zheng et al. 2024) to fine-tune the models for 3 epochs across all experiments. We adopted a linear learning rate schedule with a 3% warm-up ratio. The maximum learning rate is 1e-5, except for DeepSeekMath, which is 5e-5. We trained all models with BFloat16 numerical format... For evaluation, we adopted the same template in SFT to prompt all questions. We employed greedy decoding with a maximum sequence length of 2,048 tokens.
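The quoted template and learning-rate schedule can be sketched as follows. Only the User/Assistant format with the \boxed{} instruction, the 3% warm-up ratio, and the 1e-5 peak rate come from the quoted setup; the function names and the assumption that the rate decays linearly to zero after warm-up (a common convention the quote does not specify) are illustrative.

```python
def format_example(question, answer):
    """Render a question-answer pair in the quoted SFT chat template."""
    return (
        f"User: {question}\n"
        " Enclose the final answer using \\boxed{}.\n\n"
        f" Assistant: {answer}"
    )

def linear_lr(step, total_steps, max_lr=1e-5, warmup_ratio=0.03):
    """Linear warm-up to max_lr, then (assumed) linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * (total_steps - step) / (total_steps - warmup_steps)

print(format_example("What is 2 + 2?", "2 + 2 = \\boxed{4}"))
```

The same template string is reused verbatim at evaluation time to prompt the model, per the quoted setup.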