Progress or Regress? Self-Improvement Reversal in Post-training

Authors: Ting Wu, Xuefeng Li, Pengfei Liu

ICLR 2025

Reproducibility assessment (variable: result, followed by the LLM's supporting response):
Research Type: Experimental. "Through rigorous experimentation and analysis across diverse problem-solving tasks, the empirical results reveal the phenomenon of self-improvement reversal, where models showing improved performance across benchmarks paradoxically exhibit declines in broader, essential capabilities, such as output diversity and out-of-distribution (OOD) generalization."
Researcher Affiliation: Academia. Ting Wu (1,3), Xuefeng Li (2,3), Pengfei Liu (2,3); 1 Fudan University, 2 Shanghai Jiao Tong University, 3 Generative AI Research Lab (GAIR).
Pseudocode: Yes. Algorithm 1: Iterative Self-Improvement.
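The paper's Algorithm 1 is not reproduced on this page, but a generic iterative self-improvement loop can be sketched as follows. The helpers `sample_cot`, `is_correct`, and `finetune` are hypothetical stand-ins passed as arguments, not the authors' actual implementation.

```python
# Generic sketch of iterative self-improvement (not the paper's exact
# Algorithm 1): each round samples k CoT solutions per training problem,
# keeps the verified-correct ones, and fine-tunes the model on them.
def self_improve(model, train_set, sample_cot, is_correct, finetune,
                 iterations=3, k=8):
    for _ in range(iterations):
        correct_pairs = []
        for problem in train_set:
            for solution in sample_cot(model, problem, k):
                if is_correct(problem, solution):
                    correct_pairs.append((problem, solution))
        # Update the model on its own verified generations.
        model = finetune(model, correct_pairs)
    return model
```

The loop structure (sample, filter, retrain) is what creates the benchmark gains and, per the paper's findings, the accompanying loss of diversity and OOD robustness.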
Open Source Code: No. "Our training codebase is based on LLaMA Factory (Zheng et al., 2024), and we use the vLLM (Kwon et al., 2023) framework to perform inference for both CoT sampling and test evaluation."
Open Datasets: Yes. "To measure model problem-solving capabilities, we train and test on a broad spectrum of problem-solving datasets. We measure general knowledge using the CommonsenseQA (CSQA) (Talmor et al., 2019) dataset, assess mathematical reasoning with the GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) datasets, and weigh code generation skills using the MBPP dataset (Austin et al., 2021)."
Dataset Splits: Yes. "Regarding the train-test split, we adhere to (Kojima et al., 2022), utilizing the validation set of CSQA for evaluation. The GSM8K and MATH datasets are employed with their predefined train-test splits. For the MBPP code dataset, we follow the approach outlined by (Austin et al., 2021), utilizing the examples with Task IDs 11-510 as the 500 test problems, and the remaining 374 examples with Task IDs ranging from 601 to 974 for fine-tuning."
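The MBPP split described above can be reproduced in a few lines. This sketch assumes each example is a dict carrying a `task_id` field, as in the released MBPP JSONL; it is an illustration of the stated Task ID ranges, not the authors' code.

```python
# Split MBPP by Task ID as described: IDs 11-510 form the 500-problem
# test set, IDs 601-974 form the 374-example fine-tuning set.
def split_mbpp(examples):
    test = [ex for ex in examples if 11 <= ex["task_id"] <= 510]
    train = [ex for ex in examples if 601 <= ex["task_id"] <= 974]
    return train, test
```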
Hardware Specification: Yes. "All training experiments are conducted on 8 NVIDIA A100 GPUs, and all experiments collectively consumed approximately 2000 A100 GPU hours."
Software Dependencies: No. "Our training codebase is based on LLaMA Factory (Zheng et al., 2024), and we use the vLLM (Kwon et al., 2023) framework to perform inference for both CoT sampling and test evaluation."
Experiment Setup: Yes. "Detailed hyperparameters utilized throughout these experiments are documented in Table 1."
Table 1:
  Supervised Fine-Tuning: Batch Size 128; Learning Rate (LLaMA2-7B) 1e-5; Learning Rate (Mistral-7B, LLaMA3-8B) 2e-6; Learning Rate Scheduler Cosine; Warm-up Ratio 0.03; Optimizer AdamW; Epochs 3.
  Preference Fine-Tuning: Batch Size 32; Learning Rate (LLaMA2-7B) 2e-6; Learning Rate (Mistral-7B, LLaMA3-8B) 2e-7; KL Coefficient (β) 0.3; Optimizer AdamW; Epochs 1.
  Sampling Generation: Temperature 0.75; Top-p 0.95; Top-k 50; Max tokens 512.
  Evaluation Generation: Temperature 0; Top-k -1; Max tokens 512.
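For concreteness, the two generation configurations from Table 1 can be written as keyword dicts whose names follow vLLM's SamplingParams arguments. This is a sketch of the listed values only, not the authors' configuration files.

```python
# Generation settings from Table 1 as plain keyword dicts (sketch only);
# key names follow vLLM's SamplingParams arguments.
SAMPLING_GENERATION = dict(temperature=0.75, top_p=0.95, top_k=50,
                           max_tokens=512)
# Temperature 0 with top_k=-1 (i.e., top-k disabled) amounts to greedy
# decoding for deterministic test evaluation.
EVALUATION_GENERATION = dict(temperature=0.0, top_k=-1, max_tokens=512)
```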