Progress or Regress? Self-Improvement Reversal in Post-training

Authors: Ting Wu, Xuefeng Li, Pengfei Liu

ICLR 2025

Reproducibility assessment (variable: result, followed by the LLM's supporting response):
Research Type: Experimental. "Through rigorous experimentation and analysis across diverse problem-solving tasks, the empirical results reveal the phenomenon of self-improvement reversal, where models showing improved performance across benchmarks paradoxically exhibit declines in broader, essential capabilities, such as output diversity and out-of-distribution (OOD) generalization."
Researcher Affiliation: Academia. Ting Wu (1,3), Xuefeng Li (2,3), Pengfei Liu (2,3); 1 Fudan University, 2 Shanghai Jiao Tong University, 3 Generative AI Research Lab (GAIR).
Pseudocode: Yes. Algorithm 1: Iterative Self-Improvement.
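The paper's Algorithm 1 is not reproduced on this page, but a generic iterative self-improvement loop can be sketched as follows. The helpers `sample_cot`, `is_correct`, and `finetune` are hypothetical stand-ins passed as arguments, not the authors' actual implementation.

```python
# Generic sketch of iterative self-improvement (not the paper's exact
# Algorithm 1): each round samples k CoT solutions per training problem,
# keeps the verified-correct ones, and fine-tunes the model on them.
def self_improve(model, train_set, sample_cot, is_correct, finetune,
                 iterations=3, k=8):
    for _ in range(iterations):
        correct_pairs = []
        for problem in train_set:
            for solution in sample_cot(model, problem, k):
                if is_correct(problem, solution):
                    correct_pairs.append((problem, solution))
        # Update the model on its own verified generations.
        model = finetune(model, correct_pairs)
    return model
```

The loop structure (sample, filter, retrain) is what creates the benchmark gains and, per the paper's findings, the accompanying loss of diversity and OOD robustness.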
Open Source Code: No. "Our training codebase is based on LLaMA Factory (Zheng et al., 2024), and we use the vLLM (Kwon et al., 2023) framework to perform inference for both CoT sampling and test evaluation."
Open Datasets: Yes. "To measure model problem-solving capabilities, we train and test on a broad spectrum of problem-solving datasets. We measure general knowledge using the CommonsenseQA (CSQA) (Talmor et al., 2019) dataset, assess mathematical reasoning with the GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) datasets, and weigh code generation skills using the MBPP dataset (Austin et al., 2021)."
Dataset Splits: Yes. "Regarding the train-test split, we adhere to (Kojima et al., 2022), utilizing the validation set of CSQA for evaluation. The GSM8K and MATH datasets are employed with their predefined train-test splits. For the MBPP code dataset, we follow the approach outlined by (Austin et al., 2021), utilizing the examples with Task IDs 11-510 as the 500 test problems, and the remaining 374 examples with Task IDs ranging from 601 to 974 for fine-tuning."
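The MBPP split described above can be reproduced in a few lines. This sketch assumes each example is a dict carrying a `task_id` field, as in the released MBPP JSONL; it is an illustration of the stated Task ID ranges, not the authors' code.

```python
# Split MBPP by Task ID as described: IDs 11-510 form the 500-problem
# test set, IDs 601-974 form the 374-example fine-tuning set.
def split_mbpp(examples):
    test = [ex for ex in examples if 11 <= ex["task_id"] <= 510]
    train = [ex for ex in examples if 601 <= ex["task_id"] <= 974]
    return train, test
```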
Hardware Specification: Yes. "All training experiments are conducted on 8 NVIDIA A100 GPUs, and all experiments collectively consumed approximately 2000 A100 GPU hours."
Software Dependencies: No. "Our training codebase is based on LLaMA Factory (Zheng et al., 2024), and we use the vLLM (Kwon et al., 2023) framework to perform inference for both CoT sampling and test evaluation."
Experiment Setup: Yes. "Detailed hyperparameters utilized throughout these experiments are documented in Table 1."
Table 1:
  Supervised Fine-Tuning: Batch Size 128; Learning Rate (LLaMA2-7B) 1e-5; Learning Rate (Mistral-7B, LLaMA3-8B) 2e-6; Learning Rate Scheduler Cosine; Warm-up Ratio 0.03; Optimizer AdamW; Epochs 3.
  Preference Fine-Tuning: Batch Size 32; Learning Rate (LLaMA2-7B) 2e-6; Learning Rate (Mistral-7B, LLaMA3-8B) 2e-7; KL Coefficient (β) 0.3; Optimizer AdamW; Epochs 1.
  Sampling Generation: Temperature 0.75; Top-p 0.95; Top-k 50; Max tokens 512.
  Evaluation Generation: Temperature 0; Top-k -1; Max tokens 512.
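For concreteness, the two generation configurations from Table 1 can be written as keyword dicts whose names follow vLLM's SamplingParams arguments. This is a sketch of the listed values only, not the authors' configuration files.

```python
# Generation settings from Table 1 as plain keyword dicts (sketch only);
# key names follow vLLM's SamplingParams arguments.
SAMPLING_GENERATION = dict(temperature=0.75, top_p=0.95, top_k=50,
                           max_tokens=512)
# Temperature 0 with top_k=-1 (i.e., top-k disabled) amounts to greedy
# decoding for deterministic test evaluation.
EVALUATION_GENERATION = dict(temperature=0.0, top_k=-1, max_tokens=512)
```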