Progressively Label Enhancement for Large Language Model Alignment

Authors: Biao Liu, Ning Xu, Xin Geng

ICML 2025

Reproducibility assessment. Each variable below is listed with its result and the supporting excerpt (LLM response) from the paper.
Research Type: Experimental
"Experimental results demonstrate the effectiveness of PLE compared to existing LLM alignment methods." Supporting sections: 6. Experiments (6.1 Experimental Configurations, 6.2 Main Results, 6.3 Ablation Study). "Datasets. We conducted experiments on three tasks. (1) For the multi-turn dialogue task, we use Anthropic's Helpful and Harmless (HH) dataset as the experimental dataset (Bai et al., 2022a)."
Researcher Affiliation: Academia
"1School of Computer Science and Engineering, Southeast University, Nanjing, China. E-mail: EMAIL. 2Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China."
Pseudocode: Yes
"Algorithm 1: The PLE Algorithm. Input: the SFT training set Dsft, a query set Dquery, the human-designed principle p, the initial base model πθ, the initial threshold τ0, the decay factor α, and the number of iterations I."
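The excerpt names only the inputs of Algorithm 1, not its loop body. The sketch below is a hypothetical reconstruction of the iterative structure those inputs suggest: only the hyperparameters τ0 = 0.2 and α = 0.9 come from the paper; the `score` function, the selection rule, and the fine-tuning placeholder are assumptions standing in for the paper's use of the base model πθ and the principle p.

```python
def threshold_schedule(tau0, alpha, num_iters):
    """Geometrically decaying threshold: tau_t = tau0 * alpha**t (assumed form)."""
    return [tau0 * alpha ** t for t in range(num_iters)]


def ple_sketch(queries, score, tau0=0.2, alpha=0.9, num_iters=3):
    """Hypothetical PLE-style loop: progressively admit items whose score
    clears a threshold that relaxes by the decay factor each iteration.
    `score` is a placeholder for whatever confidence the real method uses."""
    selected = []
    tau = tau0
    for _ in range(num_iters):
        # Admit items that clear the current threshold and are not yet selected.
        newly = [q for q in queries if score(q) >= tau and q not in selected]
        selected.extend(newly)
        # ...fine-tune the base model on `selected` here (omitted)...
        tau *= alpha  # relax the threshold for the next iteration
    return selected
```

With τ0 = 0.2 and α = 0.9 the threshold sequence is 0.2, 0.18, 0.162, ..., so borderline items are picked up in later, more permissive rounds.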
Open Source Code: No
"For implementing SFT, PPO, and DPO, we utilized the Transformer Reinforcement Learning (TRL) library. For RAFT, we employed the official LMFlow library. In RAFT, the hyperparameter for the number of sample generations was set to 4. To save memory, we used the Parameter-Efficient Fine-Tuning (PEFT) technique, specifically Low-Rank Adaptation (LoRA) (Hu et al., 2022) with rank r = 8 and scaling factor α = 16, targeting all linear modules in all experiments. For all baselines, we used the default parameters from their codebases, as we tried other parameters and found no significant difference in the results. For PLE, we set the initial threshold τ0 = 0.2 and the decay factor α = 0.9. All experiments were conducted on 8 Huawei Ascend 910B (64 GB) accelerators with 1,000 GB of RAM."
Open Datasets: Yes
"Datasets. We conducted experiments on three tasks. (1) For the multi-turn dialogue task, we use Anthropic's Helpful and Harmless (HH) dataset as the experimental dataset (Bai et al., 2022a). (2) For the controlled text generation task, we use the IMDb dataset (Maas et al., 2011). (3) For the summarization task, we use the Reddit TL;DR summarization dataset (Völske et al., 2017)."
Dataset Splits: Yes
"The dataset consists of 161K training data points and 8.55K test data points. (2) For the controlled text generation task, we use the IMDb dataset (Maas et al., 2011). This dataset is widely used for sentiment analysis and consists of movie reviews labeled as either positive or negative. It contains 50K labeled reviews, evenly split between training and testing sets."
Hardware Specification: Yes
"All experiments were conducted on 8 Huawei Ascend 910B (64 GB) accelerators with 1,000 GB of RAM."
Software Dependencies: No
"For implementing SFT, PPO, and DPO, we utilized the Transformer Reinforcement Learning (TRL) library. For RAFT, we employed the official LMFlow library."
Experiment Setup: Yes
"To save memory, we used the Parameter-Efficient Fine-Tuning (PEFT) technique, specifically Low-Rank Adaptation (LoRA) (Hu et al., 2022) with rank r = 8 and scaling factor α = 16, targeting all linear modules in all experiments. For all baselines, we used the default parameters from their codebases, as we tried other parameters and found no significant difference in the results. For PLE, we set the initial threshold τ0 = 0.2 and the decay factor α = 0.9."
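To make the reported LoRA hyperparameters concrete, here is a minimal numerical sketch of the LoRA update rule, where a frozen weight W is adapted as W' = W + (α/r)·BA. Only r = 8 and α = 16 come from the paper; the matrix dimensions, the `matmul` helper, and the function names below are toy values for illustration, not the authors' PEFT-based implementation.

```python
def matmul(X, Y):
    """Plain-Python matrix product of nested lists X (m×n) and Y (n×k)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def lora_update(W, A, B, r=8, alpha=16):
    """Apply the LoRA delta (alpha/r) * B @ A to a frozen weight matrix W.

    B is d×r and A is r×k, so the trainable delta has rank at most r.
    With the paper's hyperparameters the scaling is alpha/r = 16/8 = 2.0.
    """
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

The α/r scaling means the effective magnitude of the adapter is kept comparable across ranks, which is why the paper can report r and α together as the two knobs of the setup.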