Progressively Label Enhancement for Large Language Model Alignment
Authors: Biao Liu, Ning Xu, Xin Geng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate the effectiveness of PLE compared to existing LLM alignment methods. Datasets: We conducted experiments on three tasks. (1) For the multi-turn dialogue task, we use Anthropic's Helpful and Harmless (HH) dataset as experimental dataset (Bai et al., 2022a). (The paper's Section 6 covers experimental configurations in 6.1, main results in 6.2, and an ablation study in 6.3.) |
| Researcher Affiliation | Academia | 1School of Computer Science and Engineering, Southeast University, Nanjing, China. E-mail: EMAIL. 2Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China. |
| Pseudocode | Yes | Algorithm 1 The PLE Algorithm. Input: the SFT training set Dsft, a query set Dquery, the human-designed principle p, the initial base model πθ, the initial threshold τ0, the decay factor α, and the number of iterations I. |
| Open Source Code | No | For implementing SFT, PPO, and DPO, we utilized the Transformer Reinforcement Learning (TRL) library. For RAFT, we employed the official LMflow library. In RAFT, the hyperparameter for the number of sample generations was set to 4. To save memory, we used the Parameter-Efficient Fine-Tuning (PEFT) technique, specifically Low-Rank Adaptation (LoRA) (Hu et al., 2022) with rank r = 8, scaling factor α = 16, and targeted all linear modules for all experiments. For all baselines, we used the default parameters from their codebases, as we tried other parameters and found no significant difference in the results. For PLE, we set the initial threshold τ0 = 0.2 and the decay factor α = 0.9. All experiments were conducted on 8 Huawei Ascend 910B (64GB) hardware with 1000GB RAM. |
| Open Datasets | Yes | Datasets: We conducted experiments on three tasks. (1) For the multi-turn dialogue task, we use Anthropic's Helpful and Harmless (HH) dataset as experimental dataset (Bai et al., 2022a). (2) For the controlled text generation task, we use the IMDb dataset (Maas et al., 2011). (3) For the summarization task, we use the Reddit TL;DR summarization dataset (Völske et al., 2017). |
| Dataset Splits | Yes | The dataset consists of 161K training data points and 8.55K test data points. (2) For the controlled text generation task, we use the IMDb dataset (Maas et al., 2011). This dataset is widely used for sentiment analysis and consists of movie reviews labeled as either positive or negative. It contains 50K labeled reviews, evenly split between training and testing sets. |
| Hardware Specification | Yes | All experiments were conducted on 8 Huawei Ascend 910B (64GB) hardware with RAM 1000GB. |
| Software Dependencies | No | For implementing SFT, PPO, and DPO, we utilized the Transformer Reinforcement Learning (TRL) library. For RAFT, we employed the official LMflow library. |
| Experiment Setup | Yes | To save memory, we used the Parameter-Efficient Fine-Tuning (PEFT) technique, specifically Low-Rank Adaptation (LoRA) (Hu et al., 2022) with rank r = 8, scaling factor α = 16, and targeted all linear modules for all experiments. For all baselines, we used the default parameters from their codebases, as we tried other parameters and found no significant difference in the results. For PLE, we set the initial threshold τ0 = 0.2 and the decay factor α = 0.9. |
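The PLE inputs quoted above (initial threshold τ0 = 0.2, decay factor α = 0.9, I iterations) imply a threshold that is relaxed across iterations. The paper excerpts here do not state the exact update rule, so the multiplicative form τᵢ = τ0 · αⁱ below is an assumption for illustration, not the authors' verified schedule:

```python
def threshold_schedule(tau0: float, alpha: float, iterations: int) -> list[float]:
    """Hypothetical PLE-style threshold schedule.

    Assumes multiplicative decay tau_i = tau0 * alpha**i; the paper
    supplies tau0 = 0.2 and alpha = 0.9 but the quoted text does not
    confirm the functional form.
    """
    return [tau0 * alpha**i for i in range(iterations)]

# Reported hyperparameters, illustrative iteration count of 5.
schedule = threshold_schedule(0.2, 0.9, 5)
```

Under this assumption the selection threshold shrinks geometrically, so later iterations admit progressively more generated samples into training.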
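The LoRA setting in the setup (rank r = 8, scaling factor α = 16) can be made concrete with a small sketch of the standard LoRA bookkeeping: the frozen weight W receives an update (α/r)·B·A, where B is d_out × r and A is r × d_in. The 4096 × 4096 layer size below is illustrative, not taken from the paper:

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer:
    B has d_out * r entries and A has r * d_in entries."""
    return rank * (d_in + d_out)

def lora_scaling(alpha: float, rank: int) -> float:
    """Effective scaling applied to the low-rank update B @ A."""
    return alpha / rank

# With the paper's r = 8, alpha = 16 on a hypothetical 4096 x 4096 layer:
added = lora_param_count(4096, 4096, 8)  # parameters LoRA trains
full = 4096 * 4096                       # parameters in the frozen layer
scale = lora_scaling(16, 8)              # update scaled by alpha / r
```

With these values LoRA trains roughly 0.4% of the layer's parameters and scales the low-rank update by 2, which is consistent with the memory-saving motivation quoted in the setup.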