Progressively Label Enhancement for Large Language Model Alignment

Authors: Biao Liu, Ning Xu, Xin Geng

ICML 2025

Reproducibility assessment. Each variable below is listed with its result and the supporting excerpt (LLM response) from the paper.
Research Type: Experimental
"Experimental results demonstrate the effectiveness of PLE compared to existing LLM alignment methods." Supporting sections: 6. Experiments (6.1 Experimental Configurations, 6.2 Main Results, 6.3 Ablation Study). "Datasets. We conducted experiments on three tasks. (1) For the multi-turn dialogue task, we use Anthropic's Helpful and Harmless (HH) dataset as the experimental dataset (Bai et al., 2022a)."
Researcher Affiliation: Academia
"1School of Computer Science and Engineering, Southeast University, Nanjing, China. E-mail: EMAIL. 2Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China."
Pseudocode: Yes
"Algorithm 1: The PLE Algorithm. Input: the SFT training set Dsft, a query set Dquery, the human-designed principle p, the initial base model πθ, the initial threshold τ0, the decay factor α, and the number of iterations I."
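The excerpt names only the inputs of Algorithm 1, not its loop body. The sketch below is a hypothetical reconstruction of the iterative structure those inputs suggest: only the hyperparameters τ0 = 0.2 and α = 0.9 come from the paper; the `score` function, the selection rule, and the fine-tuning placeholder are assumptions standing in for the paper's use of the base model πθ and the principle p.

```python
def threshold_schedule(tau0, alpha, num_iters):
    """Geometrically decaying threshold: tau_t = tau0 * alpha**t (assumed form)."""
    return [tau0 * alpha ** t for t in range(num_iters)]


def ple_sketch(queries, score, tau0=0.2, alpha=0.9, num_iters=3):
    """Hypothetical PLE-style loop: progressively admit items whose score
    clears a threshold that relaxes by the decay factor each iteration.
    `score` is a placeholder for whatever confidence the real method uses."""
    selected = []
    tau = tau0
    for _ in range(num_iters):
        # Admit items that clear the current threshold and are not yet selected.
        newly = [q for q in queries if score(q) >= tau and q not in selected]
        selected.extend(newly)
        # ...fine-tune the base model on `selected` here (omitted)...
        tau *= alpha  # relax the threshold for the next iteration
    return selected
```

With τ0 = 0.2 and α = 0.9 the threshold sequence is 0.2, 0.18, 0.162, ..., so borderline items are picked up in later, more permissive rounds.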
Open Source Code: No
"For implementing SFT, PPO, and DPO, we utilized the Transformer Reinforcement Learning (TRL) library. For RAFT, we employed the official LMFlow library. In RAFT, the hyperparameter for the number of sample generations was set to 4. To save memory, we used the Parameter-Efficient Fine-Tuning (PEFT) technique, specifically Low-Rank Adaptation (LoRA) (Hu et al., 2022) with rank r = 8 and scaling factor α = 16, targeting all linear modules in all experiments. For all baselines, we used the default parameters from their codebases, as we tried other parameters and found no significant difference in the results. For PLE, we set the initial threshold τ0 = 0.2 and the decay factor α = 0.9. All experiments were conducted on 8 Huawei Ascend 910B (64 GB) accelerators with 1,000 GB of RAM."
Open Datasets: Yes
"Datasets. We conducted experiments on three tasks. (1) For the multi-turn dialogue task, we use Anthropic's Helpful and Harmless (HH) dataset as the experimental dataset (Bai et al., 2022a). (2) For the controlled text generation task, we use the IMDb dataset (Maas et al., 2011). (3) For the summarization task, we use the Reddit TL;DR summarization dataset (Völske et al., 2017)."
Dataset Splits: Yes
"The dataset consists of 161K training data points and 8.55K test data points. (2) For the controlled text generation task, we use the IMDb dataset (Maas et al., 2011). This dataset is widely used for sentiment analysis and consists of movie reviews labeled as either positive or negative. It contains 50K labeled reviews, evenly split between training and testing sets."
Hardware Specification: Yes
"All experiments were conducted on 8 Huawei Ascend 910B (64 GB) accelerators with 1,000 GB of RAM."
Software Dependencies: No
"For implementing SFT, PPO, and DPO, we utilized the Transformer Reinforcement Learning (TRL) library. For RAFT, we employed the official LMFlow library."
Experiment Setup: Yes
"To save memory, we used the Parameter-Efficient Fine-Tuning (PEFT) technique, specifically Low-Rank Adaptation (LoRA) (Hu et al., 2022) with rank r = 8 and scaling factor α = 16, targeting all linear modules in all experiments. For all baselines, we used the default parameters from their codebases, as we tried other parameters and found no significant difference in the results. For PLE, we set the initial threshold τ0 = 0.2 and the decay factor α = 0.9."
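To make the reported LoRA hyperparameters concrete, here is a minimal numerical sketch of the LoRA update rule, where a frozen weight W is adapted as W' = W + (α/r)·BA. Only r = 8 and α = 16 come from the paper; the matrix dimensions, the `matmul` helper, and the function names below are toy values for illustration, not the authors' PEFT-based implementation.

```python
def matmul(X, Y):
    """Plain-Python matrix product of nested lists X (m×n) and Y (n×k)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def lora_update(W, A, B, r=8, alpha=16):
    """Apply the LoRA delta (alpha/r) * B @ A to a frozen weight matrix W.

    B is d×r and A is r×k, so the trainable delta has rank at most r.
    With the paper's hyperparameters the scaling is alpha/r = 16/8 = 2.0.
    """
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

The α/r scaling means the effective magnitude of the adapter is kept comparable across ranks, which is why the paper can report r and α together as the two knobs of the setup.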