SELF-EVOLVED REWARD LEARNING FOR LLMS
Authors: Chenghua Huang, Zhizhen Fan, Lu Wang, Fangkai Yang, Pu Zhao, Zeqi Lin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive experiments on multiple datasets such as HH-RLHF and UltraFeedback, using models like Mistral and Llama 3, and compared SER against various baselines. Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance, thereby boosting the capabilities of large language models (LLMs). |
| Researcher Affiliation | Collaboration | School of Computer Science, Fudan University; School of Computer Science, Peking University; Microsoft |
| Pseudocode | Yes | Algorithm 1: Self-Evolved Reward Learning for LLMs (SER). Input: initial RM Rθ, unlabeled data Dunlabeled, human-labeled data Dlabeled, thresholds τlow, τhigh, τ, δ, learning rate η. Output: trained LLM policy πϕ. /* Step 0: Pretrain the Reward Model */ Pretrain Rθ on the human-labeled data Dlabeled using the pairwise loss Lpair; while not converged do ... |
| Open Source Code | Yes | Resources of this paper can be found at https://aka.ms/ser |
| Open Datasets | Yes | Our experiments explore four different preference datasets as shown in Table 2. Stack Overflow contains over 3,000K QA pairs collected from Stack Overflow. Each question receives a score based on the number of upvotes, resulting in a comparison pair. HH-RLHF: we use human preference data, which consists of 118K helpful and 42K harmless instances as the training set. Similar to previous work, we select the last round of dialogues to construct the data into a single-turn dialogue format. UltraFeedback is constructed by large language models (LLMs). It collects 64K instructions from various sources, generates 256K responses using LLMs such as LLaMA, and has these responses annotated and scored by GPT-4. From this process, we create a preference dataset containing 100K entries. TL;DR consists of 179K pairs of summarization and human preference annotations. |
| Dataset Splits | Yes | For the preference dataset, we split the training and testing sets according to the ratio of SFT:RM:PPO = 0.3:0.65:0.05. |
| Hardware Specification | Yes | For smaller parameter models (e.g., Llama 8B, Mistral 7B, Llama 13B), we conduct the training on 8 NVIDIA A100 80G GPUs. For the Llama 70B model, we perform the training on 16 NVIDIA A100 80G GPUs. |
| Software Dependencies | No | No specific version numbers for software dependencies were provided in the paper's main text or appendices for libraries like LoRA, AdamW, or foundational software like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We employ a learning rate of 2e-5 with cosine decay, 2 warmup steps, and a batch size of 16. We calculate the loss only for the target tokens rather than the full input sequence, and we train for 3 epochs on the training data. For PPO training, we use a learning rate of 1.4e-5 and set the generated sample length to 256. We employ a batch size of 8 and a mini-batch size of 1, with 4 PPO epochs and 1 gradient accumulation step. The target KL divergence is set to 0.1 and the initial KL coefficient is set to 0.2. The thresholds τhigh, τlow, and τ were determined through extensive hyper-parameter tuning to balance precision and recall in the self-training process. Specifically, we experimented with the following values: τhigh ∈ {0.55, 0.65, 0.75}, τlow ∈ {0.45, 0.35, 0.25}, τ ∈ {0.3, 0.4, 0.5}. After evaluating the RM's performance with these parameters, we selected τhigh = 0.55, τlow = 0.45, and τ = 0.3 as they provided the most consistent improvements in the RM's ability to self-label effectively without introducing significant error amplification. |
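The SFT:RM:PPO = 0.3:0.65:0.05 split reported above can be sketched as a simple index-based partition; the function name and the assumption that the data is a flat, already-shuffled list are ours, not the paper's.

```python
def split_preference_data(data, ratios=(0.30, 0.65, 0.05)):
    """Split a preference dataset into SFT / RM / PPO portions.

    Uses the paper's 0.3:0.65:0.05 ratio by default; any rounding
    remainder falls into the last (PPO) portion.
    """
    n = len(data)
    n_sft = int(n * ratios[0])
    n_rm = int(n * ratios[1])
    return data[:n_sft], data[n_sft:n_sft + n_rm], data[n_sft + n_rm:]
```

For a 100-example dataset this yields portions of 30, 65, and 5 examples.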
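The pairwise loss Lpair and the τhigh/τlow self-labeling step from Algorithm 1 can be sketched as follows. This is a minimal Bradley-Terry-style illustration: the function names, and the use of the pairwise win probability as the confidence score, are our assumptions rather than the authors' exact implementation.

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def self_label(reward_model, pair, tau_high=0.55, tau_low=0.45):
    """Label an unlabeled (response_a, response_b) pair with the current RM.

    Returns (chosen, rejected) when the RM is confident enough, else None.
    Confidence is taken here as P(a beats b) under the Bradley-Terry model;
    the thresholds match the paper's selected tau_high/tau_low values, the
    rest is an illustrative sketch.
    """
    a, b = pair
    p_a_wins = 1.0 / (1.0 + math.exp(-(reward_model(a) - reward_model(b))))
    if p_a_wins >= tau_high:
        return (a, b)   # confidently prefer a
    if p_a_wins <= tau_low:
        return (b, a)   # confidently prefer b
    return None         # ambiguous: leave the pair unlabeled
```

With a toy length-based reward model, `self_label(lambda t: 0.1 * len(t), ("a longer, detailed answer", "short"))` confidently prefers the longer response, while a pair with equal rewards falls between the thresholds and stays unlabeled.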