SELF-EVOLVED REWARD LEARNING FOR LLMS
Authors: Chenghua Huang, Zhizhen Fan, Lu Wang, Fangkai Yang, Pu Zhao, Zeqi Lin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive experiments on multiple datasets such as HH-RLHF and UltraFeedback, using models like Mistral and Llama 3, and compared SER against various baselines. Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance, thereby boosting the capabilities of large language models (LLMs). |
| Researcher Affiliation | Collaboration | School of Computer Science, Fudan University; School of Computer Science, Peking University; Microsoft |
| Pseudocode | Yes | Algorithm 1: Self-Evolved Reward Learning for LLMs (SER). Input: initial RM Rθ, unlabeled data Dunlabeled, human-labeled data Dlabeled, thresholds τlow, τhigh, τ, δ, learning rate η. Output: trained LLM policy πϕ. /* Step 0: Pretrain the Reward Model */ Pretrain Rθ on the human-labeled data Dlabeled using the pairwise loss Lpair; while not converged do ... |
| Open Source Code | Yes | Resources of this paper can be found at https://aka.ms/ser |
| Open Datasets | Yes | Our experiments explore four different preference datasets as shown in Table 2. Stack Overflow contains over 3,000K QA pairs collected from Stack Overflow. Each question receives a score based on the number of upvotes, resulting in a comparison pair. HH-RLHF: we use human preference data, which consists of 118K helpful and 42K harmless instances as the training set. Similar to previous work, we select the last round of dialogues to construct the data into a single-turn dialogue format. UltraFeedback is constructed by large language models (LLMs). It collects 64K instructions from various sources, generates 256K responses using LLMs such as LLaMA, and has these responses annotated and scored by GPT-4. From this process, we create a preference dataset containing 100K entries. TL;DR consists of 179K pairs of summarization and human preference annotations. |
| Dataset Splits | Yes | For the preference dataset, we split the training and testing sets according to the ratio of SFT:RM:PPO = 0.3:0.65:0.05. |
| Hardware Specification | Yes | For smaller parameter models (e.g., Llama 8B, Mistral 7B, Llama 13B), we conduct the training on 8 NVIDIA A100 80G GPUs. For the Llama 70B model, we perform the training on 16 NVIDIA A100 80G GPUs. |
| Software Dependencies | No | No specific version numbers for software dependencies were provided in the paper's main text or appendices for libraries like LoRA, AdamW, or foundational software like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We employ a learning rate of 2e-5 with cosine decay, 2 warmup steps, and a batch size of 16. We calculate the loss only for the target tokens rather than the full input sequence, and we train for 3 epochs on the training data. For PPO training, we use a learning rate of 1.4e-5 and set the generated sample length to 256. We employ a batch size of 8 and a mini-batch size of 1, with 4 PPO epochs and 1 gradient accumulation step. The target KL divergence is set to 0.1 and the initial KL coefficient is set to 0.2. The thresholds τhigh, τlow, and τ were determined through extensive hyper-parameter tuning to balance precision and recall in the self-training process. Specifically, we experimented with the following values: τhigh ∈ {0.55, 0.65, 0.75}, τlow ∈ {0.45, 0.35, 0.25}, τ ∈ {0.3, 0.4, 0.5}. After evaluating the RM's performance with these parameters, we selected τhigh = 0.55, τlow = 0.45, and τ = 0.3 as they provided the most consistent improvements in the RM's ability to self-label effectively without introducing significant error amplification. |
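The SFT:RM:PPO = 0.3:0.65:0.05 split reported above can be sketched as a simple index-based partition; the function name and the assumption that the data is a flat, already-shuffled list are ours, not the paper's.

```python
def split_preference_data(data, ratios=(0.30, 0.65, 0.05)):
    """Split a preference dataset into SFT / RM / PPO portions.

    Uses the paper's 0.3:0.65:0.05 ratio by default; any rounding
    remainder falls into the last (PPO) portion.
    """
    n = len(data)
    n_sft = int(n * ratios[0])
    n_rm = int(n * ratios[1])
    return data[:n_sft], data[n_sft:n_sft + n_rm], data[n_sft + n_rm:]
```

For a 100-example dataset this yields portions of 30, 65, and 5 examples.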
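The pairwise loss Lpair and the τhigh/τlow self-labeling step from Algorithm 1 can be sketched as follows. This is a minimal Bradley-Terry-style illustration: the function names, and the use of the pairwise win probability as the confidence score, are our assumptions rather than the authors' exact implementation.

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def self_label(reward_model, pair, tau_high=0.55, tau_low=0.45):
    """Label an unlabeled (response_a, response_b) pair with the current RM.

    Returns (chosen, rejected) when the RM is confident enough, else None.
    Confidence is taken here as P(a beats b) under the Bradley-Terry model;
    the thresholds match the paper's selected tau_high/tau_low values, the
    rest is an illustrative sketch.
    """
    a, b = pair
    p_a_wins = 1.0 / (1.0 + math.exp(-(reward_model(a) - reward_model(b))))
    if p_a_wins >= tau_high:
        return (a, b)   # confidently prefer a
    if p_a_wins <= tau_low:
        return (b, a)   # confidently prefer b
    return None         # ambiguous: leave the pair unlabeled
```

With a toy length-based reward model, `self_label(lambda t: 0.1 * len(t), ("a longer, detailed answer", "short"))` confidently prefers the longer response, while a pair with equal rewards falls between the thresholds and stays unlabeled.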