Joint Reward and Policy Learning with Demonstrations and Human Feedback Improves Alignment
Authors: Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide ample empirical evidence that the proposed AIHF solution outperforms the existing alignment algorithms by large margins, especially when the data is unbalanced, where the quality and/or quantity of one data category is worse/smaller than that of the other. ... In this section, we provide numerical evaluations of the proposed method (5) (Alg. 1) and its variants (10) and (11), and compare them with the state-of-the-art methods RLHF (Ouyang et al., 2022), DPO (Rafailov et al., 2023), IPO (Calandriello et al., 2024) and SPIN (Chen et al., 2024). Our experiments demonstrate the advantages of the proposed methods... |
| Researcher Affiliation | Academia | Chenliang Li1, Siliang Zeng2 , Zeyi Liao3, Jiaxiang Li2, Dongyeop Kang2, Alfredo Garcia1, Mingyi Hong2 1Texas A&M University, College Station, 2University of Minnesota, Twin Cities, 3The Ohio State University, Columbus |
| Pseudocode | Yes | Algorithm 1: Alignment with Integrated Human Feedback (AIHF) |
| Open Source Code | No | The paper references various publicly available models and datasets (e.g., Hugging Face models, Anthropic-HH dataset) that were used in their experiments, but it does not provide any specific link or explicit statement about releasing the source code for their own proposed methodology (AIHF) or its implementation. |
| Open Datasets | Yes | Models and datasets. In the first setting, we test Alg. 1 on the Anthropic-HH (Bai et al., 2022) dataset (available at https://huggingface.co/datasets/Anthropic/hh-rlhf) with (relatively small) Pythia (Biderman et al., 2023) models (available at https://huggingface.co/EleutherAI) as policy models. Anthropic-HH is a preference dataset... We use Ultrafeedback (available at https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) as our preference dataset (61.1k preference data) and Ultrachat200k (available at https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) as the demonstration dataset (208k demonstration data), with mistral-7b-sft-beta (Jiang et al., 2023) (available at https://huggingface.co/HuggingFaceH4/mistral-7b-sft-beta) as our base model. |
| Dataset Splits | Yes | In the first setting, we test Alg. 1 on the Anthropic-HH (Bai et al., 2022) dataset... and we pick 10k chosen/preferred continuation data to form the demonstration dataset, while the others serve as the preference dataset and RL prompt dataset. ...Moreover, we also SFT the language model using the selected top 10k chosen responses and name it the demonstration-SFT model. ...We use Ultrafeedback as our preference dataset (61.1k preference data) and Ultrachat200k as the demonstration dataset (208k demonstration data)... A.2.1 MUJOCO TASKS: We use 10k demonstrations and 20k preferences. ... A.2.2 SENTIMENT-CONTROLLED GENERATION: ...on 30% of the training dataset for IMDb... Subsequently, we select an additional 40% of the training dataset and generate a response for each prompt for each checkpoint... and use the high-quality ones as demonstration datasets. ...train the proposed algorithm AIHF and baselines on the remaining 30% of prompts from the training dataset. ...use 1k preference, 1k demonstration to train the policy and reward model for RLHF and AIHF. |
| Hardware Specification | Yes | For the GPU resources, we use 8 A100 40G for all the experiments. ... We use eight NVIDIA A100-40G to do the training with a per-device batch size of 1 for 7B model. |
| Software Dependencies | No | The paper mentions several software tools and algorithms like DeepSpeed ZeRO-3 (Rajbhandari et al., 2020) and vLLM (Kwon et al., 2023), as well as PPO (Schulman et al., 2017) and SAC (Haarnoja et al., 2018). However, it does not provide specific version numbers for these tools or any other underlying software dependencies (e.g., Python, PyTorch, CUDA versions) required for reproduction. |
| Experiment Setup | Yes | We first fine-tune the language models (Pythia-160M/1B/2.8B) through supervised fine-tuning over all chosen responses from the HH dataset for 1 epoch... A.2.1 MUJOCO TASKS: ...the step size is set to 3×10⁻³; we parameterize the reward function by a (64, 64) MLP with ReLU activation function. For the reward network, we use Adam as the optimizer, and the step size is set to be 1×10⁻⁴. ... A.2.4 THE RESULT OF 7B EXPERIMENTS: We train all models with bfloat16 precision. We set the learning rate to be 5e-7 for the 7B model with the cosine learning rate scheduler, and the max sequence length is set to be 512. ...a per-device batch size of 1 for the 7B model. |
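The core idea the pseudocode row points to (Algorithm 1, AIHF) is a single objective that integrates demonstration data with preference data, rather than running SFT and RLHF as separate stages. The paper defines the exact loss; the sketch below is only a generic illustration of that kind of joint objective, combining a demonstration negative log-likelihood with a Bradley-Terry preference term. The function names and the mixing weight `alpha` are illustrative assumptions, not the paper's notation.

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the
    rejected one under a Bradley-Terry model of the reward gap."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))


def joint_objective(demo_nll: float, r_chosen: float, r_rejected: float,
                    alpha: float = 0.5) -> float:
    """Illustrative AIHF-style joint loss: a weighted sum of the
    demonstration NLL and the preference (Bradley-Terry) loss.
    `alpha` trades off the two data sources; in the unbalanced-data
    regimes the paper studies, this coupling is what lets the weaker
    data category be compensated by the stronger one."""
    return alpha * demo_nll + (1.0 - alpha) * bradley_terry_loss(r_chosen, r_rejected)
```

When the reward model ranks the chosen response far above the rejected one, the preference term vanishes and the objective reduces to the demonstration term, which is the intuition for why a joint formulation degrades gracefully when one data category is small or low-quality.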