Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF
Authors: Zhaolin Gao, Wenhao Zhan, Jonathan Chang, Gokul Swamy, Kianté Brantley, Jason Lee, Wen Sun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate our algorithm by using Llama-3.1-70B-it to simulate a user in conversation with our model. REFUEL consistently outperforms state-of-the-art methods such as DPO and REBEL across various settings. |
| Researcher Affiliation | Collaboration | Zhaolin Gao1, Wenhao Zhan2, Jonathan D. Chang3, Gokul Swamy4, Kianté Brantley5, Jason D. Lee2, Wen Sun1 — 1 Cornell University, 2 Princeton University, 3 Databricks Mosaic Research, 4 Carnegie Mellon University, 5 Harvard University |
| Pseudocode | Yes | Algorithm 1: REgressing the RElative FUturE for reinforcement Learning (REFUEL) |
| Open Source Code | Yes | Implementation of REFUEL can be found at https://github.com/ZhaolinGao/REFUEL/, and models trained by REFUEL can be found at https://huggingface.co/Cornell-AGI. |
| Open Datasets | Yes | We evaluate REFUEL on UltraInteract (Yuan et al., 2024), which involves the model responding to instructions with complex reasoning tasks, covering general chat scenarios. |
| Dataset Splits | Yes | LT-OFFLINE — turn distribution (% of dataset): H=1: 76.9, H=2: 12.1, H=3: 6.40, H=4: 3.20, H=5: 1.40; Train/Val/Test: 205K/500/500; max generation length: 1024 |
| Hardware Specification | Yes | The experiments are trained on 8 H100 GPUs for two hours for each iteration. |
| Software Dependencies | Yes | We perform full parameter training for Llama-3-8B-Instruct. For ArmoRM, we directly use the reward scores without any normalizations. |
| Experiment Setup | Yes | Parameter settings (Setting One), method RLOO-LT-OFFLINE: batch size 128; weight decay 1e-6; learning rate 3e-7; schedule: cosine decay; warmup ratio 0.1 |
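The reported schedule (learning rate 3e-7, cosine decay, warmup ratio 0.1) can be sketched as a standalone function. This is a minimal illustration of a linear-warmup-plus-cosine-decay schedule using the table's hyperparameters; the function name and exact decay endpoint are assumptions, not taken from the REFUEL codebase.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-7, warmup_ratio=0.1):
    """Learning rate at a given optimizer step.

    Hypothetical helper: linear warmup over the first `warmup_ratio`
    fraction of steps, then cosine decay from `peak_lr` down to 0.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay over the remaining steps (progress goes 0 -> 1).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With `total_steps=1000`, the rate rises linearly to 3e-7 at step 100 and decays to 0 by step 1000; in practice libraries such as Hugging Face `transformers` provide an equivalent built-in cosine-with-warmup scheduler.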