Reinforcement Learning from Human Feedback with Active Queries
Authors: Kaixuan Ji, Jiafan He, Quanquan Gu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an Õ(d²/Δ) instance-dependent regret bound and an Õ(d²/Δ²) query complexity, where d is the dimension of the feature space and Δ is the sub-optimality gap over all the contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while making only about half the queries for human preference, matches the performance of DPO, establishing it as a data-efficient alternative to DPO. |
| Researcher Affiliation | Academia | Kaixuan Ji EMAIL Department of Computer Science University of California, Los Angeles; Jiafan He EMAIL Department of Computer Science University of California, Los Angeles; Quanquan Gu EMAIL Department of Computer Science University of California, Los Angeles |
| Pseudocode | Yes | Algorithm 1 Active Proximal Policy Optimization (APPO); Algorithm 2 Active Direct Preference Optimization (ADPO) |
| Open Source Code | Yes | The code is available at https://github.com/jkx19/ActiveQuery. |
| Open Datasets | Yes | We apply our method to train zephyr-7b-beta on the Ultrafeedback-binarized dataset (Ding et al., 2023) and zephyr-7b-gemma on the dpo-mix-7k dataset. Our experiments show that while ADPO makes fewer than half the number of queries, the model trained by ADPO achieves comparable or better performance than DPO on our selected benchmarks, including MT-Bench (Zheng et al., 2024) and AlpacaEval 2.0. ... Specifically, we use Ultrafeedback-binarized (Ding et al., 2023) to train Zephyr-Beta-SFT and dpo-mix-7k (footnote 9) to train Zephyr-Gemma-SFT. ... 9: https://huggingface.co/datasets/argilla/dpo-mix-7k |
| Dataset Splits | No | The paper mentions using specific datasets (Ultrafeedback-binarized and dpo-mix-7k) and states their sizes, but does not provide explicit details about how these datasets were split into training, validation, or test sets for reproducibility. |
| Hardware Specification | Yes | We trained our models on 4 NVIDIA A100 GPUs, with about 80G memory for each GPU. ... We trained our models on 4 NVIDIA RTX A6000 GPUs, with about 49G memory for each GPU. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers, such as Python versions or library versions (e.g., PyTorch 1.9, CUDA 11.1). |
| Experiment Setup | Yes | We set the learning rate to 5e-7 for both DPO and ADPO. We use a linear learning rate scheduler with a warm-up ratio of 0.1. The batch size per device is set to 4 and gradients are accumulated every 4 steps, resulting in an effective batch size of 64. We set dropout to 0.1 and the regularization parameter β = 0.1 for both DPO and ADPO. For both ADPO and its counterpart without pseudo labels, we set the uncertainty threshold γ = 1.3. We trained one epoch for both DPO and ADPO, which takes roughly 9 hours for both methods. |
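For convenience, the reported hyperparameters can be collected into a single training configuration. The fragment below is a hypothetical alignment-handbook-style YAML file assembled from the numbers quoted in the table, not a file from the authors' repository; in particular, `uncertainty_threshold` is a made-up key name for the ADPO-specific γ:

```yaml
# Hyperparameters as reported in the paper (key names are illustrative)
learning_rate: 5.0e-7
lr_scheduler_type: linear
warmup_ratio: 0.1
per_device_train_batch_size: 4
gradient_accumulation_steps: 4   # with 4 GPUs: 4 x 4 x 4 = effective batch size 64
dropout: 0.1
beta: 0.1                        # DPO regularization parameter
uncertainty_threshold: 1.3       # gamma; ADPO-specific, hypothetical key name
num_train_epochs: 1
```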
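The uncertainty threshold γ = 1.3 quoted above gates ADPO's active querying: a human preference is requested only where the model is uncertain, and the model's own prediction serves as a pseudo label otherwise. Below is a minimal sketch of such a decision rule, assuming for illustration that uncertainty is measured by the magnitude of the DPO implicit reward margin — this is a plausible instantiation, not the paper's exact definition, and all function names are hypothetical:

```python
BETA = 0.1    # DPO regularization parameter, as reported in the paper
GAMMA = 1.3   # uncertainty threshold, as reported in the paper


def implicit_reward_margin(logp_1, logp_2, ref_logp_1, ref_logp_2, beta=BETA):
    """DPO implicit reward margin between responses y1 and y2 for one prompt.

    Each argument is a (reference-)policy log-probability of a response.
    """
    return beta * ((logp_1 - ref_logp_1) - (logp_2 - ref_logp_2))


def decide_query(logp_1, logp_2, ref_logp_1, ref_logp_2, beta=BETA, gamma=GAMMA):
    """Return ('query', None) when the margin is small (model uncertain),
    otherwise ('pseudo_label', index_of_preferred_response)."""
    m = implicit_reward_margin(logp_1, logp_2, ref_logp_1, ref_logp_2, beta)
    if abs(m) < gamma:
        return "query", None          # ask the human annotator
    return "pseudo_label", 0 if m > 0 else 1
```

Under this rule, confidently ranked pairs are labeled by the model itself, which is how a method of this kind can halve the number of human queries while training on the full dataset.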