Reinforcement Learning from Human Feedback with Active Queries

Authors: Kaixuan Ji, Jiafan He, Quanquan Gu

TMLR 2025

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an Õ(d²/Δ) instance-dependent regret bound and an Õ(d²/Δ²) query complexity, where d is the dimension of the feature space and Δ is the sub-optimality gap over all the contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while making only about half as many queries for human preferences, matches the performance of DPO, establishing it as a data-efficient alternative to DPO.
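The active-query idea summarized above, namely querying the human annotator only when the model is uncertain about a preference pair, can be sketched as a threshold test. The function below is a hypothetical illustration: the function name, signature, and the use of the DPO implicit reward margin as the uncertainty measure are assumptions for exposition, not the paper's exact predictor.

```python
def should_query(logp_chosen, logp_rejected,
                 ref_logp_chosen, ref_logp_rejected,
                 beta=0.1, gamma=1.3):
    """Hypothetical active-query rule for preference data.

    Computes the DPO-style implicit reward margin between two responses and
    queries the human only when its magnitude falls below the uncertainty
    threshold gamma; otherwise the model's own preference is kept as a
    pseudo label.

    Returns (query_human, pseudo_label): pseudo_label is None when a human
    query is requested, else True if the first response is preferred.
    """
    # Implicit reward margin, as in the DPO objective (beta scales the
    # log-probability ratios against the reference policy).
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    if abs(margin) < gamma:
        return True, None          # uncertain: ask for a human preference
    return False, margin > 0       # confident: keep the model's pseudo label
```

Under this sketch, only the uncertain pairs incur a human query, which is the mechanism by which ADPO halves the query count while training on the full dataset.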
Researcher Affiliation: Academia. Kaixuan Ji (EMAIL), Department of Computer Science, University of California, Los Angeles; Jiafan He (EMAIL), Department of Computer Science, University of California, Los Angeles; Quanquan Gu (EMAIL), Department of Computer Science, University of California, Los Angeles.
Pseudocode: Yes. Algorithm 1: Active Proximal Policy Optimization (APPO); Algorithm 2: Active Direct Preference Optimization (ADPO).
Open Source Code: Yes. The code is available at https://github.com/jkx19/ActiveQuery.
Open Datasets: Yes. We apply our method to train zephyr-7b-beta on the Ultrafeedback-binarized dataset (Ding et al., 2023) and zephyr-7b-gemma on the dpo-mix-7k dataset. Our experiments show that while ADPO makes fewer than half the number of queries, the model trained by ADPO achieves comparable or better performance than DPO on our selected benchmarks, including MT-Bench (Zheng et al., 2024) and AlpacaEval 2.0. ... Specifically, we use Ultrafeedback-binarized (Ding et al., 2023) to train Zephyr-Beta-SFT and dpo-mix-7k (https://huggingface.co/datasets/argilla/dpo-mix-7k) to train Zephyr-Gemma-SFT.
Dataset Splits: No. The paper mentions using specific datasets (Ultrafeedback-binarized and dpo-mix-7k) and states their sizes, but does not provide explicit details about how these datasets were split into training, validation, or test sets for reproducibility.
Hardware Specification: Yes. We trained our models on 4 NVIDIA A100 GPUs, with about 80 GB of memory per GPU. ... We trained our models on 4 NVIDIA RTX A6000 GPUs, with about 49 GB of memory per GPU.
Software Dependencies: No. The paper does not list specific software dependencies with version numbers, such as Python or library versions (e.g., PyTorch 1.9, CUDA 11.1).
Experiment Setup: Yes. We set the learning rate to 5e-7 for both DPO and ADPO. We use a linear learning rate scheduler with a warm-up ratio of 0.1. The batch size per device is set to 4 and the gradients are accumulated every 4 steps, resulting in an equivalent batch size of 64. We set dropout to 0.1 and the regularization parameter β = 0.1 for both DPO and ADPO. For both ADPO and its counterpart without pseudo labels, we set the uncertainty threshold γ = 1.3. We trained for one epoch with both DPO and ADPO, which takes roughly 9 hours for either method.
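As a sanity check on the stated configuration, the reported equivalent batch size of 64 follows from the per-device batch of 4, gradient accumulation over 4 steps, and the 4 GPUs listed in the hardware row:

```python
per_device_batch_size = 4   # batch size per GPU, as reported
grad_accum_steps = 4        # gradients accumulated every 4 steps
num_gpus = 4                # 4x NVIDIA A100 (or RTX A6000), per the hardware row

# Effective (equivalent) batch size seen by each optimizer step.
effective_batch_size = per_device_batch_size * grad_accum_steps * num_gpus
print(effective_batch_size)  # 64, matching the reported equivalent batch size
```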