Reinforcement Learning from Human Feedback with Active Queries
Authors: Kaixuan Ji, Jiafan He, Quanquan Gu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an Õ(d²/Δ) instance-dependent regret bound and an Õ(d²/Δ²) query complexity, where d is the dimension of the feature space and Δ is the sub-optimality gap over all the contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while making only about half the queries for human preference, matches the performance of DPO, establishing it as a data-efficient alternative to DPO. |
| Researcher Affiliation | Academia | Kaixuan Ji EMAIL Department of Computer Science University of California, Los Angeles; Jiafan He EMAIL Department of Computer Science University of California, Los Angeles; Quanquan Gu EMAIL Department of Computer Science University of California, Los Angeles |
| Pseudocode | Yes | Algorithm 1 Active Proximal Policy Optimization (APPO); Algorithm 2 Active Direct Preference Optimization (ADPO) |
| Open Source Code | Yes | The code is available at https://github.com/jkx19/ActiveQuery. |
| Open Datasets | Yes | We apply our method to train zephyr-7b-beta on the Ultrafeedback-binarized dataset (Ding et al., 2023) and zephyr-7b-gemma on the dpo-mix-7k dataset. Our experiments show that while ADPO makes fewer than half the number of queries, the model trained by ADPO achieves comparable or better performance than DPO on our selected benchmarks, including MT-Bench (Zheng et al., 2024) and AlpacaEval 2.0. ... Specifically, we use Ultrafeedback-binarized (Ding et al., 2023) to train Zephyr-Beta-SFT and dpo-mix-7k (footnote 9) to train Zephyr-Gemma-SFT. ... 9: https://huggingface.co/datasets/argilla/dpo-mix-7k |
| Dataset Splits | No | The paper mentions using specific datasets (Ultrafeedback-binarized and dpo-mix-7k) and states their sizes, but does not provide explicit details about how these datasets were split into training, validation, or test sets for reproducibility. |
| Hardware Specification | Yes | We trained our models on 4 NVIDIA A100 GPUs, with about 80G memory for each GPU. ... We trained our models on 4 NVIDIA RTX A6000 GPUs, with about 49G memory for each GPU. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers, such as Python versions or library versions (e.g., PyTorch 1.9, CUDA 11.1). |
| Experiment Setup | Yes | We set the learning rate to 5e-7 for both DPO and ADPO. We use a linear learning rate scheduler with a warm-up ratio of 0.1. The batch size per device is set to 4 and gradients are accumulated every 4 steps, resulting in an effective batch size of 64. We set dropout to 0.1 and the regularization parameter β = 0.1 for both DPO and ADPO. For both ADPO and its counterpart without pseudo labels, we set the uncertainty threshold γ = 1.3. We trained one epoch for both DPO and ADPO, which takes roughly 9 hours for both methods. |
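For convenience, the reported hyperparameters can be collected into a single training configuration. The fragment below is a hypothetical alignment-handbook-style YAML file assembled from the numbers quoted in the table, not a file from the authors' repository; in particular, `uncertainty_threshold` is a made-up key name for the ADPO-specific γ:

```yaml
# Hyperparameters as reported in the paper (key names are illustrative)
learning_rate: 5.0e-7
lr_scheduler_type: linear
warmup_ratio: 0.1
per_device_train_batch_size: 4
gradient_accumulation_steps: 4   # with 4 GPUs: 4 x 4 x 4 = effective batch size 64
dropout: 0.1
beta: 0.1                        # DPO regularization parameter
uncertainty_threshold: 1.3       # gamma; ADPO-specific, hypothetical key name
num_train_epochs: 1
```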
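The uncertainty threshold γ = 1.3 quoted above gates ADPO's active querying: a human preference is requested only where the model is uncertain, and the model's own prediction serves as a pseudo label otherwise. Below is a minimal sketch of such a decision rule, assuming for illustration that uncertainty is measured by the magnitude of the DPO implicit reward margin — this is a plausible instantiation, not the paper's exact definition, and all function names are hypothetical:

```python
BETA = 0.1    # DPO regularization parameter, as reported in the paper
GAMMA = 1.3   # uncertainty threshold, as reported in the paper


def implicit_reward_margin(logp_1, logp_2, ref_logp_1, ref_logp_2, beta=BETA):
    """DPO implicit reward margin between responses y1 and y2 for one prompt.

    Each argument is a (reference-)policy log-probability of a response.
    """
    return beta * ((logp_1 - ref_logp_1) - (logp_2 - ref_logp_2))


def decide_query(logp_1, logp_2, ref_logp_1, ref_logp_2, beta=BETA, gamma=GAMMA):
    """Return ('query', None) when the margin is small (model uncertain),
    otherwise ('pseudo_label', index_of_preferred_response)."""
    m = implicit_reward_margin(logp_1, logp_2, ref_logp_1, ref_logp_2, beta)
    if abs(m) < gamma:
        return "query", None          # ask the human annotator
    return "pseudo_label", 0 if m > 0 else 1
```

Under this rule, confidently ranked pairs are labeled by the model itself, which is how a method of this kind can halve the number of human queries while training on the full dataset.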