Online Preference Alignment for Language Models via Count-based Exploration

Authors: Chenjia Bai, Yang Zhang, Shuang Qiu, Qiaosheng Zhang, Kang Xu, Xuelong Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct online RLHF experiments on Zephyr and Llama-3 models. The results on instruction-following and standard academic benchmarks show that COPO significantly increases performance."
Researcher Affiliation | Collaboration | Chenjia Bai (1,6), Yang Zhang (2,1), Shuang Qiu (3), Qiaosheng Zhang (4), Kang Xu (5), Xuelong Li (1). 1: Institute of Artificial Intelligence (TeleAI), China Telecom; 2: Tsinghua University; 3: City University of Hong Kong; 4: Shanghai AI Laboratory; 5: Tencent AI Lab; 6: Shenzhen Research Institute of Northwestern Polytechnical University. EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Count-based Online Preference Optimization (COPO).
Require: reference model π_ref, preference dataset D, online iterations T, optimism coefficient α.
1: for iteration t = 1, 2, ..., T do
2:   Set D_t as the t-th portion of D and generate y ~ π_ref(· | x) for each prompt x in D_t.
3:   Rank {y, y_w, y_l} with the score model and obtain D_t containing the best and worst responses.
4:   Update the coin-flipping network parameter ϑ via min_{f_ϑ} J_cfn(f_ϑ; D_cfn), per Eq. (18).
5:   Update the LLM policy via max_{π_φ} J_copo(π_φ; D_t), defined in Eq. (16), and set π_ref ← π_φ.
6: end for
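The optimism term behind steps 4–5 can be illustrated with a toy sketch. This is not the paper's coin-flipping network: it replaces the learned pseudo-count estimator f_ϑ with an exact dictionary counter (only feasible in tiny toy settings), and shows the UCB-style bonus α/√N(x, y) that COPO adds on top of the preference objective. All names here are hypothetical.

```python
import math
from collections import Counter

class CountBonus:
    """Toy pseudo-count bonus: alpha / sqrt(N(x, y)).

    Stands in for COPO's coin-flipping network, which *estimates*
    visitation counts of (prompt, response) pairs; here we count
    exact occurrences instead.
    """

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha          # optimism coefficient (Table 4 uses 0.1 / 0.01)
        self.counts = Counter()     # N(x, y), keyed by (prompt, response)

    def update(self, prompt: str, response: str) -> None:
        """Record one visit of (prompt, response)."""
        self.counts[(prompt, response)] += 1

    def bonus(self, prompt: str, response: str) -> float:
        """Exploration bonus; unseen pairs get the maximal bonus alpha."""
        n = self.counts[(prompt, response)]
        return self.alpha / math.sqrt(n) if n > 0 else self.alpha

bonus = CountBonus(alpha=0.1)
bonus.update("x1", "y1")
bonus.update("x1", "y1")
print(round(bonus.bonus("x1", "y1"), 4))  # alpha / sqrt(2) ≈ 0.0707
```

Rarely sampled responses keep a large bonus, steering the online policy toward under-explored regions of the response space.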
Open Source Code | Yes | "The code is released at https://github.com/Baichenjia/COPO."
Open Datasets | Yes | "For preference alignment of LLMs, we select UltraFeedback 60K (Cui et al., 2023), which contains preference pairs of single-turn conversations, as the preference dataset D = {x, y_w, y_l}."
Dataset Splits | No | "For the t-th iteration of online preference alignment, we generate a response y for each prompt x in D_t with the updated LLM, where D_t is the t-th portion of the whole dataset D. ... Specifically, we use a subset of the UltraFeedback dataset with only 20% of the samples, then we train online DPO, SELM, and COPO for 3 iterations."
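The per-iteration portioning described in the quote can be sketched as a plain list partition; the function name and near-equal split policy are illustrative assumptions, not taken from the COPO code.

```python
def split_into_portions(dataset, num_iterations):
    """Split a dataset into T near-equal portions, so that online
    iteration t trains on the t-th portion D_t."""
    base, rem = divmod(len(dataset), num_iterations)
    portions, start = [], 0
    for t in range(num_iterations):
        # Spread any remainder over the first `rem` portions.
        size = base + (1 if t < rem else 0)
        portions.append(dataset[start:start + size])
        start += size
    return portions

prompts = [f"prompt_{i}" for i in range(10)]
portions = split_into_portions(prompts, 3)
print([len(p) for p in portions])  # [4, 3, 3]
```

Every prompt lands in exactly one portion, so each online iteration sees fresh prompts rather than re-using the previous iteration's data.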
Hardware Specification | Yes | "The training of our method is conducted on 4x A100-40G GPUs." ... "Hardware: 4 x NVIDIA A100 40G"
Software Dependencies | No | "The implementation of COPO builds on the Alignment Handbook (https://github.com/huggingface/alignment-handbook), which provides the basic DPO algorithm based on the TRL repo (https://github.com/huggingface/trl). ... We use vLLM (https://github.com/vllm-project/vllm) as the inference server to perform fast sampling from the LLM policy of the previous iteration."
Experiment Setup | Yes | "We summarize the key hyperparameters in Table 4." Pre-training: Llama-3-8B-Instruct & Zephyr-7B-SFT; LoRA r: 128; LoRA alpha: 128; LoRA dropout: 0.05; Optimizer: AdamW (torch); Train epochs: 1; Per-device batch size: 2; Accelerator: DeepSpeed ZeRO-3; Learning rate: 5e-7 (1st iter), 3e-7 (2nd iter), 1e-7 (3rd iter); Learning-rate scheduler: cosine; Warmup ratio: 0.1; Exploration factor α: 0.1 (Llama) & 0.01 (Zephyr); λ: 0.01.
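The "cosine" scheduler with a 0.1 warmup ratio listed above can be sketched as a pure-Python function. This assumes the standard linear-warmup then cosine-decay-to-zero shape used by Alignment Handbook / Transformers trainers; it is not extracted from the COPO code, and the function name is hypothetical.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Linear warmup to peak_lr over the first warmup_ratio of
    training, then cosine decay from peak_lr down to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Per-iteration peak LRs from the table: 5e-7, 3e-7, 1e-7.
for it, peak in enumerate([5e-7, 3e-7, 1e-7], start=1):
    print(f"iter {it}: lr at midpoint = {lr_at(500, 1000, peak):.2e}")
```

Each online iteration restarts the schedule at its own, smaller peak learning rate, so later iterations make progressively gentler updates to the policy.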