Online Preference Alignment for Language Models via Count-based Exploration
Authors: Chenjia Bai, Yang Zhang, Shuang Qiu, Qiaosheng Zhang, Kang Xu, Xuelong Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct online RLHF experiments on Zephyr and Llama-3 models. The results on instruction-following and standard academic benchmarks show that COPO significantly increases performance. |
| Researcher Affiliation | Collaboration | Chenjia Bai^{1,6}, Yang Zhang^{2,1}, Shuang Qiu^3, Qiaosheng Zhang^4, Kang Xu^5, Xuelong Li^1. ^1 Institute of Artificial Intelligence (TeleAI), China Telecom; ^2 Tsinghua University; ^3 City University of Hong Kong; ^4 Shanghai AI Laboratory; ^5 Tencent AI Lab; ^6 Shenzhen Research Institute of Northwestern Polytechnical University. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Count-based Online Preference Optimization (COPO). Require: reference model π_ref, preference dataset D, online iterations T, optimism coefficient α. 1: for iteration t = 1, 2, ..., T do 2: Set D_t as the t-th portion of D and generate y ∼ π_ref(·\|x) for each prompt x in D_t. 3: Rank {y, y_w, y_l} with the score model and obtain D_t containing the best and worst responses. 4: Update the parameter ϑ of the coin-flipping network via min_{f_ϑ} J_cfn(f_ϑ; D_cfn) in Eq. (18). 5: Update the LLM policy via max_{π_φ} J_copo(π_φ; D_t) defined in Eq. (16) and set π_ref ← π_φ. 6: end for |
| Open Source Code | Yes | The code is released at https://github.com/Baichenjia/COPO. |
| Open Datasets | Yes | For preference alignment of LLMs, we select UltraFeedback 60K (Cui et al., 2023), which contains single-turn conversation preference pairs, as the preference dataset D = {(x, y_w, y_l)}. |
| Dataset Splits | No | For the t-th iteration of online preference alignment, we generate a response y for each prompt x in D_t with the updated LLM, where D_t is the t-th portion of the whole dataset D. ... Specifically, we use a subset of the UltraFeedback dataset with only 20% of the samples, then train online DPO, SELM, and COPO for 3 iterations. |
| Hardware Specification | Yes | The training of our method is conducted on 4x A100-40G GPUs. ... Hardware 4 x NVIDIA A100 40G |
| Software Dependencies | No | The implementation of COPO builds on the Alignment Handbook (https://github.com/huggingface/alignment-handbook), which provides the basic DPO algorithm based on the TRL repo (https://github.com/huggingface/trl). ... We use vLLM (https://github.com/vllm-project/vllm) as the inference server to perform fast sampling from the LLM policy of the previous iteration. |
| Experiment Setup | Yes | We summarize the key hyperparameters in Table 4. ... Pre-training: Llama-3-8B-Instruct & Zephyr-7B-SFT; LoRA r: 128; LoRA alpha: 128; LoRA dropout: 0.05; Optimizer: AdamW (torch); Train epochs: 1; Per-device batch size: 2; Accelerator: DeepSpeed ZeRO-3; Learning rate: 5e-7 (1st iter), 3e-7 (2nd iter), 1e-7 (3rd iter); Learning rate scheduler: cosine; Learning rate warmup ratio: 0.1; ... Exploration factor (α): 0.1 (Llama) & 0.01 (Zephyr); ... λ: 0.01 |
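To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch of the COPO outer loop. All function names (`sample`, `rank`, `update_cfn`, `update_policy`) are hypothetical placeholders injected as callables, not the API of the released code at https://github.com/Baichenjia/COPO; the sketch only mirrors the loop structure quoted in the pseudocode row above.

```python
def copo_training_loop(pi_ref, dataset, T, alpha,
                       sample, rank, update_cfn, update_policy):
    """Sketch of Algorithm 1 (COPO): T online preference-alignment iterations.

    dataset: list of (prompt x, chosen y_w, rejected y_l) preference tuples.
    sample/rank/update_cfn/update_policy: caller-supplied placeholders for
    LLM sampling, score-model ranking, the coin-flipping-network fit
    (Eq. 18), and the optimistic policy update (Eq. 16), respectively.
    """
    pi = pi_ref
    # Split D into T portions; D_t is the t-th portion.
    portions = [dataset[i::T] for i in range(T)]
    for t in range(T):
        D_t = []
        for x, y_w, y_l in portions[t]:
            y = sample(pi, x)                      # generate y ~ pi(.|x)
            best, worst = rank(x, [y, y_w, y_l])   # score-model ranking
            D_t.append((x, best, worst))
        update_cfn(D_t)                            # fit coin-flipping network
        pi = update_policy(pi, D_t, alpha)         # maximize J_copo
        pi_ref = pi                                # refresh reference model
    return pi
```

The partition `dataset[i::T]` is one simple way to realize "the t-th portion of D"; the paper does not specify the split scheme, so any disjoint partition of the prompt set would serve equally well here.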