Online Preference Alignment for Language Models via Count-based Exploration

Authors: Chenjia Bai, Yang Zhang, Shuang Qiu, Qiaosheng Zhang, Kang Xu, Xuelong Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct online RLHF experiments on Zephyr and Llama-3 models. The results on instruction-following and standard academic benchmarks show that COPO significantly increases performance."
Researcher Affiliation | Collaboration | Chenjia Bai (1,6), Yang Zhang (2,1), Shuang Qiu (3), Qiaosheng Zhang (4), Kang Xu (5), Xuelong Li (1). 1: Institute of Artificial Intelligence (TeleAI), China Telecom; 2: Tsinghua University; 3: City University of Hong Kong; 4: Shanghai AI Laboratory; 5: Tencent AI Lab; 6: Shenzhen Research Institute of Northwestern Polytechnical University. EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Count-based Online Preference Optimization (COPO).
Require: reference model π_ref, preference dataset D, online iterations T, optimism coefficient α.
1: for iteration t = 1, 2, ..., T do
2:   Set D_t as the t-th portion of D and generate y ~ π_ref(· | x) for each prompt x in D_t.
3:   Rank {y, y_w, y_l} with the score model and obtain D_t containing the best and worst responses.
4:   Update the coin-flipping network parameter ϑ via min_{f_ϑ} J_cfn(f_ϑ; D_cfn), per Eq. (18).
5:   Update the LLM policy via max_{π_φ} J_copo(π_φ; D_t), defined in Eq. (16), and set π_ref ← π_φ.
6: end for
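The optimism term behind steps 4–5 can be illustrated with a toy sketch. This is not the paper's coin-flipping network: it replaces the learned pseudo-count estimator f_ϑ with an exact dictionary counter (only feasible in tiny toy settings), and shows the UCB-style bonus α/√N(x, y) that COPO adds on top of the preference objective. All names here are hypothetical.

```python
import math
from collections import Counter

class CountBonus:
    """Toy pseudo-count bonus: alpha / sqrt(N(x, y)).

    Stands in for COPO's coin-flipping network, which *estimates*
    visitation counts of (prompt, response) pairs; here we count
    exact occurrences instead.
    """

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha          # optimism coefficient (Table 4 uses 0.1 / 0.01)
        self.counts = Counter()     # N(x, y), keyed by (prompt, response)

    def update(self, prompt: str, response: str) -> None:
        """Record one visit of (prompt, response)."""
        self.counts[(prompt, response)] += 1

    def bonus(self, prompt: str, response: str) -> float:
        """Exploration bonus; unseen pairs get the maximal bonus alpha."""
        n = self.counts[(prompt, response)]
        return self.alpha / math.sqrt(n) if n > 0 else self.alpha

bonus = CountBonus(alpha=0.1)
bonus.update("x1", "y1")
bonus.update("x1", "y1")
print(round(bonus.bonus("x1", "y1"), 4))  # alpha / sqrt(2) ≈ 0.0707
```

Rarely sampled responses keep a large bonus, steering the online policy toward under-explored regions of the response space.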
Open Source Code | Yes | "The code is released at https://github.com/Baichenjia/COPO."
Open Datasets | Yes | "For preference alignment of LLMs, we select UltraFeedback 60K (Cui et al., 2023), which contains preference pairs of single-turn conversations, as the preference dataset D = {x, y_w, y_l}."
Dataset Splits | No | "For the t-th iteration of online preference alignment, we generate a response y for each prompt x in D_t with the updated LLM, where D_t is the t-th portion of the whole dataset D. ... Specifically, we use a subset of the UltraFeedback dataset with only 20% of the samples, then we train online DPO, SELM, and COPO for 3 iterations."
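The per-iteration portioning described in the quote can be sketched as a plain list partition; the function name and near-equal split policy are illustrative assumptions, not taken from the COPO code.

```python
def split_into_portions(dataset, num_iterations):
    """Split a dataset into T near-equal portions, so that online
    iteration t trains on the t-th portion D_t."""
    base, rem = divmod(len(dataset), num_iterations)
    portions, start = [], 0
    for t in range(num_iterations):
        # Spread any remainder over the first `rem` portions.
        size = base + (1 if t < rem else 0)
        portions.append(dataset[start:start + size])
        start += size
    return portions

prompts = [f"prompt_{i}" for i in range(10)]
portions = split_into_portions(prompts, 3)
print([len(p) for p in portions])  # [4, 3, 3]
```

Every prompt lands in exactly one portion, so each online iteration sees fresh prompts rather than re-using the previous iteration's data.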
Hardware Specification | Yes | "The training of our method is conducted on 4x A100-40G GPUs." ... "Hardware: 4 x NVIDIA A100 40G"
Software Dependencies | No | "The implementation of COPO builds on the Alignment Handbook (https://github.com/huggingface/alignment-handbook), which provides the basic DPO algorithm based on the TRL repo (https://github.com/huggingface/trl). ... We use vLLM (https://github.com/vllm-project/vllm) as the inference server to perform fast sampling from the LLM policy of the previous iteration."
Experiment Setup | Yes | "We summarize the key hyperparameters in Table 4." Pre-training: Llama-3-8B-Instruct & Zephyr-7B-SFT; LoRA r: 128; LoRA alpha: 128; LoRA dropout: 0.05; Optimizer: AdamW (torch); Train epochs: 1; Per-device batch size: 2; Accelerator: DeepSpeed ZeRO-3; Learning rate: 5e-7 (1st iter), 3e-7 (2nd iter), 1e-7 (3rd iter); Learning-rate scheduler: cosine; Warmup ratio: 0.1; Exploration factor α: 0.1 (Llama) & 0.01 (Zephyr); λ: 0.01.
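The "cosine" scheduler with a 0.1 warmup ratio listed above can be sketched as a pure-Python function. This assumes the standard linear-warmup then cosine-decay-to-zero shape used by Alignment Handbook / Transformers trainers; it is not extracted from the COPO code, and the function name is hypothetical.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Linear warmup to peak_lr over the first warmup_ratio of
    training, then cosine decay from peak_lr down to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Per-iteration peak LRs from the table: 5e-7, 3e-7, 1e-7.
for it, peak in enumerate([5e-7, 3e-7, 1e-7], start=1):
    print(f"iter {it}: lr at midpoint = {lr_at(500, 1000, peak):.2e}")
```

Each online iteration restarts the schedule at its own, smaller peak learning rate, so later iterations make progressively gentler updates to the policy.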