On-the-fly Preference Alignment via Principle-Guided Decoding

Authors: Mingye Zhu, Yi Liu, Lei Zhang, Junbo Guo, Zhendong Mao

ICLR 2025

Reproducibility (columns: Variable | Result | LLM Response)
Research Type Experimental 4 EXPERIMENTS. To comprehensively evaluate the effect of the proposed OPAD, we focus on general alignment and personalized alignment tasks. For general alignment, we use two widely employed datasets in RLHF studies: HH-RLHF... and the Summarization dataset... For personalized alignment, we leverage the Domain-Specific Preference (DSP) dataset... and the P-soups dataset... We calculate Perplexity (PPL) using GPT2 as an oracle model to assess fluency and coherence in the dialogue task, and the ROUGE score to evaluate resemblance to human-written summaries with Mistral as the base model. Additionally, we report the Distinct-1 and Distinct-2 metrics to measure the diversity of the model's generations. Table 1 shows a direct comparison of OPAD with the baselines on general alignment tasks.
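The Distinct-1 and Distinct-2 diversity metrics quoted above can be illustrated with a minimal sketch. This is not the authors' evaluation code; it assumes the standard definition of Distinct-n (unique n-grams divided by total n-grams across all generations) and simple whitespace tokenization:

```python
def distinct_n(texts, n):
    """Distinct-n diversity: ratio of unique n-grams to total n-grams
    across all generated texts. Tokenization here is plain whitespace
    splitting, an assumption of this sketch."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# Toy example with two short generations.
gens = ["the cat sat", "the cat ran"]
d1 = distinct_n(gens, 1)  # 4 unique unigrams / 6 total
d2 = distinct_n(gens, 2)  # 3 unique bigrams / 4 total
```

Higher values indicate less repetition across the model's outputs; Distinct-1 and Distinct-2 use n = 1 and n = 2 respectively.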
Researcher Affiliation Collaboration 1University of Science and Technology of China; 2State Key Laboratory of Communication Content Cognition, People's Daily Online
Pseudocode Yes
Algorithm 1 OPAD-guided decoding
Input: query x, base policy πθ, principle c
1: Get the constrained and unconstrained probability distributions πθ(yt|x, c, y<t) and πθ(yt|x, y<t) for the current time step t
2: Estimate the reward rθ(x, y<t, c) according to Equation 4
3: Modify the base policy using the reward to form the principle-guided policy pθ(yt|x, y<t, c) based on Equation 6
4: Sample yt ∼ pθ(yt|x, y<t, c)
5: return yt
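One decoding step of the algorithm can be sketched in Python. Note that Equations 4 and 6 are not reproduced in this report, so the exact reward and policy-update forms below are assumptions: the sketch takes the reward to be the per-token log-ratio of the constrained to unconstrained distributions and tilts the constrained distribution by exp(β·r). The distributions are stand-in arrays rather than real model outputs:

```python
import numpy as np

def opad_step(p_constrained, p_unconstrained, beta=1.0):
    """One OPAD-guided decoding step (sketch, not the paper's exact equations).

    p_constrained   : next-token distribution πθ(y_t | x, c, y_<t)
    p_unconstrained : next-token distribution πθ(y_t | x, y_<t)

    Assumed reward (stand-in for Eq. 4): r = log p_c - log p_u.
    Assumed update (stand-in for Eq. 6): p ∝ p_c · exp(beta · r).
    """
    eps = 1e-12
    reward = np.log(p_constrained + eps) - np.log(p_unconstrained + eps)
    logits = np.log(p_constrained + eps) + beta * reward
    guided = np.exp(logits - logits.max())  # stable softmax
    return guided / guided.sum()

# Toy 4-token vocabulary: conditioning on the principle c shifts mass
# toward token 0, and the guided policy amplifies that shift.
p_c = np.array([0.5, 0.2, 0.2, 0.1])      # principle-conditioned
p_u = np.array([0.25, 0.25, 0.25, 0.25])  # unconditioned
guided = opad_step(p_c, p_u, beta=1.0)
next_token = int(np.argmax(guided))  # greedy decoding, as in the paper
```

In a real decoder this step would run once per token, with both distributions produced by the same frozen base model under two prompts (with and without the principle c).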
Open Source Code Yes Code can be found at: https://github.com/stevie1023/OPAD.git
Open Datasets Yes Datasets. To comprehensively evaluate the effect of the proposed OPAD, we focus on general alignment and personalized alignment tasks. For general alignment, we use two widely employed datasets in RLHF studies: HH-RLHF, a human-labeled preference dataset on helpfulness and harmlessness from Bai et al. (2022), and the Summarization dataset from Stiennon et al. (2020). For personalized alignment, we leverage the Domain-Specific Preference (DSP) dataset (Cheng et al., 2023), which is composed of domain-specific preferences from four typical domains (Academy, Business, Entertainment, and Literature), and the P-soups dataset from Personalized Soups (Jang et al., 2023).
Dataset Splits Yes We randomly sample 400 samples for each dataset during evaluation.
Hardware Specification Yes The experiments were conducted on 2 A800 GPUs, where we recorded both the generation speed (time required to generate one token) and the peak memory consumption for vanilla generation and OPAD.
Software Dependencies No The paper mentions several language models (e.g., Vicuna-7B-v1.5, Mistral-7B-Instruct, GPT4-Turbo, GPT2, Llama-3.2-1B-Instruct) used as base models or for evaluation, but does not provide specific version numbers for any ancillary software libraries or frameworks used in their implementation.
Experiment Setup Yes Experimental details. We set β to 1.0 for general alignment tasks and 2.0 for personalized alignment datasets. We apply greedy decoding to generate the responses and evaluate performance by directly comparing OPAD against the baseline methods using GPT4-Turbo, with the evaluation prompts for each task in Appendix G. For BoN, we set N to 16. For ICL, we use 5 shots. We randomly sample 400 samples for each dataset during evaluation.