On-the-fly Preference Alignment via Principle-Guided Decoding

Authors: Mingye Zhu, Yi Liu, Lei Zhang, Junbo Guo, Zhendong Mao

ICLR 2025

Reproducibility (columns: Variable | Result | LLM Response)
Research Type Experimental 4 EXPERIMENTS. To comprehensively evaluate the effect of the proposed OPAD, we focus on general alignment and personalized alignment tasks. For general alignment, we use two widely employed datasets in RLHF studies: HH-RLHF... and the Summarization dataset... For personalized alignment, we leverage the Domain-Specific Preference (DSP) dataset... and the P-soups dataset... We calculate Perplexity (PPL) using GPT2 as an oracle model to assess fluency and coherence in the dialogue task, and the ROUGE score to evaluate resemblance to human-written summaries with Mistral as the base model. Additionally, we report the Distinct-1 and Distinct-2 metrics to measure the diversity of the model's generations. Table 1 shows a direct comparison of OPAD with the baselines on general alignment tasks.
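The Distinct-1 and Distinct-2 diversity metrics quoted above can be illustrated with a minimal sketch. This is not the authors' evaluation code; it assumes the standard definition of Distinct-n (unique n-grams divided by total n-grams across all generations) and simple whitespace tokenization:

```python
def distinct_n(texts, n):
    """Distinct-n diversity: ratio of unique n-grams to total n-grams
    across all generated texts. Tokenization here is plain whitespace
    splitting, an assumption of this sketch."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# Toy example with two short generations.
gens = ["the cat sat", "the cat ran"]
d1 = distinct_n(gens, 1)  # 4 unique unigrams / 6 total
d2 = distinct_n(gens, 2)  # 3 unique bigrams / 4 total
```

Higher values indicate less repetition across the model's outputs; Distinct-1 and Distinct-2 use n = 1 and n = 2 respectively.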
Researcher Affiliation Collaboration 1University of Science and Technology of China; 2State Key Laboratory of Communication Content Cognition, People's Daily Online
Pseudocode Yes
Algorithm 1 OPAD-guided decoding
Input: query x, base policy πθ, principle c
1: Get the constrained and unconstrained probability distributions πθ(yt|x, c, y<t) and πθ(yt|x, y<t) for the current time step t
2: Estimate the reward rθ(x, y<t, c) according to Equation 4
3: Modify the base policy using the reward to form the principle-guided policy pθ(yt|x, y<t, c) based on Equation 6
4: Sample yt ∼ pθ(yt|x, y<t, c)
5: return yt
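One decoding step of the algorithm can be sketched in Python. Note that Equations 4 and 6 are not reproduced in this report, so the exact reward and policy-update forms below are assumptions: the sketch takes the reward to be the per-token log-ratio of the constrained to unconstrained distributions and tilts the constrained distribution by exp(β·r). The distributions are stand-in arrays rather than real model outputs:

```python
import numpy as np

def opad_step(p_constrained, p_unconstrained, beta=1.0):
    """One OPAD-guided decoding step (sketch, not the paper's exact equations).

    p_constrained   : next-token distribution πθ(y_t | x, c, y_<t)
    p_unconstrained : next-token distribution πθ(y_t | x, y_<t)

    Assumed reward (stand-in for Eq. 4): r = log p_c - log p_u.
    Assumed update (stand-in for Eq. 6): p ∝ p_c · exp(beta · r).
    """
    eps = 1e-12
    reward = np.log(p_constrained + eps) - np.log(p_unconstrained + eps)
    logits = np.log(p_constrained + eps) + beta * reward
    guided = np.exp(logits - logits.max())  # stable softmax
    return guided / guided.sum()

# Toy 4-token vocabulary: conditioning on the principle c shifts mass
# toward token 0, and the guided policy amplifies that shift.
p_c = np.array([0.5, 0.2, 0.2, 0.1])      # principle-conditioned
p_u = np.array([0.25, 0.25, 0.25, 0.25])  # unconditioned
guided = opad_step(p_c, p_u, beta=1.0)
next_token = int(np.argmax(guided))  # greedy decoding, as in the paper
```

In a real decoder this step would run once per token, with both distributions produced by the same frozen base model under two prompts (with and without the principle c).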
Open Source Code Yes Code can be found at: https://github.com/stevie1023/OPAD.git
Open Datasets Yes Datasets. To comprehensively evaluate the effect of the proposed OPAD, we focus on general alignment and personalized alignment tasks. For general alignment, we use two widely employed datasets in RLHF studies: HH-RLHF, a human-labeled preference dataset on helpfulness and harmlessness from Bai et al. (2022), and the Summarization dataset from Stiennon et al. (2020). For personalized alignment, we leverage the Domain-Specific Preference (DSP) dataset (Cheng et al., 2023), which is composed of domain-specific preferences from four typical domains (Academy, Business, Entertainment, and Literature), and the P-soups dataset from Personalized Soups (Jang et al., 2023).
Dataset Splits Yes We randomly sample 400 samples for each dataset during evaluation.
Hardware Specification Yes The experiments were conducted on 2 A800 GPUs, where we recorded both the generation speed (time required to generate one token) and the peak memory consumption for vanilla generation and OPAD.
Software Dependencies No The paper mentions several language models (e.g., Vicuna-7B-v1.5, Mistral-7B-Instruct, GPT4-Turbo, GPT2, Llama-3.2-1B-Instruct) used as base models or for evaluation, but does not provide specific version numbers for any ancillary software libraries or frameworks used in their implementation.
Experiment Setup Yes Experimental details. We set β to 1.0 for general alignment tasks and 2.0 for personalized alignment datasets. We apply greedy decoding to generate the responses and evaluate performance by directly comparing OPAD against the baseline methods using GPT4-Turbo, with the evaluation prompts for each task in Appendix G. For BoN, we set N to 16. For ICL, we use 5 shots. We randomly sample 400 samples for each dataset during evaluation.