PAD: Personalized Alignment of LLMs at Decoding-time

Authors: Ruizhe Chen, Xiaotian Zhang, Meng Luo, Wenhao Chai, Zuozhu Liu

ICLR 2025

Reproducibility variables, each with the assessed result and a supporting quote from the paper:
Research Type: Experimental
"Extensive experimental results demonstrate that PAD not only outperforms existing training-based alignment methods in terms of aligning with diverse preferences but also shows significant generalizability to preferences unseen during training and scalability across different base models."
Researcher Affiliation: Academia
"1 Zhejiang University; 2 National University of Singapore; 3 University of Washington"
Pseudocode: Yes
"To better illustrate the practical implementation of PAD as discussed in Section 3.4, which comprises two key components, the optimization of the Personalized Reward Model (PRM) and inference-time guided decoding with token-level personalized rewards, we detail these processes in Algorithms 1 and 2."
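A minimal sketch of what the inference-time guided decoding step might look like: at each position, the base model's top-k candidate tokens are re-scored with a token-level personalized reward, and the candidate maximizing logit + β·reward is emitted greedily. The function and the toy reward below are illustrative assumptions, not the paper's actual Algorithm 2 implementation.

```python
def guided_decode_step(base_logits, personalized_reward, beta=1.0, k=10):
    """Pick the next token greedily from the top-k base-model candidates,
    reweighted by a token-level personalized reward (hypothetical sketch)."""
    # take the k highest-logit candidate token ids
    topk = sorted(range(len(base_logits)),
                  key=lambda t: base_logits[t], reverse=True)[:k]
    # greedy choice over logit + beta * personalized token reward
    return max(topk, key=lambda t: base_logits[t] + beta * personalized_reward(t))

# toy usage: a reward that strongly prefers token id 2 overturns the base argmax
logits = [2.0, 1.9, 1.8, 0.1]
reward = lambda t: 2.0 if t == 2 else 0.0
print(guided_decode_step(logits, reward, beta=1.0, k=3))  # prints 2
```

Note that restricting the search to the top-k candidates keeps the per-step cost at k reward evaluations rather than one per vocabulary entry.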
Open Source Code: Yes
"Our model and code are available here. All code and models will be made available for reproducibility and further research."
Open Datasets: Yes
"During the development of our personalized reward model, we utilized datasets from multiple sources, including HelpSteer2 (Wang et al., 2024c), Rewards-in-Context (Yang et al., 2024b), and Safe RLHF (Dai et al., 2023). The P-Soups (Jang et al., 2023) evaluation dataset has been filtered and modified based on the Koala evaluation by Jang et al. (2023). The HelpSteer2 (Wang et al., 2024c) (validation split) dataset is a multi-aspect alignment dataset comprising 1,000 prompts."
Dataset Splits: No
"In the stage of personalized reward model training, we utilize training data from three datasets: HelpSteer2 (Wang et al., 2024c), Rewards-in-Context (Yang et al., 2024b), and Safe RLHF (Dai et al., 2023). For the UltraFeedback and HelpSteer2 datasets, we build data pairs by comparing the score annotations within the datasets."
Hardware Specification: Yes
"The training was executed on 4 NVIDIA H100 80GB GPUs, with a per-device batch size of 4. The time costs for decoding-time alignment, detailed in Table C2, are measured on a single NVIDIA H100 GPU."
Software Dependencies: No
"Our training code is based on Llama-Factory (Zheng et al., 2024). We performed model fine-tuning using the LoRA method, targeting all layers with a rank of 8."
Experiment Setup: Yes
"We employ the Llama-3-8B model (AI@Meta, 2024) as our backbone and append a linear layer directly after the embeddings, with an output dimension of 4096. During the decoding phase, we use greedy decoding over top-k candidates. We restrict the maximum lengths of the initial prompt and subsequent generations to 2,048 and 128 tokens, respectively. The hyperparameters, β = 1.0 and k = 10, are chosen to maximize the average reward on our validation datasets. We fine-tuned the model using the LoRA method, targeting all layers with a rank of 8. Training was executed on 4 NVIDIA H100 80GB GPUs with a per-device batch size of 4; to accommodate larger effective batch sizes, we employed 8 gradient accumulation steps. The learning rate was set at 5.0e-6, and the model was trained for 3 epochs with a cosine learning rate scheduler."
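The reported setup can be collected into a single configuration sketch; the key names below are ours for illustration, not Llama-Factory's actual argument names. One derived figure worth noting: the effective batch size is 4 GPUs × 4 per-device × 8 accumulation steps = 128.

```python
# Illustrative summary of the reported training/decoding hyperparameters.
# Key names are assumptions, not the actual Llama-Factory config schema.
config = {
    "base_model": "Llama-3-8B",
    "reward_head_dim": 4096,       # linear layer appended after the embeddings
    "lora_rank": 8,                # LoRA applied to all layers
    "learning_rate": 5.0e-6,
    "epochs": 3,
    "lr_scheduler": "cosine",
    "gpus": 4,                     # NVIDIA H100 80GB
    "per_device_batch_size": 4,
    "grad_accum_steps": 8,
    "decoding": {"beta": 1.0, "top_k": 10,
                 "max_prompt_len": 2048, "max_new_tokens": 128},
}

# effective batch size = gpus * per-device batch * gradient accumulation
effective_batch = (config["gpus"] * config["per_device_batch_size"]
                   * config["grad_accum_steps"])
print(effective_batch)  # prints 128
```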