Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes
Authors: Zhuocheng Gong, Jian Guan, Wei Wu, Huishuai Zhang, Dongyan Zhao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multiple benchmarks demonstrate that LPC consistently improves upon three alignment algorithms (DPO, SimPO, and IPO) using three base models (Mistral-7B, Llama3-8B, and Llama3-8B-Instruct). Furthermore, deeper analysis reveals that the learned latent codes effectively capture the differences in the distribution of human preferences and significantly enhance the robustness of alignment against noise in data. |
| Researcher Affiliation | Collaboration | ¹Wangxuan Institute of Computer Technology, Peking University; ²Ant Group; ³National Key Laboratory of General Artificial Intelligence. Correspondence to: Wei Wu <EMAIL>, Dongyan Zhao <EMAIL>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes the methodology in text and mathematical formulations but lacks a distinct pseudocode section. |
| Open Source Code | No | The paper states, "We leverage the OpenRLHF library (Hu et al., 2024) for model training," which refers to a third-party tool used by the authors, not their own source code for the methodology described in this paper. |
| Open Datasets | Yes | We utilize the widely-adopted UltraFeedback dataset (Cui et al., 2023) in experiments. The dataset is a comprehensive collection of user preferences spanning diverse domains. It contains 63,967 instances from 6 publicly available datasets, including TruthfulQA, FalseQA, EvolInstruct, UltraChat, ShareGPT, and FLAN. ... (1) Commonsense Reasoning: we employ ARC-challenge and ARC-easy (Clark et al., 2018) as the evaluation datasets. (2) Mathematical Reasoning: GSM8K (Cobbe et al., 2021), a collection of grade-school problems, is exploited for evaluation. (3) Truthfulness: we use TruthfulQA (Lin et al., 2022) to assess the honesty of aligned LLMs. |
| Dataset Splits | Yes | We randomly sample 1,000 instances for validation and an additional 1,000 instances for testing. The rest of the instances are used for training LPC and the baseline alignment methods. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions "We leverage the OpenRLHF library (Hu et al., 2024) for model training. All models are trained for one epoch, employing the AdamW optimizer (Loshchilov, 2017) and a linear learning rate scheduler..." However, it does not specify version numbers for these software components or any other key libraries. |
| Experiment Setup | Yes | All models are trained for one epoch, employing the AdamW optimizer (Loshchilov, 2017) and a linear learning rate scheduler peaking at 5e-7 with a 10% warm-up phase. The global batch size is set to 64 and the max length is 1,024. For LPC, we search λ in Eq. 8 from {0.01, 0.05, 0.1} and find λ = 0.05 yields good performance across all methods. For the DPO and SimPO methods, we regulate the deviation from the reference model by setting β in Eq. 5 and Eq. 13 to 0.1. In the case of IPO, we explore the optimal τ value in Eq. 14 from {0.01, 0.05, 0.1, 0.5} based on the validation performance and empirically choose τ = 0.01. |
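The dataset split reported above (1,000 validation instances, 1,000 test instances, remainder for training, drawn from the 63,967 UltraFeedback instances) could be reproduced with a sketch like the following. The random seed and the instance representation are assumptions for illustration; the paper does not report them.

```python
import random

def split_ultrafeedback(instances, n_val=1000, n_test=1000, seed=42):
    """Randomly carve out validation and test sets; the rest is training.

    The seed value is an assumption -- the paper does not report one.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

# UltraFeedback contains 63,967 preference instances in total.
train, val, test = split_ultrafeedback(range(63967))
print(len(train), len(val), len(test))  # 61967 1000 1000
```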
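The hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration for reference. The dict layout and key names below are illustrative conventions, not the authors' actual configuration file.

```python
# Hyperparameters as reported in the paper's experiment setup.
# The dict structure and key names are illustrative assumptions.
training_config = {
    "epochs": 1,
    "optimizer": "AdamW",
    "lr_scheduler": "linear",
    "peak_lr": 5e-7,
    "warmup_ratio": 0.10,     # 10% warm-up phase
    "global_batch_size": 64,
    "max_length": 1024,
    "lpc_lambda": 0.05,       # chosen from {0.01, 0.05, 0.1} (Eq. 8)
    "dpo_simpo_beta": 0.1,    # Eq. 5 / Eq. 13
    "ipo_tau": 0.01,          # chosen from {0.01, 0.05, 0.1, 0.5} (Eq. 14)
}
```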
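For context on what the β = 0.1 setting controls, the standard DPO objective (which the paper's Eq. 5 builds on) can be written in a few lines. This stdlib-only sketch takes per-sequence log-probabilities and is not the authors' implementation.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l are the policy's log-probs of the chosen / rejected
    response; ref_logp_* are the frozen reference model's log-probs.
    beta=0.1 matches the value reported for DPO and SimPO.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the policy prefers the chosen
    # response more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At zero margin the loss equals log 2.
print(round(dpo_loss(-10.0, -10.0, -10.0, -10.0), 4))  # 0.6931
```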