Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes

Authors: Zhuocheng Gong, Jian Guan, Wei Wu, Huishuai Zhang, Dongyan Zhao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on multiple benchmarks demonstrate that LPC consistently improves upon three alignment algorithms (DPO, SimPO, and IPO) using three base models (Mistral-7B, Llama3-8B, and Llama3-8B-Instruct). Furthermore, deeper analysis reveals that the learned latent codes effectively capture the differences in the distribution of human preferences and significantly enhance the robustness of alignment against noise in data.
Researcher Affiliation | Collaboration | (1) Wangxuan Institute of Computer Technology, Peking University; (2) Ant Group; (3) National Key Laboratory of General Artificial Intelligence. Correspondence to: Wei Wu <EMAIL>, Dongyan Zhao <EMAIL>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes the methodology in text and mathematical formulations but lacks a distinct pseudocode section.
Open Source Code | No | The paper states, "We leverage the OpenRLHF library (Hu et al., 2024) for model training," which refers to a third-party tool used by the authors, not their own source code for the methodology described in this paper.
Open Datasets | Yes | We utilize the widely-adopted UltraFeedback dataset (Cui et al., 2023) in experiments. The dataset is a comprehensive collection of user preferences spanning diverse domains. It contains 63,967 instances from 6 publicly available datasets, including TruthfulQA, FalseQA, Evol-Instruct, UltraChat, ShareGPT, and FLAN. ... (1) Commonsense Reasoning: we employ ARC-Challenge and ARC-Easy (Clark et al., 2018) as the evaluation datasets. (2) Mathematical Reasoning: GSM8K (Cobbe et al., 2021), a collection of grade-school problems, is exploited for evaluation. (3) Truthfulness: we use TruthfulQA (Lin et al., 2022) to assess the honesty of aligned LLMs.
Dataset Splits | Yes | We randomly sample 1,000 instances for validation and an additional 1,000 instances for testing. The rest of the instances are used for training LPC and the baseline alignment methods.
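The reported split procedure (1,000 validation, 1,000 test, remainder for training) can be sketched as below. This is an illustrative reconstruction, not the authors' code; the function name and fixed seed are assumptions for the sketch.

```python
import random

def split_ultrafeedback(instances, n_val=1000, n_test=1000, seed=0):
    """Randomly carve out validation and test sets; the remainder trains the model."""
    rng = random.Random(seed)  # fixed seed: an assumption, the paper does not report one
    shuffled = list(instances)
    rng.shuffle(shuffled)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

# With the 63,967 UltraFeedback instances, this leaves 61,967 for training.
train, val, test = split_ultrafeedback(range(63967))
```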
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions "We leverage the OpenRLHF library (Hu et al., 2024) for model training. All models are trained for one epoch, employing the AdamW optimizer (Loshchilov, 2017) and a linear learning rate scheduler..." However, it does not specify version numbers for these software components or any other key libraries.
Experiment Setup | Yes | All models are trained for one epoch, employing the AdamW optimizer (Loshchilov, 2017) and a linear learning rate scheduler peaking at 5e-7 with a 10% warm-up phase. The global batch size is set to 64 and the max length is 1,024. For LPC, we search λ in Eq. 8 from {0.01, 0.05, 0.1} and find λ = 0.05 yields good performance across all methods. For the DPO and SimPO methods, we regulate the deviation from the reference model by setting β in Eq. 5 and Eq. 13 to 0.1. In the case of IPO, we explore the optimal τ value in Eq. 14 from {0.01, 0.05, 0.1, 0.5} based on the validation performance and empirically choose τ = 0.01.
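The reported hyperparameters can be collected into a single configuration, with the linear warm-up/decay schedule written out explicitly. This is a minimal sketch of the setup as described in the paper, not the authors' OpenRLHF training code; the function and dictionary names are assumptions.

```python
def linear_lr_with_warmup(step, total_steps, peak_lr=5e-7, warmup_ratio=0.10):
    """Linearly warm up to peak_lr over the first 10% of steps, then decay linearly to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# Hyperparameters as reported in the paper (dictionary keys are illustrative).
HPARAMS = {
    "epochs": 1,
    "optimizer": "AdamW",
    "peak_lr": 5e-7,
    "warmup_ratio": 0.10,
    "global_batch_size": 64,
    "max_length": 1024,
    "lpc_lambda": 0.05,     # selected from {0.01, 0.05, 0.1} on validation
    "dpo_simpo_beta": 0.1,  # β in Eq. 5 and Eq. 13
    "ipo_tau": 0.01,        # selected from {0.01, 0.05, 0.1, 0.5} on validation
}
```

The schedule peaks exactly at the end of warm-up (step `warmup_steps - 1`) and reaches zero at the final step, matching the "linear scheduler peaking at 5e-7 with a 10% warm-up phase" described above.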