Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes

Authors: Zhuocheng Gong, Jian Guan, Wei Wu, Huishuai Zhang, Dongyan Zhao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on multiple benchmarks demonstrate that LPC consistently improves upon three alignment algorithms (DPO, SimPO, and IPO) using three base models (Mistral-7B, Llama3-8B, and Llama3-8B-Instruct). Furthermore, deeper analysis reveals that the learned latent codes effectively capture the differences in the distribution of human preferences and significantly enhance the robustness of alignment against noise in data.
Researcher Affiliation | Collaboration | (1) Wangxuan Institute of Computer Technology, Peking University; (2) Ant Group; (3) National Key Laboratory of General Artificial Intelligence. Correspondence to: Wei Wu <EMAIL>, Dongyan Zhao <EMAIL>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It describes the methodology in text and mathematical formulations but lacks a distinct pseudocode section.
Open Source Code | No | The paper states, "We leverage the OpenRLHF library (Hu et al., 2024) for model training," which refers to a third-party tool used by the authors, not their own source code for the methodology described in this paper.
Open Datasets | Yes | We utilize the widely-adopted UltraFeedback dataset (Cui et al., 2023) in experiments. The dataset is a comprehensive collection of user preferences spanning diverse domains. It contains 63,967 instances from 6 publicly available datasets, including TruthfulQA, FalseQA, Evol-Instruct, UltraChat, ShareGPT, and FLAN. ... (1) Commonsense Reasoning: we employ ARC-Challenge and ARC-Easy (Clark et al., 2018) as the evaluation datasets. (2) Mathematical Reasoning: GSM8K (Cobbe et al., 2021), a collection of grade-school problems, is exploited for evaluation. (3) Truthfulness: we use TruthfulQA (Lin et al., 2022) to assess the honesty of aligned LLMs.
Dataset Splits | Yes | We randomly sample 1,000 instances for validation and an additional 1,000 instances for testing. The rest of the instances are used for training LPC and the baseline alignment methods.
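The reported split procedure (1,000 validation, 1,000 test, remainder for training) can be sketched as below. This is an illustrative reconstruction, not the authors' code; the function name and fixed seed are assumptions for the sketch.

```python
import random

def split_ultrafeedback(instances, n_val=1000, n_test=1000, seed=0):
    """Randomly carve out validation and test sets; the remainder trains the model."""
    rng = random.Random(seed)  # fixed seed: an assumption, the paper does not report one
    shuffled = list(instances)
    rng.shuffle(shuffled)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

# With the 63,967 UltraFeedback instances, this leaves 61,967 for training.
train, val, test = split_ultrafeedback(range(63967))
```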
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions "We leverage the OpenRLHF library (Hu et al., 2024) for model training. All models are trained for one epoch, employing the AdamW optimizer (Loshchilov, 2017) and a linear learning rate scheduler..." However, it does not specify version numbers for these software components or any other key libraries.
Experiment Setup | Yes | All models are trained for one epoch, employing the AdamW optimizer (Loshchilov, 2017) and a linear learning rate scheduler peaking at 5e-7 with a 10% warm-up phase. The global batch size is set to 64 and the max length is 1,024. For LPC, we search λ in Eq. 8 from {0.01, 0.05, 0.1} and find λ = 0.05 yields good performance across all methods. For the DPO and SimPO methods, we regulate the deviation from the reference model by setting β in Eq. 5 and Eq. 13 to 0.1. In the case of IPO, we explore the optimal τ value in Eq. 14 from {0.01, 0.05, 0.1, 0.5} based on the validation performance and empirically choose τ = 0.01.
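The reported hyperparameters can be collected into a single configuration, with the linear warm-up/decay schedule written out explicitly. This is a minimal sketch of the setup as described in the paper, not the authors' OpenRLHF training code; the function and dictionary names are assumptions.

```python
def linear_lr_with_warmup(step, total_steps, peak_lr=5e-7, warmup_ratio=0.10):
    """Linearly warm up to peak_lr over the first 10% of steps, then decay linearly to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# Hyperparameters as reported in the paper (dictionary keys are illustrative).
HPARAMS = {
    "epochs": 1,
    "optimizer": "AdamW",
    "peak_lr": 5e-7,
    "warmup_ratio": 0.10,
    "global_batch_size": 64,
    "max_length": 1024,
    "lpc_lambda": 0.05,     # selected from {0.01, 0.05, 0.1} on validation
    "dpo_simpo_beta": 0.1,  # β in Eq. 5 and Eq. 13
    "ipo_tau": 0.01,        # selected from {0.01, 0.05, 0.1, 0.5} on validation
}
```

The schedule peaks exactly at the end of warm-up (step `warmup_steps - 1`) and reaches zero at the final step, matching the "linear scheduler peaking at 5e-7 with a 10% warm-up phase" described above.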