When Can Proxies Improve the Sample Complexity of Preference Learning?

Authors: Yuchen Zhu, Daniel Augusto De Souza, Zhengyan Shi, Mengyue Yang, Pasquale Minervini, Matt Kusner, Alexander D’Amour

ICML 2025

Reproducibility assessment (variable, result, LLM response):
Research Type: Experimental. While this work focuses on theory and verifies its claims through mathematical proofs, Appendix D provides a small-scale experiment on over-smoothing in reward learning, as well as an empirical validation of Theorems 5 and 6.
Researcher Affiliation: Collaboration. 1 Department of Computer Science, University College London; 2 University of Bristol; 3 University of Edinburgh; 4 Polytechnique Montréal; 5 Mila Quebec AI Institute; 6 Google DeepMind.
Pseudocode: No. The paper describes the model parameterisation and learning procedure in Section 4.3 and Appendix B, but presents them in narrative text and mathematical equations rather than in structured pseudocode or algorithm blocks.
Open Source Code: No. The paper contains no explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets: No. The experiments in Appendix D use synthetically generated data: 'X = R^5, P_X = N(0, I_5)' and 'Y = {1, 2, 3}'. This describes a data distribution, not a publicly accessible dataset; there are no links, DOIs, or citations to specific external datasets.
Dataset Splits: Yes. 'We train the proxy policy model on 8000 proxy samples {(x_i, y_{w,i}, y_{l,i})}_{i=1}^{8000} generated from π, then only finetune π from 35 true samples {(x_j, y_{w,j}, y_{l,j})}_{j=1}^{35} generated from π.'
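The split quoted above can be sketched as follows. This is a hypothetical placeholder, not the paper's procedure: the sampler below draws preference labels uniformly from Y = {1, 2, 3}, whereas the paper generates the winning/losing labels from the proxy and true policies, whose details are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_preference_pairs(n: int, dim: int = 5):
    """Placeholder sampler: n triples (x, y_w, y_l) with x ~ N(0, I_5)
    over X = R^5 and labels drawn from Y = {1, 2, 3}.
    (The paper draws labels from a policy; uniform labels are an assumption.)"""
    x = rng.standard_normal((n, dim))
    y_w = rng.integers(1, 4, size=n)  # "winning" (preferred) label
    y_l = rng.integers(1, 4, size=n)  # "losing" (rejected) label
    return x, y_w, y_l

proxy_data = sample_preference_pairs(8000)  # large proxy set for pretraining
true_data = sample_preference_pairs(35)     # scarce true set for fine-tuning
```

The point of the regime is the size asymmetry: 8000 cheap proxy samples versus only 35 true samples.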
Hardware Specification: No. The paper does not provide specific hardware details such as GPU/CPU models or memory specifications used for the experiments; it only mentions a general 'environment'.
Software Dependencies: No. The paper does not specify any software dependencies with version numbers; it mentions neural networks but no specific frameworks or libraries.
Experiment Setup: Yes. π is initialised as a neural network with two linear layers followed by an injective softmax layer; all weights and biases are sampled from Uniform(−1/input features, 1/input features). The true policy is initialised by scaling the logits layer of π by the temperature T = 5. The components τ, Θ, π, ϕ are parameterised as neural networks such that the Lipschitz constants are all 1; in other words, ‖Θ‖₂ = 1, L_ϕ = 1 and L_π = 1.
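A minimal numpy sketch of the quoted setup. Assumptions are flagged in comments: the uniform bound ±1/(input features) follows the quote (whose sign is garbled in extraction), and the hidden width of 16 is a placeholder the quote does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_linear(n_in: int, n_out: int):
    """Sample a linear layer's weights and biases from
    Uniform(-1/n_in, 1/n_in), per the quoted initialisation."""
    bound = 1.0 / n_in
    W = rng.uniform(-bound, bound, size=(n_in, n_out))
    b = rng.uniform(-bound, bound, size=n_out)
    return W, b

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def make_policies(n_in=5, hidden=16, n_labels=3):
    """Two linear layers followed by a softmax head; `hidden` is assumed."""
    W1, b1 = init_linear(n_in, hidden)
    W2, b2 = init_linear(hidden, n_labels)

    def logits(x):
        return (x @ W1 + b1) @ W2 + b2

    def proxy_policy(x):
        return softmax(logits(x))

    def true_policy(x, T=5.0):
        # The true policy scales the logits layer of the proxy by T = 5.
        return softmax(T * logits(x))

    return proxy_policy, true_policy

proxy_pi, true_pi = make_policies()
x = rng.standard_normal((4, 5))
probs = proxy_pi(x)
```

Scaling logits by T = 5 sharpens the softmax, so the true policy concentrates more probability mass on its preferred label than the proxy does.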