When Can Proxies Improve the Sample Complexity of Preference Learning?

Authors: Yuchen Zhu, Daniel Augusto De Souza, Zhengyan Shi, Mengyue Yang, Pasquale Minervini, Matt Kusner, Alexander D’Amour

ICML 2025

Reproducibility assessment (variable, result, LLM response):
Research Type: Experimental. While this work focuses on theory and verifies its claims through mathematical proofs, Appendix D provides a small-scale experiment on over-smoothing in reward learning, as well as an empirical validation of Theorems 5 and 6.
Researcher Affiliation: Collaboration. 1 Department of Computer Science, University College London; 2 University of Bristol; 3 University of Edinburgh; 4 Polytechnique Montréal; 5 Mila Quebec AI Institute; 6 Google DeepMind.
Pseudocode: No. The paper describes the model parameterisation and learning procedure in Section 4.3 and Appendix B, but presents them in narrative text and mathematical equations rather than in structured pseudocode or algorithm blocks.
Open Source Code: No. The paper contains no explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets: No. The experiments in Appendix D use synthetically generated data: 'X = R^5, P_X = N(0, I_5)' and 'Y = {1, 2, 3}'. This describes a data distribution, not a publicly accessible dataset; there are no links, DOIs, or citations to specific external datasets.
Dataset Splits: Yes. 'We train the proxy policy model on 8000 proxy samples {(x_i, y_{w,i}, y_{l,i})}_{i=1}^{8000} generated from π, then only finetune π from 35 true samples {(x_j, y_{w,j}, y_{l,j})}_{j=1}^{35} generated from π.'
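The split quoted above can be sketched as follows. This is a hypothetical placeholder, not the paper's procedure: the sampler below draws preference labels uniformly from Y = {1, 2, 3}, whereas the paper generates the winning/losing labels from the proxy and true policies, whose details are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_preference_pairs(n: int, dim: int = 5):
    """Placeholder sampler: n triples (x, y_w, y_l) with x ~ N(0, I_5)
    over X = R^5 and labels drawn from Y = {1, 2, 3}.
    (The paper draws labels from a policy; uniform labels are an assumption.)"""
    x = rng.standard_normal((n, dim))
    y_w = rng.integers(1, 4, size=n)  # "winning" (preferred) label
    y_l = rng.integers(1, 4, size=n)  # "losing" (rejected) label
    return x, y_w, y_l

proxy_data = sample_preference_pairs(8000)  # large proxy set for pretraining
true_data = sample_preference_pairs(35)     # scarce true set for fine-tuning
```

The point of the regime is the size asymmetry: 8000 cheap proxy samples versus only 35 true samples.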
Hardware Specification: No. The paper does not provide specific hardware details such as GPU/CPU models or memory specifications used for the experiments; it only mentions a general 'environment'.
Software Dependencies: No. The paper does not specify any software dependencies with version numbers; it mentions neural networks but no specific frameworks or libraries.
Experiment Setup: Yes. π is initialised as a neural network with two linear layers followed by an injective softmax layer; all weights and biases are sampled from Uniform(−1/input features, 1/input features). The true policy is initialised by scaling the logits layer of π by the temperature T = 5. The components τ, Θ, π, ϕ are parameterised as neural networks such that the Lipschitz constants are all 1; in other words, ‖Θ‖₂ = 1, L_ϕ = 1 and L_π = 1.
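A minimal numpy sketch of the quoted setup. Assumptions are flagged in comments: the uniform bound ±1/(input features) follows the quote (whose sign is garbled in extraction), and the hidden width of 16 is a placeholder the quote does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_linear(n_in: int, n_out: int):
    """Sample a linear layer's weights and biases from
    Uniform(-1/n_in, 1/n_in), per the quoted initialisation."""
    bound = 1.0 / n_in
    W = rng.uniform(-bound, bound, size=(n_in, n_out))
    b = rng.uniform(-bound, bound, size=n_out)
    return W, b

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def make_policies(n_in=5, hidden=16, n_labels=3):
    """Two linear layers followed by a softmax head; `hidden` is assumed."""
    W1, b1 = init_linear(n_in, hidden)
    W2, b2 = init_linear(hidden, n_labels)

    def logits(x):
        return (x @ W1 + b1) @ W2 + b2

    def proxy_policy(x):
        return softmax(logits(x))

    def true_policy(x, T=5.0):
        # The true policy scales the logits layer of the proxy by T = 5.
        return softmax(T * logits(x))

    return proxy_policy, true_policy

proxy_pi, true_pi = make_policies()
x = rng.standard_normal((4, 5))
probs = proxy_pi(x)
```

Scaling logits by T = 5 sharpens the softmax, so the true policy concentrates more probability mass on its preferred label than the proxy does.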