Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Preference learning made easy: Everything should be understood through win rate

Authors: Lily H. Zhang, Rajesh Ranganath

ICML 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We next compare WRO and non-WRO experimentally to complement the above theoretical analysis with an empirical one." |
| Researcher Affiliation | Academia | "1 Center for Data Science, New York University, New York, USA; 2 Courant Institute, New York University, New York, USA. Correspondence to: Lily H. Zhang <EMAIL>." |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks; procedural steps are described in paragraph form or in mathematical notation. |
| Open Source Code | No | The paper mentions the TRL library (von Werra et al., 2020) for its PPO implementation but provides no link to, or statement about releasing, the authors' own code for the described methodology. |
| Open Datasets | Yes | "We employ Pythia-2.8b (Biderman et al., 2023) as our base model and the Open Assistant (OASST) (Kopf et al., 2023) and Anthropic Helpfulness and Harmlessness (HH) (Bai et al., 2022) as datasets." |
| Dataset Splits | No | "For the Open Assistant dataset (Kopf et al., 2023), ... The dataset only has a train and validation split, so we split the original train set into a train and validation set and leave the validation set for testing / evaluation. ... We sample a set of 100 input prompts from the test set of a given dataset (same 100 prompts for all models) and perform win rate evaluation using the oracle judge for the dataset." |
| Hardware Specification | No | The paper does not provide details about the hardware used for the experiments, such as GPU/CPU models or memory. |
| Software Dependencies | No | The paper mentions Pythia-2.8b as the base model and the "PPO algorithm from the TRL library (von Werra et al., 2020)" but does not specify versions for Python, PyTorch, or TRL itself, which are needed for full reproducibility. |
| Experiment Setup | Yes | "To train these models, we utilize a batch size of 64 and learning rate of 5e-7 chosen based on a hyperparameter sweep over [1e-8, 5e-8, 1e-7, 5e-7, 1e-6] on OASST. Following Rafailov et al. (2024c), we use the RMSProp optimizer with a learning rate warm up of 150 steps and constant learning rate schedule otherwise. ... For PPO, we use a learning rate = 1e-6 (obtained through a hyperparameter sweep over [1e-7, 5e-7, 1e-6] on OASST), batch size = 128, and PPOConfig defaults for all other hyperparameters." |
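The split and evaluation-sampling procedure quoted in the Dataset Splits row can be sketched in plain Python. This is a minimal sketch, not the authors' code: the function names, the 5% validation fraction, and the random seed are illustrative assumptions; only the overall procedure (original validation split repurposed as test, new validation carved from train, a fixed set of 100 prompts shared across all models) comes from the quoted text.

```python
import random

def make_splits(oasst_train, oasst_val, val_fraction=0.05, seed=0):
    """Repurpose OASST's original validation split as the test/eval set and
    carve a new validation set out of the original train split, as described
    in the paper. val_fraction and seed are illustrative assumptions."""
    rng = random.Random(seed)
    train = list(oasst_train)
    rng.shuffle(train)
    n_val = int(len(train) * val_fraction)
    new_val, new_train = train[:n_val], train[n_val:]
    test = list(oasst_val)  # original validation split becomes test/eval
    return new_train, new_val, test

def sample_eval_prompts(test_set, n=100, seed=0):
    """Sample a fixed set of n prompts from the test split; the same prompts
    are used for win-rate evaluation of every model."""
    rng = random.Random(seed)
    return rng.sample(list(test_set), n)
```

Fixing the seed is what makes "same 100 prompts for all models" hold across evaluation runs.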
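For reference, the hyperparameters quoted in the Experiment Setup row can be collected into plain dictionaries. The key names below are illustrative, not identifiers from the paper's code; the values are taken directly from the quoted text.

```python
# Hyperparameters reported for the win-rate-optimization training runs.
TRAINING_CONFIG = {
    "batch_size": 64,
    "learning_rate": 5e-7,  # chosen from a sweep over [1e-8, 5e-8, 1e-7, 5e-7, 1e-6] on OASST
    "optimizer": "RMSProp",  # following Rafailov et al. (2024c)
    "warmup_steps": 150,
    "lr_schedule": "constant after warmup",
}

# Hyperparameters reported for PPO (run via the TRL library).
PPO_CONFIG = {
    "learning_rate": 1e-6,  # chosen from a sweep over [1e-7, 5e-7, 1e-6] on OASST
    "batch_size": 128,
    # All other hyperparameters: TRL PPOConfig defaults (version unspecified).
}
```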