Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Preference learning made easy: Everything should be understood through win rate

Authors: Lily H. Zhang, Rajesh Ranganath

ICML 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We next compare WRO and non-WRO experimentally to complement the above theoretical analysis with an empirical one." |
| Researcher Affiliation | Academia | "1 Center for Data Science, New York University, New York, USA; 2 Courant Institute, New York University, New York, USA. Correspondence to: Lily H. Zhang <EMAIL>." |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks; procedural steps are described in paragraph form or in mathematical notation. |
| Open Source Code | No | The paper mentions the TRL library (von Werra et al., 2020) for its PPO implementation but provides no link to, or statement about releasing, the authors' own code for the described methodology. |
| Open Datasets | Yes | "We employ Pythia-2.8b (Biderman et al., 2023) as our base model and the Open Assistant (OASST) (Kopf et al., 2023) and Anthropic Helpfulness and Harmlessness (HH) (Bai et al., 2022) as datasets." |
| Dataset Splits | No | "For the Open Assistant dataset (Kopf et al., 2023), ... The dataset only has a train and validation split, so we split the original train set into a train and validation set and leave the validation set for testing / evaluation. ... We sample a set of 100 input prompts from the test set of a given dataset (same 100 prompts for all models) and perform win rate evaluation using the oracle judge for the dataset." |
| Hardware Specification | No | The paper does not provide details about the hardware used for the experiments, such as GPU/CPU models or memory. |
| Software Dependencies | No | The paper mentions Pythia-2.8b as the base model and the "PPO algorithm from the TRL library (von Werra et al., 2020)" but does not specify versions for Python, PyTorch, or TRL itself, which are needed for full reproducibility. |
| Experiment Setup | Yes | "To train these models, we utilize a batch size of 64 and learning rate of 5e-7 chosen based on a hyperparameter sweep over [1e-8, 5e-8, 1e-7, 5e-7, 1e-6] on OASST. Following Rafailov et al. (2024c), we use the RMSProp optimizer with a learning rate warm up of 150 steps and constant learning rate schedule otherwise. ... For PPO, we use a learning rate = 1e-6 (obtained through a hyperparameter sweep over [1e-7, 5e-7, 1e-6] on OASST), batch size = 128, and PPOConfig defaults for all other hyperparameters." |
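The split and evaluation-sampling procedure quoted in the Dataset Splits row can be sketched in plain Python. This is a minimal sketch, not the authors' code: the function names, the 5% validation fraction, and the random seed are illustrative assumptions; only the overall procedure (original validation split repurposed as test, new validation carved from train, a fixed set of 100 prompts shared across all models) comes from the quoted text.

```python
import random

def make_splits(oasst_train, oasst_val, val_fraction=0.05, seed=0):
    """Repurpose OASST's original validation split as the test/eval set and
    carve a new validation set out of the original train split, as described
    in the paper. val_fraction and seed are illustrative assumptions."""
    rng = random.Random(seed)
    train = list(oasst_train)
    rng.shuffle(train)
    n_val = int(len(train) * val_fraction)
    new_val, new_train = train[:n_val], train[n_val:]
    test = list(oasst_val)  # original validation split becomes test/eval
    return new_train, new_val, test

def sample_eval_prompts(test_set, n=100, seed=0):
    """Sample a fixed set of n prompts from the test split; the same prompts
    are used for win-rate evaluation of every model."""
    rng = random.Random(seed)
    return rng.sample(list(test_set), n)
```

Fixing the seed is what makes "same 100 prompts for all models" hold across evaluation runs.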
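For reference, the hyperparameters quoted in the Experiment Setup row can be collected into plain dictionaries. The key names below are illustrative, not identifiers from the paper's code; the values are taken directly from the quoted text.

```python
# Hyperparameters reported for the win-rate-optimization training runs.
TRAINING_CONFIG = {
    "batch_size": 64,
    "learning_rate": 5e-7,  # chosen from a sweep over [1e-8, 5e-8, 1e-7, 5e-7, 1e-6] on OASST
    "optimizer": "RMSProp",  # following Rafailov et al. (2024c)
    "warmup_steps": 150,
    "lr_schedule": "constant after warmup",
}

# Hyperparameters reported for PPO (run via the TRL library).
PPO_CONFIG = {
    "learning_rate": 1e-6,  # chosen from a sweep over [1e-7, 5e-7, 1e-6] on OASST
    "batch_size": 128,
    # All other hyperparameters: TRL PPOConfig defaults (version unspecified).
}
```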