Design Considerations in Offline Preference-based RL
Authors: Alekh Agarwal, Christoph Dann, Teodor Vanislavov Marinov
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also conduct a small empirical study to verify some of the theoretical findings on a standard summarization benchmark. |
| Researcher Affiliation | Industry | Google Research. Correspondence to: Alekh Agarwal <EMAIL>, Christoph Dann <EMAIL>, Teodor V. Marinov <EMAIL>. |
| Pseudocode | No | The paper describes methods conceptually and provides mathematical formulations, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | We evaluate the impact of different design choices for offline RLHF methods on the standard TL;DR summarization task (Völske et al., 2017; Stiennon et al., 2020). |
| Dataset Splits | No | The paper mentions using the TL;DR dataset but does not specify any particular training, validation, or test splits used for their experiments. It describes the dataset as consisting of posts and paired responses, but not how they partition this data for evaluation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | Our experiments use a T5 large model (Raffel et al., 2020) with 770M parameters... The optimizer used is Adafactor with learning rate that is constant with a linear warm-up for 2000 steps and a base rate of 1e-4. No specific version numbers for software components or libraries are provided. |
| Experiment Setup | Yes | We train for 20000 iterations, with a batch size of 32. A KL regularizer is used to the reference πref checkpoint with coefficient equal to 0.005. The optimizer used is Adafactor with learning rate that is constant with a linear warm-up for 2000 steps and a base rate of 1e-4. |
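The experiment-setup row above can be sketched as a configuration snippet. This is a hypothetical illustration, not the authors' released code (the paper releases none): the function and dictionary names are assumptions, and only the numeric values (20000 iterations, batch size 32, KL coefficient 0.005, 2000 warm-up steps, base rate 1e-4) come from the quoted setup.

```python
def learning_rate(step: int, base_rate: float = 1e-4, warmup_steps: int = 2000) -> float:
    """Linear warm-up over the first 2000 steps, then a constant base rate,
    matching the schedule described in the paper's experiment setup."""
    if step < warmup_steps:
        return base_rate * (step + 1) / warmup_steps
    return base_rate

# Hyperparameters reported in the paper (names here are illustrative).
config = {
    "model": "T5-large (770M parameters)",
    "optimizer": "Adafactor",
    "iterations": 20_000,
    "batch_size": 32,
    "kl_coefficient": 0.005,  # KL regularizer toward the reference policy pi_ref
    "base_learning_rate": 1e-4,
    "warmup_steps": 2_000,
}
```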