Design Considerations in Offline Preference-based RL
Authors: Alekh Agarwal, Christoph Dann, Teodor Vanislavov Marinov
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also conduct a small empirical study to verify some of the theoretical findings on a standard summarization benchmark. |
| Researcher Affiliation | Industry | Google Research. Correspondence to: Alekh Agarwal <EMAIL>, Christoph Dann <EMAIL>, Teodor V. Marinov <EMAIL>. |
| Pseudocode | No | The paper describes methods conceptually and provides mathematical formulations, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | We evaluate the impact of different design choices for offline RLHF methods on the standard TL;DR summarization task (Völske et al., 2017; Stiennon et al., 2020). |
| Dataset Splits | No | The paper mentions using the TL;DR dataset but does not specify any particular training, validation, or test splits used for their experiments. It describes the dataset as consisting of posts and paired responses, but not how they partition this data for evaluation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | Our experiments use a T5 large model (Raffel et al., 2020) with 770M parameters... The optimizer used is Adafactor with learning rate that is constant with a linear warm-up for 2000 steps and a base rate of 1e-4. No specific version numbers for software components or libraries are provided. |
| Experiment Setup | Yes | We train for 20000 iterations, with a batch size of 32. A KL regularizer is used to the reference πref checkpoint with coefficient equal to 0.005. The optimizer used is Adafactor with learning rate that is constant with a linear warm-up for 2000 steps and a base rate of 1e-4. |
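The experiment-setup row above can be sketched as a configuration snippet. This is a hypothetical illustration, not the authors' released code (the paper releases none): the function and dictionary names are assumptions, and only the numeric values (20000 iterations, batch size 32, KL coefficient 0.005, 2000 warm-up steps, base rate 1e-4) come from the quoted setup.

```python
def learning_rate(step: int, base_rate: float = 1e-4, warmup_steps: int = 2000) -> float:
    """Linear warm-up over the first 2000 steps, then a constant base rate,
    matching the schedule described in the paper's experiment setup."""
    if step < warmup_steps:
        return base_rate * (step + 1) / warmup_steps
    return base_rate

# Hyperparameters reported in the paper (names here are illustrative).
config = {
    "model": "T5-large (770M parameters)",
    "optimizer": "Adafactor",
    "iterations": 20_000,
    "batch_size": 32,
    "kl_coefficient": 0.005,  # KL regularizer toward the reference policy pi_ref
    "base_learning_rate": 1e-4,
    "warmup_steps": 2_000,
}
```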