Learning from negative feedback, or positive feedback or both

Authors: Abbas Abdolmaleki, Bilal Piot, Bobak Shahriari, Jost Springenberg, Tim Hertweck, Michael Bloesch, Rishabh Joshi, Thomas Lampe, Junhyuk Oh, Nicolas Heess, Jonas Buchli, Martin Riedmiller

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our algorithm in a variety of different settings, showcasing its utility as a general preference optimization algorithm that can deal with many different forms of preference feedback. In this section, we aim at confirming our derivations in practice to learn from only negative feedback, only positive feedback, or both. We first test it in a bandit setting (optimizing synthetic benchmark functions), then in a setting where we transform RL on control and robotics tasks into preference optimization. Finally, we showcase strong performance for RLHF of large language models.
Researcher Affiliation | Industry | Abbas Abdolmaleki, Bilal Piot, Bobak Shahriari, Jost Tobias Springenberg, Tim Hertweck, Rishabh Joshi, Junhyuk Oh, Michael Bloesch, Thomas Lampe, Nicolas Heess, Jonas Buchli, Martin Riedmiller, Google DeepMind. Corresponding Author: Abbas Abdolmaleki <EMAIL>
Pseudocode | No | The paper presents mathematical derivations, objectives, and update rules (e.g., equation 10), but it does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide links to a code repository for the methodology described. It acknowledges the Gemma team for models and infrastructure but does not state their own code is open-source.
Open Datasets | Yes | Our algorithm is first evaluated on synthetic benchmarks: Rosenbrock, Sphere, and Schwefel functions (Hansen et al., 2003). We evaluate our algorithm on a range of control tasks from the DeepMind Control Suite (Tunyasuvunakool et al., 2020). We consider the RGB Stacking benchmark (Lee et al., 2021). Specifically, we fine-tune a Gemma 2B pre-trained model using a trained reward model (Team, 2024b) using prompts from the LMSYS-Chat-1M dataset (Zheng et al., 2023).
Dataset Splits | No | The paper mentions generating samples and labeling them (e.g., "the top 2 samples are labeled as preferred, the others dis-preferred") and refers to "held-out test prompts", but it does not provide specific numerical percentages or counts for training, validation, or test splits for the overall datasets used.
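The quoted labeling rule (top-2 samples preferred, the rest dis-preferred) can be sketched in a few lines. This is a hypothetical illustration, not code from the paper; the function name and reward values are invented:

```python
# Hypothetical sketch of the labeling rule quoted above: among the
# generations sampled for one prompt, the top-2 by reward are marked
# preferred and the remainder dis-preferred.
def label_generations(rewards, num_preferred=2):
    """Return a boolean 'preferred' flag for each generation."""
    top = sorted(range(len(rewards)),
                 key=lambda i: rewards[i],
                 reverse=True)[:num_preferred]
    return [i in top for i in range(len(rewards))]

# Four generations for one prompt, scored by a reward model.
print(label_generations([0.1, 0.9, 0.4, 0.7]))  # [False, True, False, True]
```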
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running its experiments. It mentions using 'models and infrastructure' from the Gemma team but no explicit hardware specifications.
Software Dependencies | No | The paper mentions using the MuJoCo simulator (Todorov et al., 2012) and the Gemma 2B and GPT-4 models, but it does not specify version numbers for any software libraries, frameworks, or environments used in their implementation that would be necessary for reproduction.
Experiment Setup | Yes | For all the experiments, we will use a beta value of 0.5 for learning from accept&reject, 0.0 for learning from accept only, and 2.0 for learning from reject only, unless stated otherwise. Furthermore, in all experiments except experiment 5.3, the reference policy for all baselines is updated every N steps. In these experiments, we perform one epoch of training, processing a dataset of 500k prompts in approximately 4000 learner steps, meaning that each batch is composed of 128 prompts and 4 generations per prompt.
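The quoted batch geometry is internally consistent and can be sanity-checked with a quick calculation: one epoch over 500k prompts at 128 prompts per batch gives about 3906 learner steps, matching the stated "approximately 4000":

```python
# Sanity check of the quoted experiment setup: one epoch over
# 500k prompts, 128 prompts per learner batch, 4 generations each.
num_prompts = 500_000
prompts_per_batch = 128
generations_per_prompt = 4

learner_steps = num_prompts / prompts_per_batch          # 3906.25 -> ~4000 steps
sequences_per_batch = prompts_per_batch * generations_per_prompt  # 512 generations

print(round(learner_steps), sequences_per_batch)  # 3906 512
```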