Learning from negative feedback, or positive feedback or both

Authors: Abbas Abdolmaleki, Bilal Piot, Bobak Shahriari, Jost Springenberg, Tim Hertweck, Michael Bloesch, Rishabh Joshi, Thomas Lampe, Junhyuk Oh, Nicolas Heess, Jonas Buchli, Martin Riedmiller

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our algorithm in a variety of different settings, showcasing its utility as a general preference optimization algorithm that can deal with many different forms of preference feedback. In this section, we aim at confirming our derivations in practice to learn from only negative feedback, only positive feedback, or both. We first test it in a bandit setting (optimizing synthetic benchmark functions), then in a setting where we transform RL on control and robotics tasks into preference optimization. Finally, we showcase strong performance for RLHF of large language models.
Researcher Affiliation | Industry | Abbas Abdolmaleki, Bilal Piot, Bobak Shahriari, Jost Tobias Springenberg, Tim Hertweck, Rishabh Joshi, Junhyuk Oh, Michael Bloesch, Thomas Lampe, Nicolas Heess, Jonas Buchli, Martin Riedmiller, Google DeepMind. Corresponding Author: Abbas Abdolmaleki <EMAIL>
Pseudocode | No | The paper presents mathematical derivations, objectives, and update rules (e.g., equation 10), but it does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or provide links to a code repository for the methodology described. It acknowledges the Gemma team for models and infrastructure but does not state their own code is open-source.
Open Datasets | Yes | Our algorithm is first evaluated on synthetic benchmarks: Rosenbrock, Sphere, and Schwefel functions (Hansen et al., 2003). We evaluate our algorithm on a range of control tasks from the DeepMind Control Suite (Tunyasuvunakool et al., 2020). We consider the RGB Stacking benchmark (Lee et al., 2021). Specifically, we fine-tune a Gemma 2B pre-trained model using a trained reward model (Team, 2024b) using prompts from the LMSYS-Chat-1M dataset (Zheng et al., 2023).
Dataset Splits | No | The paper mentions generating samples and labeling them (e.g., "the top 2 samples are labeled as preferred, the others dis-preferred") and refers to "held-out test prompts", but it does not provide specific numerical percentages or counts for training, validation, or test splits for the overall datasets used.
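The quoted labeling rule (top-2 samples preferred, the rest dis-preferred) can be sketched in a few lines. This is a hypothetical illustration, not code from the paper; the function name and reward values are invented:

```python
# Hypothetical sketch of the labeling rule quoted above: among the
# generations sampled for one prompt, the top-2 by reward are marked
# preferred and the remainder dis-preferred.
def label_generations(rewards, num_preferred=2):
    """Return a boolean 'preferred' flag for each generation."""
    top = sorted(range(len(rewards)),
                 key=lambda i: rewards[i],
                 reverse=True)[:num_preferred]
    return [i in top for i in range(len(rewards))]

# Four generations for one prompt, scored by a reward model.
print(label_generations([0.1, 0.9, 0.4, 0.7]))  # [False, True, False, True]
```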
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running its experiments. It mentions using 'models and infrastructure' from the Gemma team but no explicit hardware specifications.
Software Dependencies | No | The paper mentions using the MuJoCo simulator (Todorov et al., 2012) and the Gemma 2B and GPT-4 models, but it does not specify version numbers for any software libraries, frameworks, or environments used in their implementation that would be necessary for reproduction.
Experiment Setup | Yes | For all the experiments, we will use a beta value of 0.5 for learning from accept&reject, 0.0 for learning from accept only, and 2.0 for learning from reject only, unless stated otherwise. Furthermore, in all experiments except experiment 5.3, the reference policy for all baselines is updated every N steps. In these experiments, we perform one epoch of training, processing a dataset of 500k prompts in approximately 4000 learner steps, meaning that each batch is composed of 128 prompts and 4 generations per prompt.
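The quoted batch geometry is internally consistent and can be sanity-checked with a quick calculation: one epoch over 500k prompts at 128 prompts per batch gives about 3906 learner steps, matching the stated "approximately 4000":

```python
# Sanity check of the quoted experiment setup: one epoch over
# 500k prompts, 128 prompts per learner batch, 4 generations each.
num_prompts = 500_000
prompts_per_batch = 128
generations_per_prompt = 4

learner_steps = num_prompts / prompts_per_batch          # 3906.25 -> ~4000 steps
sequences_per_batch = prompts_per_batch * generations_per_prompt  # 512 generations

print(round(learner_steps), sequences_per_batch)  # 3906 512
```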