A Survey of Reinforcement Learning from Human Feedback
Authors: Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This article provides an overview of the fundamentals of RLHF, exploring how RL agents interact with human feedback. While recent focus has been on RLHF for LLMs, our survey covers the technique across multiple domains. We examine the core principles that underpin RLHF, how algorithms and human feedback work together, and discuss the main research trends in the field. Our goal is to give researchers and practitioners a clear understanding of this rapidly growing field. |
| Researcher Affiliation | Academia | Timo Kaufmann (EMAIL): LMU Munich, MCML Munich; Paul Weng (EMAIL): Digital Innovation Research Center, Duke Kunshan University; Viktor Bengs (EMAIL): German Research Center for Artificial Intelligence (DFKI); Eyke Hüllermeier (EMAIL): LMU Munich, MCML Munich, DFKI Kaiserslautern |
| Pseudocode | Yes | Algorithm 1 Generic RLHF Algorithm in an Actor-Critic Scheme. |
| Open Source Code | No | The paper is a survey and summarizes existing research. It does not introduce a new methodology that would have associated open-source code from the authors. While it mentions several existing open-source libraries related to RLHF, these are third-party tools, not code for the specific work described in this survey. |
| Open Datasets | Yes | Particularly notable are hh-rlhf (Bai et al., 2022a) and PKU-Safe-RLHF (Ji et al., 2023a), two datasets focusing on harmless and helpful responses; the Open Assistant datasets (oasst1, oasst2) (Köpf et al., 2023), containing not only response rankings but also ratings on various dimensions; the summarize_from_feedback dataset (Stiennon et al., 2020), focusing on preferences over text summaries; the Stanford Human Preferences Dataset (SHP) (Ethayarajh et al., 2022), which is based on Reddit responses; the WebGPT dataset (webgpt_comparisons) (Nakano et al., 2022), focused on long-form question answering; and the HelpSteer dataset (Wang et al., 2024d), which is not based on preferences but instead gives ratings for four attributes (helpfulness, correctness, coherence, complexity) for each response. |
| Dataset Splits | No | The paper is a survey and does not perform its own experiments. Therefore, it does not provide specific training/test/validation dataset splits for its own work. It discusses how other works handle data, but not its own splits. |
| Hardware Specification | No | The paper is a survey and does not perform its own experiments. Therefore, it does not describe the hardware specifications used for its own work. |
| Software Dependencies | No | The paper is a survey and summarizes existing research. It does not present a specific methodology with its own software dependencies. While it mentions supporting libraries developed by others (e.g., 'trlX: A scalable framework for RLHF'), these are not software dependencies of the survey itself. |
| Experiment Setup | No | The paper is a survey and does not conduct its own experiments. Therefore, it does not provide details about an experimental setup or hyperparameters for its own work. |
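The Pseudocode row above refers to the survey's "Algorithm 1: Generic RLHF Algorithm in an Actor-Critic Scheme". The following is a minimal, self-contained sketch of that generic pattern, not the paper's actual algorithm: a reward model is fitted from pairwise preferences with a Bradley-Terry loss, and an actor-critic learner then optimizes against the learned reward. The toy environment, the simulated "human" labeller, and all hyperparameters are illustrative assumptions.

```python
import math
import random

random.seed(0)

N_STATES, N_ACTIONS = 4, 2
GAMMA = 0.9

# Learned reward model: one scalar per (state, action) pair, fitted
# from pairwise preferences over trajectory segments (Bradley-Terry).
reward_hat = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def bt_update(seg_a, seg_b, pref_a, lr=0.5):
    """One gradient-ascent step on the Bradley-Terry log-likelihood.

    seg_a, seg_b: lists of (state, action) pairs; pref_a: 1.0 if the
    labeller preferred segment A, else 0.0.
    """
    ra = sum(reward_hat[s][a] for s, a in seg_a)
    rb = sum(reward_hat[s][a] for s, a in seg_b)
    p_a = 1.0 / (1.0 + math.exp(rb - ra))  # P(A preferred | reward model)
    grad = pref_a - p_a                    # d(log-likelihood)/d(ra)
    for s, a in seg_a:
        reward_hat[s][a] += lr * grad
    for s, a in seg_b:
        reward_hat[s][a] -= lr * grad

# Actor (softmax policy logits) and critic (state values).
logits = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
values = [0.0] * N_STATES

def policy(s):
    """Sample an action from the softmax over logits[s]."""
    mx = max(logits[s])
    exps = [math.exp(l - mx) for l in logits[s]]
    probs = [e / sum(exps) for e in exps]
    r, acc = random.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if r < acc:
            return a, probs
    return N_ACTIONS - 1, probs

def step(s, a):
    return (s + 1) % N_STATES  # toy deterministic transition

def human_pref(seg_a, seg_b):
    """Simulated labeller: prefers the segment with more of action 1."""
    score = lambda seg: sum(a for _, a in seg)
    return 1.0 if score(seg_a) >= score(seg_b) else 0.0

for _ in range(300):
    # 1. Collect two short rollouts and query the (simulated) human.
    segs = []
    for _ in range(2):
        s, seg = random.randrange(N_STATES), []
        for _ in range(3):
            a, _ = policy(s)
            seg.append((s, a))
            s = step(s, a)
        segs.append(seg)
    bt_update(segs[0], segs[1], human_pref(segs[0], segs[1]))

    # 2. Actor-critic update driven by the *learned* reward signal.
    s = random.randrange(N_STATES)
    for _ in range(5):
        a, probs = policy(s)
        s_next = step(s, a)
        td = reward_hat[s][a] + GAMMA * values[s_next] - values[s]
        values[s] += 0.1 * td                       # critic step
        for a2 in range(N_ACTIONS):                 # actor step (policy gradient)
            ind = 1.0 if a2 == a else 0.0
            logits[s][a2] += 0.05 * td * (ind - probs[a2])
        s = s_next
```

The key structural point the sketch captures is the interleaving: preference queries update the reward model, and the agent never sees a ground-truth reward, only the learned proxy.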