A Survey of Reinforcement Learning from Human Feedback
Authors: Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This article provides an overview of the fundamentals of RLHF, exploring how RL agents interact with human feedback. While recent focus has been on RLHF for LLMs, our survey covers the technique across multiple domains. We examine the core principles that underpin RLHF, how algorithms and human feedback work together, and discuss the main research trends in the field. Our goal is to give researchers and practitioners a clear understanding of this rapidly growing field. |
| Researcher Affiliation | Academia | Timo Kaufmann (EMAIL): LMU Munich, MCML Munich; Paul Weng (EMAIL): Digital Innovation Research Center, Duke Kunshan University; Viktor Bengs (EMAIL): German Research Center for Artificial Intelligence (DFKI); Eyke Hüllermeier (EMAIL): LMU Munich, MCML Munich, DFKI Kaiserslautern |
| Pseudocode | Yes | Algorithm 1 Generic RLHF Algorithm in an Actor-Critic Scheme. |
| Open Source Code | No | The paper is a survey and summarizes existing research. It does not introduce a new methodology that would have associated open-source code from the authors. While it mentions several existing open-source libraries related to RLHF, these are third-party tools, not code for the specific work described in this survey. |
| Open Datasets | Yes | Particularly notable are hh-rlhf (Bai et al., 2022a) and PKU-Safe-RLHF (Ji et al., 2023a), two datasets focusing on harmless and helpful responses; the Open Assistant datasets (oasst1, oasst2) (Köpf et al., 2023), containing not only response rankings but also ratings on various dimensions; the summarize_from_feedback dataset (Stiennon et al., 2020), focusing on preferences over text summaries; the Stanford Human Preferences Dataset (SHP) (Ethayarajh et al., 2022), which is based on Reddit responses; the WebGPT dataset (webgpt_comparisons) (Nakano et al., 2022), focused on long-form question answering; and the HelpSteer dataset (Wang et al., 2024d), which is not based on preferences but instead gives ratings for four attributes (helpfulness, correctness, coherence, complexity) for each response. |
| Dataset Splits | No | The paper is a survey and does not perform its own experiments. Therefore, it does not provide specific training/test/validation dataset splits for its own work. It discusses how other works handle data, but not its own splits. |
| Hardware Specification | No | The paper is a survey and does not perform its own experiments. Therefore, it does not describe the hardware specifications used for its own work. |
| Software Dependencies | No | The paper is a survey and summarizes existing research. It does not present a specific methodology with its own software dependencies. While it mentions supporting libraries developed by others (e.g., 'trlX: A scalable framework for RLHF'), these are not software dependencies of the survey itself. |
| Experiment Setup | No | The paper is a survey and does not conduct its own experiments. Therefore, it does not provide details about an experimental setup or hyperparameters for its own work. |
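The Pseudocode row above refers to the survey's "Algorithm 1: Generic RLHF Algorithm in an Actor-Critic Scheme". The following is a minimal, self-contained sketch of that generic pattern, not the paper's actual algorithm: a reward model is fitted from pairwise preferences with a Bradley-Terry loss, and an actor-critic learner then optimizes against the learned reward. The toy environment, the simulated "human" labeller, and all hyperparameters are illustrative assumptions.

```python
import math
import random

random.seed(0)

N_STATES, N_ACTIONS = 4, 2
GAMMA = 0.9

# Learned reward model: one scalar per (state, action) pair, fitted
# from pairwise preferences over trajectory segments (Bradley-Terry).
reward_hat = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def bt_update(seg_a, seg_b, pref_a, lr=0.5):
    """One gradient-ascent step on the Bradley-Terry log-likelihood.

    seg_a, seg_b: lists of (state, action) pairs; pref_a: 1.0 if the
    labeller preferred segment A, else 0.0.
    """
    ra = sum(reward_hat[s][a] for s, a in seg_a)
    rb = sum(reward_hat[s][a] for s, a in seg_b)
    p_a = 1.0 / (1.0 + math.exp(rb - ra))  # P(A preferred | reward model)
    grad = pref_a - p_a                    # d(log-likelihood)/d(ra)
    for s, a in seg_a:
        reward_hat[s][a] += lr * grad
    for s, a in seg_b:
        reward_hat[s][a] -= lr * grad

# Actor (softmax policy logits) and critic (state values).
logits = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
values = [0.0] * N_STATES

def policy(s):
    """Sample an action from the softmax over logits[s]."""
    mx = max(logits[s])
    exps = [math.exp(l - mx) for l in logits[s]]
    probs = [e / sum(exps) for e in exps]
    r, acc = random.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if r < acc:
            return a, probs
    return N_ACTIONS - 1, probs

def step(s, a):
    return (s + 1) % N_STATES  # toy deterministic transition

def human_pref(seg_a, seg_b):
    """Simulated labeller: prefers the segment with more of action 1."""
    score = lambda seg: sum(a for _, a in seg)
    return 1.0 if score(seg_a) >= score(seg_b) else 0.0

for _ in range(300):
    # 1. Collect two short rollouts and query the (simulated) human.
    segs = []
    for _ in range(2):
        s, seg = random.randrange(N_STATES), []
        for _ in range(3):
            a, _ = policy(s)
            seg.append((s, a))
            s = step(s, a)
        segs.append(seg)
    bt_update(segs[0], segs[1], human_pref(segs[0], segs[1]))

    # 2. Actor-critic update driven by the *learned* reward signal.
    s = random.randrange(N_STATES)
    for _ in range(5):
        a, probs = policy(s)
        s_next = step(s, a)
        td = reward_hat[s][a] + GAMMA * values[s_next] - values[s]
        values[s] += 0.1 * td                       # critic step
        for a2 in range(N_ACTIONS):                 # actor step (policy gradient)
            ind = 1.0 if a2 == a else 0.0
            logits[s][a2] += 0.05 * td * (ind - probs[a2])
        s = s_next
```

The key structural point the sketch captures is the interleaving: preference queries update the reward model, and the agent never sees a ground-truth reward, only the learned proxy.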