reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Survey of Preference-Based Reinforcement Learning Methods

Authors: Christian Wirth, Riad Akrour, Gerhard Neumann, Johannes Fürnkranz

JMLR 2017

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	A Survey of Preference-Based Reinforcement Learning Methods. In this paper, we will survey and categorize preference-based formulations of reinforcement learning and make these assumptions explicit. We also show the relation to other RL-based settings that deviate from the basic setting, such as inverse reinforcement learning or learning with advice. This paper is based on a preliminary survey (Wirth and Fürnkranz, 2013c), but has been extended in almost every aspect.
Researcher Affiliation	Academia	Christian Wirth EMAIL Knowledge Engineering Group, Technische Universität Darmstadt Hochschulstraße 10, 64289 Darmstadt, Germany; Riad Akrour EMAIL Computational Learning for Autonomous Systems, Technische Universität Darmstadt Hochschulstraße 10, 64289 Darmstadt, Germany; Gerhard Neumann EMAIL Computational Learning, School of Computer Science, University of Lincoln Brayford Pool, Lincoln, LN6 7TS, Great Britain; Johannes Fürnkranz EMAIL Knowledge Engineering Group, Technische Universität Darmstadt Hochschulstraße 10, 64289 Darmstadt, Germany
Pseudocode	Yes	Algorithm 1 Policy Likelihood Require: prior Pr(π), step limit k, sample limit m, iteration limit n; Algorithm 2 Policy Ranking Require: candidate policies Π0, step limit k, sample limit m, iteration limit n; Algorithm 3 Preference-based Approximate Policy Iteration Require: initial policy π0, iteration limit m, state sample limit k, rollout limit n; Algorithm 4 Utility-based PbRL Require: initial policy π0, iteration limit m, state sample limit k, rollout limit n
Open Source Code	No	The paper is a survey of preference-based reinforcement learning methods and describes existing algorithms and their characteristics. It does not present new experimental work or methodology requiring specific code release from the authors of this survey. Therefore, no explicit mention or link to open-source code for the methodology described in this paper is provided.
Open Datasets	No	The paper is a survey and does not present new experimental work using specific datasets. It references datasets used in other research, such as 'Professional chess games are commonly stored in large databases' (Section 4.3) and the ATARI learning environment and MuJoCo framework (Section 4.4), but these are datasets used by the authors of the surveyed papers, not by the authors of this survey paper for their own experiments.
Dataset Splits	No	This paper is a survey and does not describe experiments with its own dataset, thus there is no mention of training/test/validation dataset splits. While it mentions how other papers might use data, it does not specify any splits relevant to its own content.
Hardware Specification	No	This paper is a survey of existing methods and does not report on new experimental results by its authors. Therefore, it does not provide hardware specifications for experiments conducted within the scope of this paper. Any hardware mentioned (e.g., 'real 7-DoF KUKA LBR arm' in Section 4.1) refers to equipment used in the surveyed research, not the authors' own work.
Software Dependencies	No	This paper is a survey and does not present new experimental work or methodology that would require specific software dependencies with version numbers from its authors. While it lists various algorithms and tools (e.g., 'CMA-ES', 'LSTD', 'TRPO', 'A3C') in Table 2, these appear in the context of the surveyed papers and are not dependencies of the survey itself.
Experiment Setup	No	This paper is a survey and does not present new experimental work by its authors. Therefore, it does not describe an experimental setup with hyperparameters or system-level training settings. Any experimental details mentioned (e.g., 'evaluated their system on a wide range of path planning tasks' in Section 4.2) refer to the experimental setups of the surveyed research, not the authors' own work.