Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Information Directed Reward Learning for Reinforcement Learning
Authors: David Lindner, Matteo Turchetta, Sebastian Tschiatschek, Kamil Ciosek, Andreas Krause
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We support our findings with extensive evaluations in multiple environments and with different query types. |
| Researcher Affiliation | Collaboration | David Lindner (Department of Computer Science, ETH Zurich); Matteo Turchetta (Department of Computer Science, ETH Zurich); Sebastian Tschiatschek (Department of Computer Science, University of Vienna); Kamil Ciosek (Spotify); Andreas Krause (Department of Computer Science, ETH Zurich) |
| Pseudocode | Yes | Algorithm 1 Information Directed Reward Learning (IDRL). The algorithm requires a set of candidate queries Qc, a Bayesian model of the reward function, and an RL algorithm that returns a policy given a reward function. Ĝ(π) is the belief about the expected return of policy π, induced by the reward model P(r̂|D), and r̂ is the belief about the reward function. (A hedged sketch of this loop appears below the table.) |
| Open Source Code | Yes | Appendices D and E describe the experimental setup in more detail, and we provide code to reproduce all experiments. [Footnote 4: https://github.com/david-lindner/idrl] |
| Open Datasets | No | The paper uses simulated environments (Gridworlds, Driver, MuJoCo tasks) where data is generated through interaction, rather than relying on pre-existing, publicly available datasets with explicitly defined training sets. |
| Dataset Splits | No | The paper uses simulated environments and does not specify training, validation, or test dataset splits in terms of percentages or sample counts, as it generates data dynamically through interactions. |
| Hardware Specification | No | The paper mentions experiments running on a 'single CPU' or 'single GPU' but does not provide specific CPU/GPU models, types, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions 'augmented random search', 'Soft Actor-Critic algorithm (SAC)', and 'OpenAI Gym' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 3e-4... The batch size is 256... The policy is trained for 10^7 timesteps... (See the configuration sketch below.) |
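To make the quoted pseudocode concrete, here is a minimal sketch of an IDRL-style query-selection loop under strong simplifying assumptions: a Bayesian *linear* reward model over known features (the paper also uses more general models such as GPs), policies summarized by fixed expected feature counts, and single-state reward queries. All names, dimensions, and values are illustrative, not the authors' implementation; see https://github.com/david-lindner/idrl for that.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: reward r(s) = w . phi(s) with a Gaussian belief over w;
# each policy is summarized by its expected feature count Phi(pi).
d = 4                                  # feature dimension (illustrative)
true_w = rng.normal(size=d)            # unknown "true" reward parameters
policies = rng.normal(size=(5, d))     # rows: expected feature counts Phi(pi)
candidates = rng.normal(size=(20, d))  # rows: features of candidate queries
noise = 0.1                            # observation noise std

mu, Sigma = np.zeros(d), np.eye(d)     # Gaussian prior over w

def posterior(mu, Sigma, phi, y):
    """Bayesian linear-Gaussian update after observing y ~ w.phi + noise."""
    s = phi @ Sigma @ phi + noise**2
    k = Sigma @ phi / s
    return mu + k * (y - mu @ phi), Sigma - np.outer(k, phi @ Sigma)

for _ in range(10):
    # The two most plausible optimal policies under the current mean reward.
    returns = policies @ mu
    i, j = np.argsort(returns)[-2:]
    delta = policies[i] - policies[j]  # direction whose sign decides the winner

    # IDRL-style selection: query that most reduces Var[G(pi_i) - G(pi_j)].
    def post_var(phi):
        _, S = posterior(mu, Sigma, phi, 0.0)  # variance is y-independent
        return delta @ S @ delta

    q = min(range(len(candidates)), key=lambda m: post_var(candidates[m]))

    # Observe a noisy reward for the chosen query and update the belief.
    y = true_w @ candidates[q] + rng.normal(scale=noise)
    mu, Sigma = posterior(mu, Sigma, candidates[q], y)

print("estimated best policy:", int(np.argmax(policies @ mu)))
```

The key step mirrors the paper's idea: rather than reducing reward uncertainty uniformly, each query is chosen for the information it provides about which of the current top policy candidates is actually better.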
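The quoted hyperparameters can be collected into a short configuration sketch. The library (Stable-Baselines3) and the environment are assumptions for illustration only; the paper names SAC, the Adam optimizer, and the quoted values, but not a specific implementation.

```python
# Hedged sketch of the quoted SAC setup; library and environment are assumed.
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    "Pendulum-v1",        # placeholder environment, not from the paper
    learning_rate=3e-4,   # Adam with learning rate 3e-4, as quoted
    batch_size=256,       # as quoted
)
model.learn(total_timesteps=10_000_000)  # 10^7 timesteps, as quoted
```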