Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Information Directed Reward Learning for Reinforcement Learning
Authors: David Lindner, Matteo Turchetta, Sebastian Tschiatschek, Kamil Ciosek, Andreas Krause
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We support our findings with extensive evaluations in multiple environments and with different query types. |
| Researcher Affiliation | Collaboration | David Lindner (Department of Computer Science, ETH Zurich); Matteo Turchetta (Department of Computer Science, ETH Zurich); Sebastian Tschiatschek (Department of Computer Science, University of Vienna); Kamil Ciosek (Spotify); Andreas Krause (Department of Computer Science, ETH Zurich) |
| Pseudocode | Yes | Algorithm 1 Information Directed Reward Learning (IDRL). The algorithm requires a set of candidate queries Qc, a Bayesian model of the reward function, and an RL algorithm that returns a policy given a reward function. Ĝ(π) is the belief about the expected return of policy π, induced by the reward model P(r̂|D), and r̂ is the belief about the reward function. (A hedged sketch of this loop appears below the table.) |
| Open Source Code | Yes | Appendices D and E describe the experimental setup in more detail, and we provide code to reproduce all experiments. [Footnote 4: https://github.com/david-lindner/idrl] |
| Open Datasets | No | The paper uses simulated environments (Gridworlds, Driver, MuJoCo tasks) where data is generated through interaction, rather than relying on pre-existing, publicly available datasets with explicitly defined training sets. |
| Dataset Splits | No | The paper uses simulated environments and does not specify training, validation, or test dataset splits in terms of percentages or sample counts, as it generates data dynamically through interactions. |
| Hardware Specification | No | The paper mentions experiments running on a 'single CPU' or 'single GPU' but does not provide specific CPU/GPU models, types, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions 'augmented random search', 'Soft Actor-Critic algorithm (SAC)', and 'OpenAI Gym' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 3e-4... The batch size is 256... The policy is trained for 10^7 timesteps... (See the configuration sketch below.) |
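To make the quoted pseudocode concrete, here is a minimal sketch of an IDRL-style query-selection loop under strong simplifying assumptions: a Bayesian *linear* reward model over known features (the paper also uses more general models such as GPs), policies summarized by fixed expected feature counts, and single-state reward queries. All names, dimensions, and values are illustrative, not the authors' implementation; see https://github.com/david-lindner/idrl for that.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: reward r(s) = w . phi(s) with a Gaussian belief over w;
# each policy is summarized by its expected feature count Phi(pi).
d = 4                                  # feature dimension (illustrative)
true_w = rng.normal(size=d)            # unknown "true" reward parameters
policies = rng.normal(size=(5, d))     # rows: expected feature counts Phi(pi)
candidates = rng.normal(size=(20, d))  # rows: features of candidate queries
noise = 0.1                            # observation noise std

mu, Sigma = np.zeros(d), np.eye(d)     # Gaussian prior over w

def posterior(mu, Sigma, phi, y):
    """Bayesian linear-Gaussian update after observing y ~ w.phi + noise."""
    s = phi @ Sigma @ phi + noise**2
    k = Sigma @ phi / s
    return mu + k * (y - mu @ phi), Sigma - np.outer(k, phi @ Sigma)

for _ in range(10):
    # The two most plausible optimal policies under the current mean reward.
    returns = policies @ mu
    i, j = np.argsort(returns)[-2:]
    delta = policies[i] - policies[j]  # direction whose sign decides the winner

    # IDRL-style selection: query that most reduces Var[G(pi_i) - G(pi_j)].
    def post_var(phi):
        _, S = posterior(mu, Sigma, phi, 0.0)  # variance is y-independent
        return delta @ S @ delta

    q = min(range(len(candidates)), key=lambda m: post_var(candidates[m]))

    # Observe a noisy reward for the chosen query and update the belief.
    y = true_w @ candidates[q] + rng.normal(scale=noise)
    mu, Sigma = posterior(mu, Sigma, candidates[q], y)

print("estimated best policy:", int(np.argmax(policies @ mu)))
```

The key step mirrors the paper's idea: rather than reducing reward uncertainty uniformly, each query is chosen for the information it provides about which of the current top policy candidates is actually better.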
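The quoted hyperparameters can be collected into a short configuration sketch. The library (Stable-Baselines3) and the environment are assumptions for illustration only; the paper names SAC, the Adam optimizer, and the quoted values, but not a specific implementation.

```python
# Hedged sketch of the quoted SAC setup; library and environment are assumed.
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    "Pendulum-v1",        # placeholder environment, not from the paper
    learning_rate=3e-4,   # Adam with learning rate 3e-4, as quoted
    batch_size=256,       # as quoted
)
model.learn(total_timesteps=10_000_000)  # 10^7 timesteps, as quoted
```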