Behavior Priors for Efficient Reinforcement Learning
Authors: Dhruva Tirumala, Alexandre Galashov, Hyeonwoo Noh, Leonard Hasenclever, Razvan Pascanu, Jonathan Schwarz, Guillaume Desjardins, Wojciech Marian Czarnecki, Arun Ahuja, Yee Whye Teh, Nicolas Heess
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our framework by applying it to a range of simulated continuous control domains, videos of which can be found at the following url: https://sites.google.com/view/behavior-priors. ... 4. Experiments ... 6. Experiments with structured priors |
| Researcher Affiliation | Collaboration | 1DeepMind, R7, 14-18 Handyside Street, London, N1C 4DN, UK 2University College London, London WC1E 6BT, UK 3OpenAI, 3180 18th St, San Francisco, CA 94110 (contributions while at 1) |
| Pseudocode | Yes | Algorithm 1 Learning priors: SVG(0) with experience replay ... Algorithm 2 SVG(0) with experience replay for hierarchical policy |
| Open Source Code | Yes | We have open sourced the code to run the tasks and agent used in this work at https://github.com/deepmind/deepmind-research/tree/master/box_arrangement and https://github.com/deepmind/acme/tree/master/acme/agents/tf/svg0_prior respectively. |
| Open Datasets | Yes | In this section, we analyze the effect of behavior priors experimentally on a number of simulated motor control domains using walkers from the DeepMind control suite (Tassa et al., 2018) developed using the MuJoCo physics engine (Todorov et al., 2012). |
| Dataset Splits | No | The locations of the goals, objects and walker are chosen at random for every episode. In other words, each task is a distribution over goals and targets, some of which are harder to solve than others. ... All experiments were run in a distributed setup using a replay buffer with 32 CPU actors and 1 CPU learner. ... for each curve, we plot the mean of the best performing configuration averaged across 5 seeds with shaded areas showing the standard error. |
| Hardware Specification | No | All experiments were run in a distributed setup using a replay buffer with 32 CPU actors and 1 CPU learner. |
| Software Dependencies | No | In this section, we analyze the effect of behavior priors experimentally on a number of simulated motor control domains using walkers from the DeepMind control suite (Tassa et al., 2018) developed using the MuJoCo physics engine (Todorov et al., 2012). ... We used separate ADAM optimizers (Kingma and Ba, 2014) for training the critic, policy and behavior prior. |
| Experiment Setup | Yes | Appendix F. Experiment details ... F.1 Hyperparameters used for experiments ... F.1.1 Default parameters ... F.1.2 Task specific parameters ... Actor learning rate: 1e-4. Critic learning rate: 1e-4. Prior learning rate: 1e-4. Target network update period: 100. Batch size: 512. Unroll length: 10. Entropy bonus: λ = 1e-4. Distillation cost: α = 1e-3. Posterior entropy cost: 1e-3. Number of actors: 32 |
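The hyperparameters above center on a KL-regularized objective: the policy is rewarded for task return while a distillation cost α penalizes its divergence from the behavior prior. The sketch below illustrates that regularized actor loss for diagonal-Gaussian policies; the function names are hypothetical and the constants mirror the defaults reported in the table (learning rate 1e-4, distillation cost α = 1e-3), not the paper's exact implementation.

```python
import numpy as np

ALPHA = 1e-3          # distillation cost: weight on KL(policy || prior)
LEARNING_RATE = 1e-4  # default rate shared by actor, critic, and prior optimizers

def kl_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """KL divergence KL(p || q) between two diagonal Gaussians."""
    var_p, var_q = sigma_p ** 2, sigma_q ** 2
    return np.sum(
        np.log(sigma_q / sigma_p)
        + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q)
        - 0.5
    )

def regularized_actor_loss(q_value, mu_pi, sigma_pi, mu_prior, sigma_prior):
    """Actor loss: maximize the critic's Q estimate while penalizing
    divergence of the policy from the behavior prior."""
    kl = kl_gaussians(mu_pi, sigma_pi, mu_prior, sigma_prior)
    return -q_value + ALPHA * kl
```

When the policy matches the prior exactly the KL term vanishes and the loss reduces to `-q_value`; as the policy drifts, the α-weighted penalty pulls it back toward behavior the prior considers likely.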