Behavior Priors for Efficient Reinforcement Learning
Authors: Dhruva Tirumala, Alexandre Galashov, Hyeonwoo Noh, Leonard Hasenclever, Razvan Pascanu, Jonathan Schwarz, Guillaume Desjardins, Wojciech Marian Czarnecki, Arun Ahuja, Yee Whye Teh, Nicolas Heess
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our framework by applying it to a range of simulated continuous control domains, videos of which can be found at the following url: https://sites.google.com/view/behavior-priors. ... 4. Experiments ... 6. Experiments with structured priors |
| Researcher Affiliation | Collaboration | 1DeepMind, R7, 14-18 Handyside Street, London, N1C 4DN, UK 2University College London, London WC1E 6BT, UK 3OpenAI, 3180 18th St, San Francisco, CA 94110 (contributions while at 1) |
| Pseudocode | Yes | Algorithm 1 Learning priors: SVG(0) with experience replay ... Algorithm 2 SVG(0) with experience replay for hierarchical policy |
| Open Source Code | Yes | We have open sourced the code to run the tasks and agent used in this work at https://github.com/deepmind/deepmind-research/tree/master/box_arrangement and https://github.com/deepmind/acme/tree/master/acme/agents/tf/svg0_prior respectively. |
| Open Datasets | Yes | In this section, we analyze the effect of behavior priors experimentally on a number of simulated motor control domains using walkers from the DeepMind control suite (Tassa et al., 2018) developed using the MuJoCo physics engine (Todorov et al., 2012). |
| Dataset Splits | No | The locations of the goals, objects and walker are chosen at random for every episode. In other words, each task is a distribution over goals and targets, some of which are harder to solve than others. ... All experiments were run in a distributed setup using a replay buffer with 32 CPU actors and 1 CPU learner. ... for each curve, we plot the mean of the best performing configuration averaged across 5 seeds with shaded areas showing the standard error. |
| Hardware Specification | No | All experiments were run in a distributed setup using a replay buffer with 32 CPU actors and 1 CPU learner. |
| Software Dependencies | No | In this section, we analyze the effect of behavior priors experimentally on a number of simulated motor control domains using walkers from the DeepMind control suite (Tassa et al., 2018) developed using the MuJoCo physics engine (Todorov et al., 2012). ... We used separate ADAM optimizers (Kingma and Ba, 2014) for training the critic, policy and behavior prior. |
| Experiment Setup | Yes | Appendix F. Experiment details ... F.1 Hyperparameters used for experiments ... F.1.1 Default parameters ... F.1.2 Task specific parameters ... Actor learning rate: 1e-4. Critic learning rate: 1e-4. Prior learning rate: 1e-4. Target network update period: 100. Batch size: 512. Unroll length: 10. Entropy bonus: λ = 1e-4. Distillation cost: α = 1e-3. Posterior entropy cost: 1e-3. Number of actors: 32 |
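The hyperparameters above center on a KL-regularized objective: the policy is rewarded for task return while a distillation cost α penalizes its divergence from the behavior prior. The sketch below illustrates that regularized actor loss for diagonal-Gaussian policies; the function names are hypothetical and the constants mirror the defaults reported in the table (learning rate 1e-4, distillation cost α = 1e-3), not the paper's exact implementation.

```python
import numpy as np

ALPHA = 1e-3          # distillation cost: weight on KL(policy || prior)
LEARNING_RATE = 1e-4  # default rate shared by actor, critic, and prior optimizers

def kl_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """KL divergence KL(p || q) between two diagonal Gaussians."""
    var_p, var_q = sigma_p ** 2, sigma_q ** 2
    return np.sum(
        np.log(sigma_q / sigma_p)
        + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q)
        - 0.5
    )

def regularized_actor_loss(q_value, mu_pi, sigma_pi, mu_prior, sigma_prior):
    """Actor loss: maximize the critic's Q estimate while penalizing
    divergence of the policy from the behavior prior."""
    kl = kl_gaussians(mu_pi, sigma_pi, mu_prior, sigma_prior)
    return -q_value + ALPHA * kl
```

When the policy matches the prior exactly the KL term vanishes and the loss reduces to `-q_value`; as the policy drifts, the α-weighted penalty pulls it back toward behavior the prior considers likely.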