POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning

Authors: Frederik Schubert, Carolin Benjamins, Sebastian Döhler, Bodo Rosenhahn, Marius Lindauer

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental In our main experiments, we evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains. We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19% on average and up to 40% in the best case.
Researcher Affiliation Academia Frederik Schubert EMAIL Institute for Information Processing Leibniz University Hannover; Carolin Benjamins EMAIL Institute of Artificial Intelligence Leibniz University Hannover; Sebastian Döhler EMAIL Institute for Information Processing Leibniz University Hannover; Bodo Rosenhahn EMAIL Institute for Information Processing Leibniz University Hannover; Marius Lindauer EMAIL Institute of Artificial Intelligence Leibniz University Hannover
Pseudocode Yes Algorithm 1 URL+POLTER
Require: Initialized URL algorithm, policy π_θ,0, replay buffer D, pretraining steps N_PT ▷ URL init
Require: Empty ensemble E = ∅, ensemble snapshot time steps T_E, regularization weight α ▷ POLTER init
1: for t = 0 ... N_PT − 1 do ▷ Unsupervised pretraining
2:   if beginning of episode then
3:     Observe initial state s_t ∼ p_0(s_0)
4:   if t ∈ T_E then ▷ Update ensemble policy
5:     Extend ensemble E ← E ∪ {π_θ,t}
6:     Update ensemble policy π̄
7:   Choose action a_t ∼ π_θ,t(a_t | s_t)
8:   Observe next state s_{t+1} ∼ p(s_{t+1} | s_t, a_t)
9:   Add transition to replay buffer D ← D ∪ {(s_t, a_t, s_{t+1})}
10:  Sample a minibatch B ∼ D
11:  Compute loss L_POLTER = L_URL(π_θ,t) + α · D_KL(π̄ ‖ π_θ,t)
12:  Update policy π_θ,t with L_POLTER
13: ... ▷ Supervised finetuning on task T
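The regularized loss in the pseudocode (step 11) can be sketched for discrete action distributions as follows. This is a minimal illustration, not the authors' implementation (POLTER operates on continuous DDPG policies); the uniform-mixture ensemble policy and all function names here are assumptions made for the sake of the example.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def polter_loss(url_loss, snapshot_dists, current_dist, alpha=1.0):
    """L_POLTER = L_URL(pi_theta) + alpha * D_KL(pi_bar || pi_theta).

    The ensemble policy pi_bar is taken here as the uniform mixture of
    the snapshot policies' action distributions (an assumption for this
    discrete sketch).
    """
    pi_bar = np.mean(np.asarray(snapshot_dists, dtype=float), axis=0)
    return url_loss + alpha * kl_divergence(pi_bar, current_dist)
```

When the current policy already matches the ensemble mixture, the KL term vanishes and the loss reduces to the plain URL objective; any drift away from the ensemble adds a penalty scaled by α (the paper sets α = 1).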
Open Source Code Yes We use the provided code from Laskin et al. (2021) to aid reproducibility and provide our source code in the supplementary material.
Open Datasets Yes Extensive experiments on the Unsupervised Reinforcement Learning Benchmark (URLB) (Laskin et al., 2021) show that our method improves the Interquartile Mean (IQM) (Agarwal et al., 2021) performance of data- and knowledge-based URL algorithms on average by 19%.
Dataset Splits No The paper does not provide typical train/test/validation dataset splits as understood in supervised learning contexts. In reinforcement learning, the data is generated dynamically. While it specifies evaluation on '10 episodes after finetuning for 100 k steps', it does not define fixed dataset splits for reproducibility in the way the question implies.
Hardware Specification Yes All experiments were run on our internal compute cluster on NVIDIA RTX 1080 Ti and NVIDIA RTX 2080 Ti GPUs and had 64GB of RAM and 10 CPU cores.
Software Dependencies No The paper refers to using "the provided code from Laskin et al. (2021)" but does not specify any software dependencies with version numbers. For example, it does not list Python version, PyTorch version, or any other library versions used.
Experiment Setup Yes For its hyperparameters and the setup of POLTER, see Appendix E. We follow Agarwal et al. (2021) in their evaluation using the IQM as our main metric. ... POLTER Hyperparameters: During pretraining, we construct the mixture ensemble policy π̄ with k = 7 members, adding one member at each of the ensemble snapshot time steps T_E = {25k, 50k, 100k, 200k, 400k, 800k, 1.6M}. ... We set the regularization strength α = 1 and use the same hyperparameters for each of the three domains unless specified otherwise. ... Table 2: Hyperparameters for the DDPG algorithm: replay buffer capacity 1e6; action repeat 1; seed frames 4000; n-step returns 3; batch size 1024; discount factor γ = 0.99; optimizer Adam; learning rate 1e-4; agent update frequency 2; critic target EMA rate 0.01; feature size 1024; hidden size 1024; exploration noise std clip 0.3; exploration noise std value 0.2; pretraining frames 2e6; finetuning frames 1e5.
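The IQM metric named above, the mean of the middle 50% of scores, can be sketched as follows. This is a simplified illustration of the statistic from Agarwal et al. (2021), not the paper's evaluation code; the function name and the simple index-based truncation are assumptions.

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: mean of the scores remaining after
    discarding the bottom 25% and top 25% of the sorted values
    (a sketch; index rounding conventions may differ)."""
    s = np.sort(np.asarray(scores, dtype=float))
    n = len(s)
    lo = int(np.floor(n * 0.25))
    hi = int(np.ceil(n * 0.75))
    return float(np.mean(s[lo:hi]))
```

For example, `iqm([1, 2, 3, 4, 5, 6, 7, 8])` averages only `[3, 4, 5, 6]`, giving 4.5; `scipy.stats.trim_mean(scores, 0.25)` computes the same statistic.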