Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations

Authors: Cong Lu, Philip J. Ball, Tim G. J. Rudner, Jack Parker-Holder, Michael A. Osborne, Yee Whye Teh

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we establish simple baselines for continuous control in the visual domain and introduce a suite of benchmarking tasks for offline reinforcement learning from visual observations... Using this suite of benchmarking tasks, we show that simple modifications to two popular vision-based online reinforcement learning algorithms, DreamerV2 and DrQ-v2, suffice to outperform existing offline RL methods and establish competitive baselines for continuous control in the visual domain. We rigorously evaluate these algorithms and perform an empirical evaluation of the differences between state-of-the-art model-based and model-free offline RL methods for continuous control from visual observations. All code and data used in this evaluation are open-sourced to facilitate progress in this domain.
Researcher Affiliation Academia Cong Lu EMAIL University of Oxford; Philip J. Ball EMAIL University of Oxford; Tim G. J. Rudner EMAIL University of Oxford; Jack Parker-Holder EMAIL University of Oxford; Michael A. Osborne EMAIL University of Oxford; Yee Whye Teh EMAIL University of Oxford
Pseudocode No The paper describes algorithms and methods using prose and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks. For instance, Section B 'Algorithmic Details' describes the modifications to DreamerV2 and DrQ+BC.
Open Source Code Yes All code and data used in this evaluation are open-sourced to facilitate progress in this domain. Open-sourced code and data for the v-d4rl benchmarking suite are available at: https://github.com/conglu1997/v-d4rl.
Open Datasets Yes All code and data used in this evaluation are open-sourced to facilitate progress in this domain. Open-sourced code and data for the v-d4rl benchmarking suite are available at: https://github.com/conglu1997/v-d4rl.
Dataset Splits Yes From these environments, we follow a d4rl-style procedure in considering five different behavioral policies for gathering the data. As in d4rl, the base policy used to gather the data is Soft Actor-Critic (SAC, Haarnoja et al. (2018)) on the proprioceptive states. We consider the following five settings: (1) random — uniform samples from the action space; (2) medium-replay (mixed) — the initial segment of the replay buffer until the SAC agent reaches medium-level performance; (3) medium — rollouts of a fixed medium-performance policy; (4) expert — rollouts of a fixed expert-level policy; (5) medium-expert (medexp) — concatenation of the medium and expert datasets above. ... By default, each dataset consists of 100,000 total transitions (often 10× less than in d4rl) ... The cheetah and humanoid medium-replay datasets consist of 200,000 and 600,000 transitions respectively... Full statistics of each dataset are given in Appendix A.
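A minimal sketch of the dataset composition described in this row: the medium-expert setting is the concatenation of the medium and expert datasets. The dictionary keys, array shapes, and sizes below are illustrative stand-ins, not the actual v-d4rl on-disk format (the real datasets hold 100,000 transitions of 64×64 image observations).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_toy_dataset(n, obs_shape=(8, 8, 3), act_dim=6):
    """Build a toy transition dataset of image observations, actions, rewards."""
    return {
        "observation": rng.integers(0, 256, size=(n, *obs_shape), dtype=np.uint8),
        "action": rng.standard_normal((n, act_dim)).astype(np.float32),
        "reward": rng.standard_normal(n).astype(np.float32),
    }

medium = make_toy_dataset(1_000)  # rollouts of a fixed medium-performance policy
expert = make_toy_dataset(1_000)  # rollouts of a fixed expert-level policy

# medium-expert (medexp): concatenation of the two datasets above.
medexp = {k: np.concatenate([medium[k], expert[k]]) for k in medium}
print(medexp["observation"].shape[0])  # 2000
```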
Hardware Specification Yes The experiments in this paper were run on NVIDIA V100 GPUs.
Software Dependencies No The paper mentions software components such as DreamerV2, DrQ-v2, and NumPy (Harris et al., 2020), and provides links to their repositories, but it does not specify concrete version numbers for any of these software dependencies.
Experiment Setup Yes Table 7 lists the hyperparameters used for Offline DV2. For other hyperparameter values, we used the default values in the DreamerV2 repository. ... Parameter = Value(s): ensemble member count (K) = 7; imagination horizon (H) = 5; batch size = 64; sequence length (L) = 50; action repeat = 2; observation size = [64, 64]; discount (γ) = 0.99; optimizer = Adam; learning rate = {model: 3×10⁻⁴, actor-critic: 8×10⁻⁵}; model training epochs = 800; agent training epochs = 2,400; uncertainty penalty = mean disagreement; uncertainty weight (λ) ∈ [3, 10]. ... Table 9 lists the hyperparameters used for DrQ+BC. Parameter = Value: batch size = 256; action repeat = 2; observation size = [84, 84]; discount (γ) = 0.99; optimizer = Adam; learning rate = 1×10⁻⁴; agent training epochs = 256; n-step returns = 3; exploration stddev. clip = 0.3; exploration stddev. schedule = linear(1.0, 0.1, 500000); BC weight (α) = 2.5.
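The quoted hyperparameter tables can be transcribed into plain configuration dicts, e.g. for re-running the experiments. This is a sketch: the key names are illustrative choices, and only the values come from Tables 7 and 9 as quoted above.

```python
# Offline DV2 hyperparameters (Table 7); other values follow the
# DreamerV2 repository defaults.
OFFLINE_DV2 = {
    "ensemble_member_count_K": 7,
    "imagination_horizon_H": 5,
    "batch_size": 64,
    "sequence_length_L": 50,
    "action_repeat": 2,
    "observation_size": (64, 64),
    "discount_gamma": 0.99,
    "optimizer": "Adam",
    "lr_model": 3e-4,
    "lr_actor_critic": 8e-5,
    "model_training_epochs": 800,
    "agent_training_epochs": 2400,
    "uncertainty_penalty": "mean disagreement",
    "uncertainty_weight_lambda_range": (3, 10),  # λ tuned within [3, 10]
}

# DrQ+BC hyperparameters (Table 9).
DRQ_BC = {
    "batch_size": 256,
    "action_repeat": 2,
    "observation_size": (84, 84),
    "discount_gamma": 0.99,
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "agent_training_epochs": 256,
    "n_step_returns": 3,
    "exploration_stddev_clip": 0.3,
    "exploration_stddev_schedule": "linear(1.0, 0.1, 500000)",
    "bc_weight_alpha": 2.5,
}

print(OFFLINE_DV2["lr_model"], DRQ_BC["bc_weight_alpha"])  # 0.0003 2.5
```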