Accelerating Goal-Conditioned Reinforcement Learning Algorithms and Research

Authors: Michał Bortkiewicz, Władysław Pałucki, Vivek Myers, Tadeusz Dziarmaga, Tomasz Arczewski, Łukasz Kuciński, Benjamin Eysenbach

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By utilizing GPU-accelerated replay buffers, environments, and a stable contrastive RL algorithm, we reduce training time by up to 22×. Additionally, we assess key design choices in contrastive RL, identifying those that most effectively stabilize and enhance training performance. With this approach, we provide a foundation for future research in self-supervised GCRL, enabling researchers to quickly iterate on new ideas and evaluate them in diverse and challenging environments. ... Extensive empirical analysis: we evaluate important CRL design choices, focusing on key algorithm components, architecture scaling, and training in data-rich settings. ... The goal of our experiments is twofold: (1) to establish a baseline for the proposed JaxGCRL environments, and (2) to evaluate CRL performance in relation to key design choices. In Section 5.1, we define the setup that is used for most of the experiments unless explicitly stated otherwise. First, in Section 5.2, we report baseline results on JaxGCRL.
Researcher Affiliation | Academia | 1 Warsaw University of Technology, 2 University of Warsaw, 3 UC Berkeley, 4 Jagiellonian University, 5 Polish Academy of Sciences, 6 IDEAS NCBR, 7 Princeton University
Pseudocode | Yes | Algorithm 1: Contrastive Reinforcement Learning
Open Source Code | Yes | Code: https://github.com/MichalBortkiewicz/JaxGCRL. ... Reproducibility Statement. All experiments can be replicated using the provided publicly available JaxGCRL code at https://github.com/MichalBortkiewicz/JaxGCRL.
Open Datasets | Yes | The main goal of this paper is to introduce JaxGCRL: an extremely fast GPU-accelerated codebase and benchmark for effective self-supervised GCRL research. ... To evaluate the performance of GCRL methods, we propose the JaxGCRL benchmark consisting of 8 diverse continuous control environments. ... Some benchmarks that have seen adoption include OpenAI Gym/Gymnasium (Brockman et al., 2022; Towers et al., 2024), DeepMind Control Suite (Tassa et al., 2018), and D4RL (Fu et al., 2021).
Dataset Splits | No | Our experiments use the JaxGCRL suite of simulated environments described in Section 4.2. We evaluate algorithms in an online setting for 50M environment steps. ... For every environment, we sample evaluation goals from the same distribution as the training ones and use a replay buffer of size 10M for CRL, TD3, TD3+HER, SAC, and SAC+HER. ... All experiments are conducted for 50 million environment steps. The text describes online interaction and goal sampling, but not fixed dataset splits for training/validation/testing in a supervised learning sense.
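The HER variants above relabel goals rather than relying on fixed dataset splits. A minimal NumPy sketch of the "final" relabeling strategy referenced in the experimental setup is shown below; the transition layout, `relabel_final` name, and distance-threshold reward are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def relabel_final(trajectory, threshold=0.05):
    """HER 'final' strategy (sketch): replace each transition's goal with
    the state achieved at the end of the trajectory, then recompute a
    hypothetical sparse reward against the new goal."""
    final_state = trajectory[-1]["next_state"]
    relabeled = []
    for t in trajectory:
        new_t = dict(t)
        new_t["goal"] = final_state
        # reward 1 if next_state lies within `threshold` of the new goal
        new_t["reward"] = float(
            np.linalg.norm(t["next_state"] - final_state) < threshold
        )
        relabeled.append(new_t)
    return relabeled

traj = [
    {"state": np.array([0.0, 0.0]), "next_state": np.array([0.1, 0.0]),
     "goal": np.array([1.0, 1.0]), "reward": 0.0},
    {"state": np.array([0.1, 0.0]), "next_state": np.array([0.2, 0.1]),
     "goal": np.array([1.0, 1.0]), "reward": 0.0},
]
relabeled = relabel_final(traj)
```

Because the final transition's achieved state always equals the new goal, relabeled trajectories are guaranteed to contain at least one successful transition, which densifies the learning signal for sparse-reward goal-reaching tasks.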
Hardware Specification | Yes | an experiment with 10 million environment steps lasts only around 10 minutes on a single GPU. ... We used an NVIDIA V100 GPU for this experiment.
Software Dependencies | No | Most of these works rely on JAX (Bradbury et al., 2018; Heek et al., 2023; Hennigan et al., 2020), which enables JIT compilation, operator fusion, and other components necessary for efficient vectorized code execution. ... Importantly, the BRAX physics simulator differs from the original MuJoCo, so performance numbers here may vary slightly from prior work. ... JaxGCRL is a fast implementation of state-based self-supervised reinforcement learning algorithms and a new benchmark of GPU-accelerated environments. Our implementation leverages the power of GPU-accelerated simulators (BRAX and MuJoCo MJX) (Freeman et al., 2021; Todorov et al., 2012). While JAX, BRAX, and MuJoCo MJX are mentioned, no specific version numbers are provided for reproducibility in the text.
Experiment Setup | Yes | 5.1 EXPERIMENTAL SETUP Our experiments use the JaxGCRL suite of simulated environments described in Section 4.2. We evaluate algorithms in an online setting for 50M environment steps. We compare CRL with Soft Actor-Critic (SAC) (Haarnoja et al., 2018), SAC with Hindsight Experience Replay (HER) (Andrychowicz et al., 2017), TD3 (Fujimoto et al., 2018), TD3+HER, and PPO (Schulman et al., 2017). For algorithms with HER, we use the final relabeling strategy, i.e. relabeling goals with states achieved at the end of the trajectory. In the majority of experiments, we use CRL with an L2 energy function, a symmetric InfoNCE objective, and a tuneable entropy coefficient for all methods. See Appendix B for details. ... The parameters used for benchmarking experiments can be found in Table 2.
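The "L2 energy function with a symmetric InfoNCE objective" mentioned in the setup can be sketched in a few lines. This NumPy version is a simplified illustration under assumed conventions (logits are negative pairwise L2 distances between state and goal embeddings; matched pairs sit on the diagonal), not the paper's JAX implementation:

```python
import numpy as np

def l2_energy(phi, psi):
    # logits[i, j] = -||phi_i - psi_j||_2 : matched pairs get the highest energy
    diff = phi[:, None, :] - psi[None, :, :]
    return -np.linalg.norm(diff, axis=-1)

def symmetric_infonce(phi, psi):
    """Symmetric InfoNCE (sketch): cross-entropy over rows (each state
    classifies its goal among the batch) plus cross-entropy over columns
    (each goal classifies its state), diagonal entries as labels."""
    logits = l2_energy(phi, psi)

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return xent(logits) + xent(logits.T)

rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 8))   # state representations (batch of 4)
psi = rng.normal(size=(4, 8))   # goal representations
loss = symmetric_infonce(phi, psi)
```

Symmetrizing the objective (rows plus columns) treats states and goals interchangeably as anchors; the paper evaluates this and other contrastive design choices empirically.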