Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data

Authors: Zhiyuan Zhou, Andy Peng, Qiyang Li, Sergey Levine, Aviral Kumar

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental 5 EXPERIMENTAL EVALUATION The goal of our experiments is to study how well WSRL is able to fine-tune online without offline data retention. We also ablate the design decisions in WSRL to understand their efficacy. Concretely, we study the following research questions: (1) Can WSRL enable efficient fine-tuning in the no-retention setting? (2) How does WSRL compare with methods that do retain offline data? (3) How critical is the warmup phase in WSRL? (4) How important is it to use an online RL algorithm for online fine-tuning? And (5) how important is it to pre-train the policy, the value function, or both? We experiment on the Antmaze, Kitchen, and Adroit tasks from D4RL (Fu et al., 2020a) and the Gym MuJoCo locomotion tasks. More discussion is in Appendix C.
Researcher Affiliation Academia Zhiyuan Zhou1, Andy Peng1, Qiyang Li1, Sergey Levine1, Aviral Kumar2; 1UC Berkeley, 2Carnegie Mellon University
Pseudocode Yes Algorithm 1 WSRL: Warm Start Reinforcement Learning
Open Source Code Yes Code for WSRL is released at https://github.com/zhouzypaul/wsrl.
Open Datasets Yes We experiment on the Antmaze, Kitchen, and Adroit tasks from D4RL (Fu et al., 2020a) and the Gym MuJoCo locomotion tasks.
Dataset Splits No The paper discusses using D4RL datasets for pre-training and collecting online rollouts but does not provide explicit train/test/validation splits for these datasets or for the online collected data. For example: "We pre-train 1M steps on Antmaze, 20k steps on Adroit, 250k steps on Kitchen and Mujoco locomotion." This specifies pre-training data size, not splits.
Hardware Specification No We thank TPU Research Cloud (TRC) and Google Cloud for generous compute donations that made this work possible. While this indicates the compute provider and general hardware type, it does not specify exact models (e.g., TPU v3) or quantities of the hardware used.
Software Dependencies No The paper details algorithmic choices and hyperparameters (e.g., "online SAC implementation in RLPD", "UTD of 4", "actor learning rate of 1e-4"), but it does not specify versions for core software libraries such as Python, PyTorch, TensorFlow, or CUDA.
Experiment Setup Yes WSRL Hyperparameters. We use 5K warmup steps (K = 5,000). For the online RL algorithm in WSRL, we use the online SAC (Haarnoja et al., 2018b) implementation in RLPD (Ball et al., 2023) with a UTD of 4 and an actor delay of 4 (the actor is updated once for every four critic steps), a batch size of 256, an actor learning rate of 1e-4, a critic learning rate of 3e-4, and a temperature learning rate of 1e-4. We use an ensemble of 10 Q-functions and predict the Q-value by randomly sub-sampling 2 of them and taking the min over those 2 Q-functions (Chen et al., 2021).
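The two mechanisms quoted in the rows above — the K = 5,000-step warmup phase and the min-over-a-random-subsample Q prediction (Chen et al., 2021) — can be sketched in plain Python. This is a hedged illustration only: the `env`, `pretrained_policy`, and `replay_buffer` interfaces are hypothetical stand-ins, not the paper's released RLPD-based implementation.

```python
import random

K = 5_000       # warmup steps (K = 5,000 in the paper)
NUM_Q = 10      # Q-ensemble size reported in the paper
SUBSAMPLE = 2   # Q-functions randomly sub-sampled per prediction

def warmup(env, pretrained_policy, replay_buffer, num_steps=K):
    """Seed a fresh online replay buffer with rollouts from the frozen
    pre-trained policy before any online RL updates begin."""
    obs = env.reset()
    for _ in range(num_steps):
        action = pretrained_policy(obs)
        # Hypothetical (obs, reward, done) step interface for illustration.
        next_obs, reward, done = env.step(action)
        replay_buffer.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
    return replay_buffer

def subsampled_min_q(q_values, rng=random):
    """Predict Q as the min over SUBSAMPLE randomly chosen ensemble members."""
    idx = rng.sample(range(len(q_values)), SUBSAMPLE)
    return min(q_values[i] for i in idx)
```

After `warmup` fills the buffer, online updates would proceed with only the online data, using `subsampled_min_q` in the critic target; taking the min over a small random subset rather than the whole ensemble keeps the pessimism of clipped double-Q while allowing a high UTD.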