MOORL: A Framework for Integrating Offline-Online Reinforcement Learning
Authors: Gaurav Chaudhary, Washim Uddin Mondal, Laxmidhar Behera
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 28 tasks from the D4RL and V-D4RL benchmarks validate its effectiveness, showing consistent improvements over state-of-the-art offline and hybrid RL baselines. |
| Researcher Affiliation | Academia | Gaurav Chaudhary, Department of Electrical Engineering, Indian Institute of Technology Kanpur; Washim Uddin Mondal, Department of Electrical Engineering, Indian Institute of Technology Kanpur; Laxmidhar Behera, Department of Electrical Engineering, Indian Institute of Technology Kanpur |
| Pseudocode | Yes | Algorithm 1 (MOORL: Meta Offline-Online Reinforcement Learning): 1: Initialize meta-policy parameters: actor ϕ_meta, critic θ_meta, and temperature α. 2: Offline dataset D_offline and empty online buffer D_online. 3: Meta learning rate η_meta, inner-loop learning rate η, number of iterations N. 4: for n = 1 to N do 5: Select buffer: choose D_offline or D_online as the data buffer D_i. 6: Inner-loop adaptation: 7: collect a transition in the online environment and store it in D_online; 8: sample a mini-batch from the selected buffer; 9: perform K inner actor (ϕ) and critic (θ) updates using data from D_i. 10: Meta-update: 11: update the meta-parameters of the actor and critic via ϕ_meta ← ϕ_meta − η_meta ∇_{ϕ_meta} L(ϕ) and θ_meta ← θ_meta − η_meta ∇_{θ_meta} L(θ), respectively. 12: end for |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code, nor does it provide a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | We evaluate the proposed MOORL approach on the D4RL benchmark (Fu et al., 2020) and V-D4RL (Lu et al., 2022), comparing its performance against state-of-the-art methods |
| Dataset Splits | Yes | To evaluate MOORL's performance, we select a range of D4RL tasks to assess robustness across diverse data distributions: D4RL Locomotion (Fu et al., 2020): This set includes 15 dense-reward locomotion tasks, with offline data covering varying levels of optimality, from expert to random trajectories. D4RL Maze-Navigation (Fu et al., 2020): We utilize 6 Ant Maze navigation tasks with sparse binary reward structures, each with different complexities. D4RL Adroit (Fu et al., 2020): The tasks in this set (Pen, Door, Hammer) involve complex manipulation and sparse rewards, with offline data consisting of expert-level trajectories. V-D4RL: DeepMind Control Suite (DMC): The DMC tasks involve controlling physics-based agents with dense rewards that encourage smooth, efficient movement. The standard evaluation metric is the normalized score, computed using the agent's return normalized against the performance of a well-trained SAC policy. The datasets include expert and medium policies, allowing evaluation of an agent's ability to learn from varying data quality. |
| Hardware Specification | Yes | RLPD takes approx. 0.5 sec while MOORL takes 0.05 sec per timestep when run on a single RTX A4000 GPU. |
| Software Dependencies | No | To implement the proposed framework, we use entropy-regularized SAC (Haarnoja et al., 2018) as the base RL algorithm and apply Reptile (Nichol & Schulman, 2018) for meta-updates, improving generalization across distributions. The paper mentions algorithms and optimizers but does not provide specific version numbers for software libraries like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | Table 5: MOORL Hyperparameters — Batch size: 256; Discount (γ): 0.99; Optimizer: Adam; Learning rate: 3×10⁻⁴; Critic EMA weight (ρ): 0.005; Inner gradient steps (K): 4; Network width: 256 units; Number of layers: 2; Initial entropy temperature (α): 1.0; Target entropy: −dim(A). Each training iteration consists of K = 4 SAC updates followed by a Reptile-style (Nichol & Schulman, 2018) meta-update. A learning rate of 3×10⁻⁴ is used for inner-loop adaptation, while the meta-update learning rate is dynamically adjusted (Nichol & Schulman, 2018) based on the ratio of the current timestep to the total timesteps. We maintain an exponentially moving average target Q-network with an update weight of ρ = 0.005. Hyperparameters are summarized in Table 5. |
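The training loop extracted above (K inner gradient steps on a sampled buffer, then a Reptile-style meta-update whose step size decays with the timestep ratio) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`moorl_sketch`, `inner_loss_grad`), the simple quadratic stand-in loss, and the alternating buffer-selection rule are all assumptions made here for brevity; the paper's actual inner updates are SAC actor/critic steps.

```python
import numpy as np

def inner_loss_grad(params, batch):
    # Stand-in gradient of a quadratic loss ||params - mean(batch)||^2;
    # this replaces the SAC actor/critic gradients used in the paper.
    return 2.0 * (params - batch.mean(axis=0))

def moorl_sketch(offline_data, n_iters=100, k_inner=4,
                 inner_lr=3e-4, meta_lr_init=1.0, rng=None):
    """Reptile-style offline-online meta-training loop (illustrative)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    meta_params = np.zeros(offline_data.shape[1])
    online_buffer = []
    for n in range(n_iters):
        # "Collect" one online transition and store it (random stand-in).
        online_buffer.append(rng.normal(size=meta_params.shape))
        # Select buffer: a simple alternation between D_offline and D_online
        # (the selection rule here is an assumption, not the paper's).
        if n % 2 == 0 or len(online_buffer) < 8:
            buffer = offline_data
        else:
            buffer = np.stack(online_buffer)
        # Inner-loop adaptation: K gradient steps starting from meta-params.
        params = meta_params.copy()
        for _ in range(k_inner):
            idx = rng.integers(0, len(buffer), size=min(256, len(buffer)))
            params -= inner_lr * inner_loss_grad(params, buffer[idx])
        # Reptile meta-update: move meta-params toward the adapted params,
        # with the meta step size decayed by the timestep ratio.
        meta_lr = meta_lr_init * (1.0 - n / n_iters)
        meta_params += meta_lr * (params - meta_params)
    return meta_params
```

With an offline dataset drawn around a nonzero mean (e.g. `moorl_sketch(np.random.default_rng(1).normal(loc=3.0, size=(1000, 4)))`), the meta-parameters drift toward that mean, showing the interpolation `θ_meta ← θ_meta + η_meta(θ̃ − θ_meta)` that a Reptile-style meta-update performs in place of an explicit meta-gradient.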