MOORL: A Framework for Integrating Offline-Online Reinforcement Learning
Authors: Gaurav Chaudhary, Washim Uddin Mondal, Laxmidhar Behera
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on 28 tasks from the D4RL and V-D4RL benchmarks validate its effectiveness, showing consistent improvements over state-of-the-art offline and hybrid RL baselines. |
| Researcher Affiliation | Academia | Gaurav Chaudhary, Department of Electrical Engineering, Indian Institute of Technology Kanpur; Washim Uddin Mondal, Department of Electrical Engineering, Indian Institute of Technology Kanpur; Laxmidhar Behera, Department of Electrical Engineering, Indian Institute of Technology Kanpur |
| Pseudocode | Yes | Algorithm 1 (MOORL: Meta Offline-Online Reinforcement Learning): 1: Initialize meta-policy parameters: actor ϕ_meta, critic θ_meta, and temperature α. 2: Offline dataset D_offline and empty online buffer D_online. 3: Meta learning rate η_meta, inner-loop learning rate η, number of iterations N. 4: for n = 1 to N do 5: Select buffer: choose D_offline or D_online as the data buffer D_i. 6: Inner-loop adaptation: 7: collect a transition in the online environment and store it in D_online; 8: sample a mini-batch from the selected buffer; 9: perform K inner actor (ϕ) and critic (θ) updates using data from D_i. 10: Meta-update: 11: update the meta-parameters of the actor and critic via ϕ_meta ← ϕ_meta − η_meta ∇_{ϕ_meta} L(ϕ) and θ_meta ← θ_meta − η_meta ∇_{θ_meta} L(θ), respectively. 12: end for |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code, nor does it provide a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | We evaluate the proposed MOORL approach on the D4RL benchmark (Fu et al., 2020) and V-D4RL (Lu et al., 2022), comparing its performance against state-of-the-art methods |
| Dataset Splits | Yes | To evaluate MOORL's performance, we select a range of D4RL tasks to assess robustness across diverse data distributions: D4RL Locomotion (Fu et al., 2020): This set includes 15 dense-reward locomotion tasks, with offline data covering varying levels of optimality, from expert to random trajectories. D4RL Maze-Navigation (Fu et al., 2020): We utilize 6 Ant Maze navigation tasks with sparse binary reward structures, each with different complexities. D4RL Adroit (Fu et al., 2020): The tasks in this set (Pen, Door, Hammer) involve complex manipulation and sparse rewards, with offline data consisting of expert-level trajectories. V-D4RL: DeepMind Control Suite (DMC): The DMC tasks involve controlling physics-based agents with dense rewards that encourage smooth, efficient movement. The standard evaluation metric is the normalized score, computed using the agent's return normalized against the performance of a well-trained SAC policy. The datasets include expert and medium policies, allowing evaluation of an agent's ability to learn from varying data quality. |
| Hardware Specification | Yes | RLPD takes approx. 0.5 sec while MOORL takes 0.05 sec per timestep when run on a single RTX A4000 GPU. |
| Software Dependencies | No | To implement the proposed framework, we use entropy-regularized SAC (Haarnoja et al., 2018) as the base RL algorithm and apply Reptile (Nichol & Schulman, 2018) for meta-updates, improving generalization across distributions. The paper mentions algorithms and optimizers but does not provide specific version numbers for software libraries like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | Table 5: MOORL Hyperparameters — Batch size: 256; Discount (γ): 0.99; Optimizer: Adam; Learning rate: 3×10⁻⁴; Critic EMA weight (ρ): 0.005; Inner gradient steps (K): 4; Network width: 256 units; Number of layers: 2; Initial entropy temperature (α): 1.0; Target entropy: −dim(A). Each training iteration consists of K = 4 SAC updates followed by a Reptile-style (Nichol & Schulman, 2018) meta-update. A learning rate of 3×10⁻⁴ is used for inner-loop adaptation, while the meta-update learning rate is dynamically adjusted (Nichol & Schulman, 2018) based on the ratio of the current timestep to the total timesteps. We maintain an exponentially moving average target Q-network with an update weight of ρ = 0.005. Hyperparameters are summarized in Table 5. |
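The training loop extracted above (K inner gradient steps on a sampled buffer, then a Reptile-style meta-update whose step size decays with the timestep ratio) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`moorl_sketch`, `inner_loss_grad`), the simple quadratic stand-in loss, and the alternating buffer-selection rule are all assumptions made here for brevity; the paper's actual inner updates are SAC actor/critic steps.

```python
import numpy as np

def inner_loss_grad(params, batch):
    # Stand-in gradient of a quadratic loss ||params - mean(batch)||^2;
    # this replaces the SAC actor/critic gradients used in the paper.
    return 2.0 * (params - batch.mean(axis=0))

def moorl_sketch(offline_data, n_iters=100, k_inner=4,
                 inner_lr=3e-4, meta_lr_init=1.0, rng=None):
    """Reptile-style offline-online meta-training loop (illustrative)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    meta_params = np.zeros(offline_data.shape[1])
    online_buffer = []
    for n in range(n_iters):
        # "Collect" one online transition and store it (random stand-in).
        online_buffer.append(rng.normal(size=meta_params.shape))
        # Select buffer: a simple alternation between D_offline and D_online
        # (the selection rule here is an assumption, not the paper's).
        if n % 2 == 0 or len(online_buffer) < 8:
            buffer = offline_data
        else:
            buffer = np.stack(online_buffer)
        # Inner-loop adaptation: K gradient steps starting from meta-params.
        params = meta_params.copy()
        for _ in range(k_inner):
            idx = rng.integers(0, len(buffer), size=min(256, len(buffer)))
            params -= inner_lr * inner_loss_grad(params, buffer[idx])
        # Reptile meta-update: move meta-params toward the adapted params,
        # with the meta step size decayed by the timestep ratio.
        meta_lr = meta_lr_init * (1.0 - n / n_iters)
        meta_params += meta_lr * (params - meta_params)
    return meta_params
```

With an offline dataset drawn around a nonzero mean (e.g. `moorl_sketch(np.random.default_rng(1).normal(loc=3.0, size=(1000, 4)))`), the meta-parameters drift toward that mean, showing the interpolation `θ_meta ← θ_meta + η_meta(θ̃ − θ_meta)` that a Reptile-style meta-update performs in place of an explicit meta-gradient.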