From Novelty to Imitation: Self-Distilled Rewards for Offline Reinforcement Learning

Authors: Gaurav Chaudhary, Laxmidhar Behera

TMLR 2025

Reproducibility checklist: each entry below lists the variable, the extracted result, and the supporting response drawn from the paper.
Research Type: Experimental
"Experiments on the D4RL benchmark demonstrate that ReLOAD enables robust offline policy learning and achieves performance competitive with traditional reward-annotated methods. Extensive evaluations on the D4RL benchmark (Fu et al., 2020) show that ReLOAD performs competitively with traditional reward-based offline RL approaches. We conduct extensive experiments on standard offline RL and offline IL benchmarks to answer these questions, including D4RL Locomotion, Antmaze, and Adroit. To further assess robustness, we conduct an ablation study by varying the number of high-quality demonstrations and examining the impact on learning performance."
Researcher Affiliation: Academia
Gaurav Chaudhary (EMAIL), Department of Electrical Engineering, Indian Institute of Technology Kanpur; Laxmidhar Behera (EMAIL), Department of Electrical Engineering, Indian Institute of Technology Kanpur
Pseudocode: Yes
Algorithm 1: Self-Supervised Reward Annotation via RND
1: Inputs: Offline dataset D = {(s, a, s′)}, expert transitions De = {(s, s′)}, target network fψ, predictor network gθ, learning rate η, number of epochs Npred, hyperparameters α, β
2: Initialize: fixed target network fψ with parameters ψ; trainable predictor network gθ with parameters θ
3: Step 1: Pretraining the predictor with RND
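The excerpted pseudocode can be sketched concretely. The following is a minimal NumPy sketch, not the authors' implementation: linear maps stand in for the target and predictor networks, the predictor is fit only on expert states, and the reward form α·exp(−β·error) is an assumption about how α and β shape the annotated reward.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim = 4, 8

# Fixed, randomly initialized target network f_psi (a single linear map here).
W_target = rng.normal(size=(obs_dim, feat_dim))
def f(s):
    return s @ W_target

# Trainable predictor g_theta, trained to match f_psi on *expert* data only.
W_pred = rng.normal(size=(obs_dim, feat_dim))

expert_states = rng.normal(size=(64, obs_dim))    # De: expert transitions
offline_states = rng.normal(size=(256, obs_dim))  # D: unlabeled offline data

# Step 1: pretrain the predictor on expert transitions (plain gradient
# descent on the MSE between predictor and frozen target features).
lr, n_epochs = 0.05, 500
for _ in range(n_epochs):
    err = expert_states @ W_pred - f(expert_states)
    grad = expert_states.T @ err / len(expert_states)  # d(MSE)/d(W_pred)
    W_pred -= lr * grad

# Step 2: annotate the offline dataset. States resembling the expert data
# have low prediction error and therefore receive high reward.
alpha, beta = 10.0, 5.0  # values as in Tables 4/6; reward form is assumed
pred_err = np.mean((offline_states @ W_pred - f(offline_states)) ** 2, axis=1)
rewards = alpha * np.exp(-beta * pred_err)
```

The annotated `rewards` would then replace the missing reward labels before running the downstream offline RL algorithm (IQL in the paper).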
Open Source Code: No
The paper does not provide an explicit statement from the authors about releasing their own code for the ReLOAD methodology, nor a direct link to a repository for their specific implementation. It only references using the official IQL implementation for comparison: "We implement ReLOAD in JAX (Bradbury et al., 2018) and use the official IQL implementation: https://github.com/ikostrikov/implicit_q_learning"
Open Datasets: Yes
"Experiments on the D4RL benchmark demonstrate that ReLOAD enables robust offline policy learning and achieves performance competitive with traditional reward-annotated methods. Extensive evaluations on the D4RL benchmark (Fu et al., 2020) show that ReLOAD performs competitively with traditional reward-based offline RL approaches. We conduct extensive experiments on standard offline RL and offline IL benchmarks to answer these questions, including D4RL Locomotion, Antmaze, and Adroit."
Dataset Splits: No
The paper mentions creating
Hardware Specification: Yes
"All of the experiments were conducted on a single NVIDIA A40 GPU. The runtime overhead of ReLOAD is less than a minute (compared to the offline RL algorithm training time), including training the predictor network and annotating offline data."
Software Dependencies: No
The paper mentions "We implement ReLOAD in JAX (Bradbury et al., 2018) and use the official IQL implementation." It names JAX and IQL but does not provide specific version numbers for these software components.
Experiment Setup: Yes

Table 4: Hyperparameters for training RND.
  Setting                        MuJoCo and Antmaze   Adroit
  Hidden dim                     256                  256
  Number of layers               2                    2
  Batch size                     32                   16
  Number of iterations           10^2                 10
  Learning rate                  1e-3                 3e-4
  Number of expert trajectories  1                    1

Table 5: Hyperparameters used in the ReLOAD framework (used across all experiments).
  Hyperparameter        Value   Description
  Offline RL Algorithm  IQL     Choice of offline RL algorithm
  Policy Learning Rate  3e-4    Learning rate for the policy network
  Critic Learning Rate  3e-4    Learning rate for the critic network
  Value Learning Rate   3e-4    Learning rate for the value network
  Discount Factor (γ)   0.99    Discount factor for future rewards
  Training Steps        1e6     Number of training steps for IQL
  Batch Size            256     Batch size for IQL training

Table 6: Task-dependent hyperparameters.
  Task        Expectile   Temperature   α    β
  Locomotion  0.7         6             10   5
  Ant Maze    0.9         10            10   1
  Adroit      0.8         0.5           10   5
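For reproduction purposes, Tables 4 to 6 can be consolidated into a single configuration. The sketch below is a hypothetical layout (key names are illustrative, not taken from the authors' code); "Number of iterations 10^2" is read as 100, an assumption about the garbled source table.

```python
# Table 4: RND training hyperparameters, split by domain.
RND_HPARAMS = {
    "mujoco_antmaze": dict(hidden_dim=256, num_layers=2, batch_size=32,
                           num_iterations=100,  # assumed reading of "10^2"
                           learning_rate=1e-3, num_expert_trajectories=1),
    "adroit":         dict(hidden_dim=256, num_layers=2, batch_size=16,
                           num_iterations=10,
                           learning_rate=3e-4, num_expert_trajectories=1),
}

# Table 5: IQL hyperparameters shared across all experiments.
IQL_HPARAMS = dict(policy_lr=3e-4, critic_lr=3e-4, value_lr=3e-4,
                   discount=0.99, training_steps=1_000_000, batch_size=256)

# Table 6: task-dependent expectile/temperature and reward-shaping alpha/beta.
TASK_HPARAMS = {
    "locomotion": dict(expectile=0.7, temperature=6.0,  alpha=10, beta=5),
    "antmaze":    dict(expectile=0.9, temperature=10.0, alpha=10, beta=1),
    "adroit":     dict(expectile=0.8, temperature=0.5,  alpha=10, beta=5),
}
```

A reproduction script would select the RND entry by domain and merge it with the shared IQL settings and the task-specific expectile, temperature, α, and β.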