SR-Reward: Taking The Path More Traveled
Authors: Seyed Mahdi B. Azad, Zahra Padar, Gabriel Kalweit, Joschka Boedecker
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on D4RL as well as Maniskill Robot Manipulation environments, achieving competitive results compared to offline RL algorithms with access to true rewards and imitation learning (IL) techniques like behavioral cloning. Code available at: https://github.com/Erfi/SR-Reward |
| Researcher Affiliation | Academia | Seyed Mahdi B. Azad EMAIL Department of Computer Science University of Freiburg; Zahra Padar EMAIL Department of Computer Science University of Freiburg; Gabriel Kalweit EMAIL Department of Computer Science University of Freiburg; Joschka Boedecker EMAIL Department of Computer Science University of Freiburg |
| Pseudocode | Yes | Algorithm 1 shows the pseudocode for training the SR-Reward and the offline RL in the same loop. |
| Open Source Code | Yes | Code available at: https://github.com/Erfi/SR-Reward |
| Open Datasets | Yes | We evaluate our method on D4RL as well as Maniskill Robot Manipulation environments... We use the offline datasets from D4RL (Fu et al., 2020), and follow their normalization procedures... For Maniskill2 environments (Pick Cube, Stack Cube, Turn Faucet)... |
| Dataset Splits | No | The paper uses well-known benchmark datasets (D4RL, Maniskill2) but does not explicitly describe how these datasets are split into training, validation, and test sets. It describes the evaluation protocol (e.g., '25 evaluation rollouts', '50 fresh rollouts') but not dataset splitting. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Hyperparameters remain largely consistent across environments, with key parameters listed in Table 2 (Appendix D). Each agent is trained for one million gradient steps (two million gradient steps for ManiSkill environments), with five different random seeds used per task. |
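The "Pseudocode" row above refers to the paper's Algorithm 1, which trains the SR-Reward and the offline RL agent in the same loop. As a rough intuition for the successor-representation component (not the paper's implementation; a minimal tabular sketch with hypothetical state indexing, learning rate, and reward definition), the SR can be learned by TD updates on demonstration transitions, and a demonstration-derived reward can then favor frequently visited states, i.e. "the path more traveled":

```python
import numpy as np

def learn_sr(transitions, n_states, gamma=0.99, alpha=0.1, epochs=500):
    """Tabular successor representation learned by TD on demo transitions.

    M[s, s2] estimates the expected discounted number of future visits
    to state s2 when starting from state s and following the demo policy.
    """
    M = np.zeros((n_states, n_states))
    one_hot = np.eye(n_states)
    for _ in range(epochs):
        for s, s_next in transitions:
            # TD target: immediate occupancy of s plus discounted SR of s_next.
            td_target = one_hot[s] + gamma * M[s_next]
            M[s] += alpha * (td_target - M[s])
    return M

# Toy demo data: a 4-state chain 0 -> 1 -> 2 -> 3, with 3 absorbing.
demo = [(0, 1), (1, 2), (2, 3), (3, 3)]
M = learn_sr(demo, n_states=4)

# One plausible (hypothetical) reward: normalized discounted visitation
# under the demo start state, so states on the demonstrated path score high.
reward = M[0] / M[0].sum()
```

In the paper this idea is realized with function approximation and trained jointly with an off-the-shelf offline RL algorithm; the tabular version here only illustrates the SR update itself.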