Offline Hierarchical Reinforcement Learning via Inverse Optimization
Authors: Carolin Schmidt, Daniele Gammelli, James Harrison, Marco Pavone, Filipe Rodrigues
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our framework on robotic and network optimization problems and show that it substantially outperforms end-to-end RL methods and improves robustness. We investigate a variety of instantiations of our framework, both in direct deployment of policies trained offline and when online fine-tuning is performed. Through experiments on robotic tasks, supply chain inventory control, and dynamic vehicle routing, we show how our framework substantially improves the performance of off-the-shelf offline learning algorithms across a diverse set of embodiments and policy structures, while providing the safety guarantees needed for safe, real-world deployment. |
| Researcher Affiliation | Collaboration | Carolin Schmidt¹, Daniele Gammelli², James Harrison³, Marco Pavone², Filipe Rodrigues¹ — ¹Technical University of Denmark, ²Stanford University, ³Google DeepMind. EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 OHIO: Offline Hierarchical Reinforcement Learning via Inverse Optimization |
| Open Source Code | Yes | Code and data are available at https://ohio-offline-hierarchical-rl.github.io |
| Open Datasets | Yes | Code and data are available at https://ohio-offline-hierarchical-rl.github.io |
| Dataset Splits | Yes | All datasets used for this experiment consist of 250 episodes of interactions (each consisting of 1000 timesteps). To learn the dynamics model, we use a train/val split of 0.9/0.1. |
| Hardware Specification | Yes | The training of our models was executed on a Tesla V100 16 GB GPU. |
| Software Dependencies | No | No specific software dependencies with version numbers are explicitly listed in the paper. |
| Experiment Setup | Yes | Table 6: Hyperparameters of SAC — Optimizer: Adam; Learning rate: 1×10⁻³; Discount (γ): 0.97; Batch size: 100; Entropy coefficient: 0.3; Target smoothing coefficient (τ): 0.005; Target update interval: 1; Gradient steps per env. interaction: 1 |
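The reported setup details can be made concrete in code. The sketch below collects the Table 6 SAC hyperparameters into a plain config dict and derives the train/val episode counts implied by the dataset description (250 episodes, 0.9/0.1 split). The key names are illustrative, not taken from the paper's released code.

```python
# Hedged sketch: SAC hyperparameters from Table 6 as a config dict.
# Key names are assumptions for illustration, not the authors' identifiers.
sac_config = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "discount_gamma": 0.97,
    "batch_size": 100,
    "entropy_coefficient": 0.3,
    "target_smoothing_tau": 0.005,
    "target_update_interval": 1,
    "gradient_steps_per_env_interaction": 1,
}

# Dataset split implied by the paper: 250 episodes of 1000 timesteps each,
# with a 0.9/0.1 train/validation split for the dynamics model.
n_episodes = 250
n_train = int(0.9 * n_episodes)  # 225 episodes
n_val = n_episodes - n_train     # 25 episodes
print(n_train, n_val)  # 225 25
```

This only restates the numbers already quoted in the table in a machine-readable form; any actual reproduction should take the authoritative values from the released code at https://ohio-offline-hierarchical-rl.github.io.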