Offline Hierarchical Reinforcement Learning via Inverse Optimization

Authors: Carolin Schmidt, Daniele Gammelli, James Harrison, Marco Pavone, Filipe Rodrigues

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate our framework on robotic and network optimization problems and show that it substantially outperforms end-to-end RL methods and improves robustness. We investigate a variety of instantiations of our framework, both in direct deployment of policies trained offline and when online fine-tuning is performed. Through experiments on robotic tasks, supply chain inventory control, and dynamic vehicle routing, we show how our framework substantially improves the performance of off-the-shelf offline learning algorithms across a diverse set of embodiments and policy structures, while providing the safety guarantees needed for safe, real-world deployment."
Researcher Affiliation | Collaboration | Carolin Schmidt1, Daniele Gammelli2, James Harrison3, Marco Pavone2, Filipe Rodrigues1; 1Technical University of Denmark, 2Stanford University, 3Google DeepMind
Pseudocode | Yes | Algorithm 1: OHIO: Offline Hierarchical Reinforcement Learning via Inverse Optimization
Open Source Code | Yes | "Code and data are available at https://ohio-offline-hierarchical-rl.github.io"
Open Datasets | Yes | "Code and data are available at https://ohio-offline-hierarchical-rl.github.io"
Dataset Splits | Yes | "All datasets used for this experiment consist of 250 episodes of interactions (each of 1000 timesteps). To learn the dynamics model, we use a train/val split of 0.9/0.1."
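The reported split can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: it partitions 250 episode indices into the stated 0.9/0.1 train/validation split at the episode level (the seed and shuffling choice are assumptions).

```python
import random

# 250 episodes of 1000 timesteps each, as stated in the paper.
NUM_EPISODES = 250
TRAIN_FRAC = 0.9

episode_ids = list(range(NUM_EPISODES))
random.Random(0).shuffle(episode_ids)  # fixed seed: assumed, for reproducibility

split = int(TRAIN_FRAC * NUM_EPISODES)
train_ids, val_ids = episode_ids[:split], episode_ids[split:]

print(len(train_ids), len(val_ids))  # prints: 225 25
```

Splitting by whole episodes (rather than by individual timesteps) keeps temporally correlated transitions from leaking between the train and validation sets of the dynamics model.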
Hardware Specification | Yes | "The training of our models was executed on a Tesla V100 16 GB GPU."
Software Dependencies | No | No specific software dependencies with version numbers are explicitly listed in the paper.
Experiment Setup | Yes | Table 6: Hyperparameters of SAC. Optimizer: Adam; Learning rate: 1×10⁻³; Discount (γ): 0.97; Batch size: 100; Entropy coefficient: 0.3; Target smoothing coefficient (τ): 0.005; Target update interval: 1; Gradient steps per env. interaction: 1.
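For quick reference, the Table 6 values can be collected into a plain configuration dictionary. The key names below are our own labels, not identifiers from the authors' codebase.

```python
# SAC hyperparameters as reported in Table 6 of the paper.
# Key names are illustrative; only the values come from the paper.
sac_hyperparams = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,           # 1×10⁻³
    "discount_gamma": 0.97,
    "batch_size": 100,
    "entropy_coefficient": 0.3,
    "target_smoothing_tau": 0.005,
    "target_update_interval": 1,
    "gradient_steps_per_env_step": 1,
}

print(sac_hyperparams["learning_rate"])  # prints: 0.001
```

A dictionary like this maps directly onto the keyword arguments of common SAC implementations, which makes the reported setup easy to replicate.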