Directed Exploration in Reinforcement Learning from Linear Temporal Logic

Authors: Marco Bagatella, Andreas Krause, Georg Martius

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate applications of our method from tabular settings to high-dimensional continuous systems, which have so far represented a significant challenge for LTL-based reinforcement learning algorithms.
Researcher Affiliation | Academia | Marco Bagatella (EMAIL), Department of Computer Science, ETH Zürich, Zürich, Switzerland; Andreas Krause, Department of Computer Science, ETH Zürich, Zürich, Switzerland; Georg Martius, Max Planck Institute for Intelligent Systems, Tübingen, Germany
Pseudocode | Yes | Algorithm 1: DRL2
Open Source Code | Yes | Code is available at github.com/marbaga/drl2.
Open Datasets | Yes | On the top of Figure 5, a simulated Fetch robotic arm (de Lazcano et al., 2023) is evaluated on two tasks [...] In the middle, a Half Cheetah receives specifications encoding, respectively, finite sequences of positions and infinite sequences of angles for its center of mass [...] On the bottom, a 12-DoF simulated quadruped robot (Ray et al., 2019) is tasked with (i) fully traversing a narrow corridor, or with (ii) navigating through two zones in sequence.
Dataset Splits | No | The paper describes simulation environments and episode lengths for RL training rather than static datasets with explicit train/test/validation splits. For instance, in tabular settings: "Episodes have a duration equal to the minimum number of steps to reach the final zone, plus 10 steps." and "Episodes have a fixed length of 70 steps."
Hardware Specification | Yes | Each experimental run required 8 cores of a modern CPU (Intel i7 12th Gen CPU or equivalent).
Software Dependencies | No | Our codebase mostly relies on numpy (Harris et al., 2020) for numeric computation and torch (Paszke et al., 2017) for its autograd functionality. Furthermore, we partially automate the synthesis of LDBAs from LTL formulas through rabinizer (Křetínský et al., 2018). While software packages are mentioned, specific version numbers for numpy, torch, and rabinizer are not provided, only citation years.
Experiment Setup | Yes | The main hyperparameter introduced by DRL2 is α, which controls the strength of its prior as described in Section 3.2. It is set to 10^3 in tabular settings, and to 10^5 in continuous settings, where increased stability was found to be beneficial. The remaining hyperparameters for reward shaping are shared with the count-based baseline and are, respectively, the frequency of updates to the potential function (set to 2000 environment steps), and a scaling coefficient for the intrinsic reward, which was tuned individually for each method over a grid of [0.1, 1.0, 10.0]. We found a coefficient of 0.1 to be optimal across all tasks for both methods in tabular settings, while continuous settings benefit from a stronger signal and a coefficient of 10.0. [...] Hyperparameters for each algorithm were tuned to perform best when no exploration bonus is being used, and are reported in Table 1. They are kept fixed across tasks.

Table 1: Hyperparameters for Q-learning and Soft Actor-Critic.
Hyperparameter | Value
γ | 0.99
Buffer size | 4 × 10^5
Batch size | 64
Initial exploration | 2 × 10^3 steps
τ (Polyak) | 5 × 10^-3 for SAC
Learning rate | 3 × 10^-4 for SAC, 1 for Q-learning
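To make the reward-shaping setup described above concrete, the following is a minimal, hypothetical sketch (not the authors' code) of how a potential-based shaping term and a scaled intrinsic bonus might combine into the training reward. The function and variable names (shaped_reward, phi_s, bonus) are illustrative assumptions; only the numeric values (γ = 0.99, intrinsic-reward coefficient 0.1 in tabular settings) come from the report.

```python
GAMMA = 0.99           # discount factor from Table 1
INTRINSIC_SCALE = 0.1  # tuned coefficient for tabular settings (10.0 in continuous)

def shaped_reward(r_env, bonus, phi_s, phi_s_next,
                  gamma=GAMMA, scale=INTRINSIC_SCALE):
    """Combine the environment reward with a potential-based shaping term
    F(s, s') = gamma * Phi(s') - Phi(s) and a scaled intrinsic bonus.

    Potential-based shaping of this form is known to preserve the optimal
    policy (Ng et al., 1999); the intrinsic bonus is what the tuned
    scaling coefficient in the grid [0.1, 1.0, 10.0] multiplies.
    """
    shaping = gamma * phi_s_next - phi_s
    return r_env + shaping + scale * bonus
```

In practice the potential function Φ would be re-fit every 2000 environment steps, as stated in the setup, while the shaping term itself is recomputed at every transition.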