Directed Exploration in Reinforcement Learning from Linear Temporal Logic

Authors: Marco Bagatella, Andreas Krause, Georg Martius

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate applications of our method from tabular settings to high-dimensional continuous systems, which have so far represented a significant challenge for LTL-based reinforcement learning algorithms.
Researcher Affiliation | Academia | Marco Bagatella (EMAIL), Department of Computer Science, ETH Zürich, Zürich, Switzerland; Andreas Krause, Department of Computer Science, ETH Zürich, Zürich, Switzerland; Georg Martius, Max Planck Institute for Intelligent Systems, Tübingen, Germany
Pseudocode | Yes | Algorithm 1: DRL2
Open Source Code | Yes | Code is available at github.com/marbaga/drl2.
Open Datasets | Yes | On the top of Figure 5, a simulated Fetch robotic arm (de Lazcano et al., 2023) is evaluated on two tasks [...] In the middle, a Half Cheetah receives specifications encoding, respectively, finite sequences of positions and infinite sequences of angles for its center of mass [...] On the bottom, a 12-DoF simulated quadruped robot (Ray et al., 2019) is tasked with (i) fully traversing a narrow corridor, or with (ii) navigating through two zones in sequence.
Dataset Splits | No | The paper describes simulation environments and episode lengths for RL training rather than static datasets with explicit train/test/validation splits. For instance, in tabular settings: "Episodes have a duration equal to the minimum number of steps to reach the final zone, plus 10 steps." and "Episodes have a fixed length of 70 steps."
Hardware Specification | Yes | Each experimental run required 8 cores of a modern CPU (Intel i7 12th Gen CPU or equivalent).
Software Dependencies | No | Our codebase mostly relies on numpy (Harris et al., 2020) for numeric computation and torch (Paszke et al., 2017) for its autograd functionality. Furthermore, we partially automate the synthesis of LDBAs from LTL formulas through rabinizer (Křetínský et al., 2018). While software packages are mentioned, specific version numbers for numpy, torch, and rabinizer are not provided, only citation years.
Experiment Setup | Yes | The main hyperparameter introduced by DRL2 is α, which controls the strength of its prior as described in Section 3.2. It is set to 10^3 in tabular settings, and to 10^5 in continuous settings, where increased stability was found to be beneficial. The remaining hyperparameters for reward shaping are shared with the count-based baseline and are, respectively, the frequency of updates to the potential function (set to 2000 environment steps), and a scaling coefficient for the intrinsic reward, which was tuned individually for each method over a grid of [0.1, 1.0, 10.0]. We found a coefficient of 0.1 to be optimal across all tasks for both methods in tabular settings, while continuous settings benefit from a stronger signal and a coefficient of 10.0. [...] Hyperparameters for each algorithm were tuned to perform best when no exploration bonus is being used, and are reported in Table 1. They are kept fixed across tasks.

Table 1: Hyperparameters for Q-learning and Soft Actor-Critic.
Hyperparameter | Value
γ | 0.99
Buffer size | 4 × 10^5
Batch size | 64
Initial exploration | 2 × 10^3 steps
τ (Polyak) | 5 × 10^-3 for SAC
Learning rate | 3 × 10^-4 for SAC, 1 for Q-learning
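To make the reward-shaping setup described above concrete, the following is a minimal, hypothetical sketch (not the authors' code) of how a potential-based shaping term and a scaled intrinsic bonus might combine into the training reward. The function and variable names (shaped_reward, phi_s, bonus) are illustrative assumptions; only the numeric values (γ = 0.99, intrinsic-reward coefficient 0.1 in tabular settings) come from the report.

```python
GAMMA = 0.99           # discount factor from Table 1
INTRINSIC_SCALE = 0.1  # tuned coefficient for tabular settings (10.0 in continuous)

def shaped_reward(r_env, bonus, phi_s, phi_s_next,
                  gamma=GAMMA, scale=INTRINSIC_SCALE):
    """Combine the environment reward with a potential-based shaping term
    F(s, s') = gamma * Phi(s') - Phi(s) and a scaled intrinsic bonus.

    Potential-based shaping of this form is known to preserve the optimal
    policy (Ng et al., 1999); the intrinsic bonus is what the tuned
    scaling coefficient in the grid [0.1, 1.0, 10.0] multiplies.
    """
    shaping = gamma * phi_s_next - phi_s
    return r_env + shaping + scale * bonus
```

In practice the potential function Φ would be re-fit every 2000 environment steps, as stated in the setup, while the shaping term itself is recomputed at every transition.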