Regret-Free Reinforcement Learning for Temporal Logic Specifications

Authors: R Majumdar, Mahmoud Salamati, Sadegh Soudjani

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We have implemented our algorithm with empirical evaluation on reach-avoid specifications in a gridworld environment reported in App. C. In particular, we compare the performance and convergence of our algorithm to the sample complexity and convergence of a PAC-MDP algorithm for LTL (Perez et al., 2024), and show that our algorithm converges within a smaller number of episodes."
Researcher Affiliation | Academia | MPI-SWS, Kaiserslautern, Germany; University of Birmingham, Birmingham, UK.
Pseudocode | Yes | Algorithm 1: Regret-free algorithm for until formulas (Zero Reg); Algorithm 2: Extended value iteration (EVI); Algorithm 3: Computing optimistic MDP in Eq. (5) (Inner Max); Algorithm 4: Regret-free learning algorithm for general LTL specifications; Algorithm 5: Graph learning algorithm (Graph Learn); Algorithm 6: Reachability policy synthesis for graph learning (Reach)
Open Source Code | No | The paper provides no concrete access information for the source code, such as a repository link or an explicit code-release statement.
Open Datasets | No | "We considered a reach-avoid policy synthesis problem in the gridworld example described in Fig. 2."
Dataset Splits | No | The experiments use a simulated gridworld, which is an environment rather than a dataset with explicit training/validation/test splits.
Hardware Specification | Yes | "The experiments are performed on a laptop with core i7 CPU at 3.10GHz, with 8GB of RAM."
Software Dependencies | No | The paper states "We implement Alg. 1" but does not name any software or version numbers (e.g., Python, PyTorch, or specific libraries or solvers).
Experiment Setup | Yes | "We set δ = 0.1 over 10 runs. Furthermore, we group all of the cells associated with the wall into an absorbing state B, such that we have |S| = 17 and |A| = 4. The considered specification is ϕ = Avoid U Goal, where G is marked as Goal and walls are marked as Avoid. ... We set pmin = 0.01."
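To make the experiment setup concrete, the following is a minimal sketch of the reach-avoid value-iteration problem the paper evaluates: a hypothetical 4x4 gridworld (16 free cells) plus one absorbing wall state B, giving |S| = 17 and |A| = 4 as in the quoted setup. The grid layout, goal position, and slip probability are illustrative assumptions; the exact instance of Fig. 2 is not reproduced here.

```python
import numpy as np

N = 4                       # hypothetical 4x4 grid of free cells
S = N * N + 1               # 16 free cells + absorbing wall state B -> |S| = 17
B = N * N                   # index of the absorbing wall state (violates Avoid)
GOAL = N * N - 1            # bottom-right cell as Goal (assumption)
SLIP = 0.1                  # slip-into-wall probability (assumption)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right -> |A| = 4

def step_dist(s, a):
    """Transition distribution for one state-action pair."""
    if s in (B, GOAL):
        return {s: 1.0}                        # both states are absorbing
    r, c = divmod(s, N)
    dr, dc = ACTIONS[a]
    nr, nc = r + dr, c + dc
    if 0 <= nr < N and 0 <= nc < N:
        return {nr * N + nc: 1.0 - SLIP, B: SLIP}
    return {B: 1.0}                            # stepping off the grid hits a wall

def max_reach_prob(tol=1e-8):
    """Value iteration for Pr_max(Avoid U Goal); entering B falsifies Avoid."""
    V = np.zeros(S)
    V[GOAL] = 1.0
    while True:
        Vn = V.copy()
        for s in range(S):
            if s in (B, GOAL):
                continue
            Vn[s] = max(sum(p * V[t] for t, p in step_dist(s, a).items())
                        for a in range(len(ACTIONS)))
        if np.max(np.abs(Vn - V)) < tol:
            return Vn
        V = Vn

V = max_reach_prob()
```

Under these assumptions, the optimal satisfaction probability from a cell decays geometrically with its Manhattan distance to the goal, since each in-bounds move succeeds with probability 1 - SLIP.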
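The pseudocode list above names Algorithm 3 (Inner Max) for computing the optimistic MDP inside extended value iteration. The following sketch shows the standard L1-ball inner maximization from the UCRL2 line of work (Jaksch et al., 2010), which plays the same role; it is not the paper's implementation, and the function name and interface are hypothetical.

```python
def optimistic_dist(p_hat, conf, V):
    """Maximize sum_s p(s) * V(s) over an L1 ball of radius `conf` around the
    empirical distribution `p_hat` (standard EVI subroutine; illustrative only).

    Shift up to conf/2 probability mass onto the highest-value successor, then
    restore normalization by removing mass from low-value successors first.
    """
    p = list(p_hat)
    best = max(range(len(V)), key=lambda s: V[s])
    p[best] = min(1.0, p_hat[best] + conf / 2.0)
    # Remove the excess mass starting from the lowest-value states.
    order = sorted((s for s in range(len(V)) if s != best), key=lambda s: V[s])
    excess = sum(p) - 1.0
    for s in order:
        cut = min(p[s], excess)
        p[s] -= cut
        excess -= cut
        if excess <= 0:
            break
    return p
```

For example, with empirical distribution [0.5, 0.3, 0.2], radius 0.4, and values [0, 1, 2], the routine moves 0.2 of mass from the lowest-value state onto the highest-value one, returning a distribution that still sums to one.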