LTL-Constrained Policy Optimization with Cycle Experience Replay

Authors: Ameesh Shah, Cameron Voloshin, Chenxi Yang, Abhinav Verma, Swarat Chaudhuri, Sanjit A. Seshia

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate CyclER in three continuous control domains. Our experimental results show that optimizing CyclER in tandem with the existing scalar reward outperforms existing reward-shaping methods at finding performant LTL-satisfying policies.
Researcher Affiliation | Collaboration | Ameesh Shah (EMAIL, UC Berkeley); Cameron Voloshin (EMAIL, Latitude AI); Chenxi Yang (EMAIL, UT Austin); Abhinav Verma (EMAIL, Penn State University); Swarat Chaudhuri (EMAIL, UT Austin); Sanjit A. Seshia (EMAIL, UC Berkeley)
Pseudocode | Yes | Algorithm 1: Cycle Experience Replay (CyclER)
Open Source Code | No | The paper does not provide an explicit statement or link indicating that its code is open-sourced. It mentions using existing tools such as PPO and the Spot tool, but not its own implementation code.
Open Datasets | Yes | We use the Zones environment from the MuJoCo-based Safety-Gymnasium suite of environments (Ji et al., 2023). We use the Buttons environment, also from Safety-Gymnasium.
Dataset Splits | No | The paper does not provide explicit training/validation/test dataset splits, as it primarily uses reinforcement learning in simulation environments where data is generated through agent interaction rather than drawn from predefined static datasets.
Hardware Specification | Yes | All experiments were done on a 10-core Intel Core i9 processor with an NVIDIA RTX A4500 GPU.
Software Dependencies | No | The paper mentions using "entropy-regularized PPO", the Adam optimizer, and the Spot tool (Duret-Lutz et al., 2022), but does not specify version numbers for any of these software dependencies.
Experiment Setup | Yes | We provide hyperparameter choices for PPO for each experiment in Table 6 and choices for λ in Table 4. In Table 6, batch size refers to the number of trajectories. In our PPO implementation, we use a 3-layer, 64-hidden-unit network with ReLU activations as the actor, and a 3-layer, 64-hidden-unit network with tanh activations between layers and no final activation function as the critic.
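The reported actor/critic architectures can be sketched as plain forward passes. This is a minimal illustration, not the authors' implementation: only the layer counts, hidden widths, and activations come from the paper; the observation/action dimensions, weight initialization, and use of numpy are assumptions for the sake of a self-contained example.

```python
import numpy as np

def mlp_forward(x, weights, biases, hidden_act):
    """3-layer MLP: activation between layers, no activation on the final output."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        if i < len(weights) - 1:
            h = hidden_act(h)
    return h

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(0)
obs_dim, act_dim, hidden = 8, 2, 64  # obs/act dims are illustrative assumptions

# Actor: 3 layers, 64 hidden units, ReLU activations (per the paper)
actor_shapes = [(obs_dim, hidden), (hidden, hidden), (hidden, act_dim)]
actor_W = [rng.normal(size=s) * 0.1 for s in actor_shapes]
actor_b = [np.zeros(s[1]) for s in actor_shapes]

# Critic: 3 layers, 64 hidden units, tanh between layers, no final activation
critic_shapes = [(obs_dim, hidden), (hidden, hidden), (hidden, 1)]
critic_W = [rng.normal(size=s) * 0.1 for s in critic_shapes]
critic_b = [np.zeros(s[1]) for s in critic_shapes]

obs = rng.normal(size=obs_dim)
action_mean = mlp_forward(obs, actor_W, actor_b, relu)   # shape (act_dim,)
value = mlp_forward(obs, critic_W, critic_b, np.tanh)    # shape (1,)
```

In actual PPO training these outputs would parameterize a Gaussian action distribution and a state-value estimate, respectively; the hyperparameters themselves are listed in Tables 4 and 6 of the paper.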