LTL-Constrained Policy Optimization with Cycle Experience Replay
Authors: Ameesh Shah, Cameron Voloshin, Chenxi Yang, Abhinav Verma, Swarat Chaudhuri, Sanjit A. Seshia
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CyclER in three continuous control domains. Our experimental results show that optimizing CyclER in tandem with the existing scalar reward outperforms existing reward-shaping methods at finding performant LTL-satisfying policies. |
| Researcher Affiliation | Collaboration | Ameesh Shah (EMAIL) UC Berkeley; Cameron Voloshin (EMAIL) Latitude AI; Chenxi Yang (EMAIL) UT Austin; Abhinav Verma (EMAIL) Penn State University; Swarat Chaudhuri (EMAIL) UT Austin; Sanjit A. Seshia (EMAIL) UC Berkeley |
| Pseudocode | Yes | Algorithm 1: Cycle Experience Replay (CyclER) |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of their code. It mentions using existing tools like PPO and the Spot tool, but not their own implementation code. |
| Open Datasets | Yes | We use the Zones environment from the MuJoCo-based Safety-Gymnasium suite of environments (Ji et al., 2023). We use the Buttons environment, also from Safety-Gymnasium. |
| Dataset Splits | No | The paper does not provide explicit training/test/validation dataset splits, as it primarily uses reinforcement learning in simulation environments where data is generated through agent interaction rather than predefined static datasets. |
| Hardware Specification | Yes | All experiments were done on an Intel Core i9 processor with 10 cores equipped with an NVIDIA RTX A4500 GPU. |
| Software Dependencies | No | The paper mentions using "entropy-regularized PPO" and the "Adam optimizer" but does not provide specific version numbers for these software libraries. It also mentions the Spot tool (Duret-Lutz et al., 2022), but without the specific version used in their experiments. |
| Experiment Setup | Yes | We provide hyperparameter choices for PPO for each experiment in Table 6 and choices for λ in Table 4. In Table 6, batch size refers to the number of trajectories. In our PPO implementation, we use a 3-layer, 64-hidden unit network as the actor using ReLU activations, and a 3-layer, 64-hidden unit network architecture with tanh activations in between layers and no final activation function for the critic. |
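The network architecture quoted in the Experiment Setup row can be sketched as a minimal numpy implementation. This is an illustration of one plausible reading of the paper's description, not the authors' code: it assumes "3-layer, 64-hidden unit" means three hidden layers of 64 units each, and that the actor has no final activation (the paper specifies this only for the critic).

```python
import numpy as np

def init_mlp(sizes, rng):
    """He-style initialization for a fully connected network (choice assumed, not from the paper)."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x, hidden_act):
    """Apply hidden_act between layers; no activation after the final layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = hidden_act(x)
    return x

rng = np.random.default_rng(0)
obs_dim, act_dim = 8, 2  # hypothetical dimensions for illustration

# Actor: three 64-unit hidden layers with ReLU activations.
actor = init_mlp([obs_dim, 64, 64, 64, act_dim], rng)
# Critic: three 64-unit hidden layers with tanh between layers, no final activation.
critic = init_mlp([obs_dim, 64, 64, 64, 1], rng)

obs = rng.standard_normal(obs_dim)
action_out = forward(actor, obs, lambda x: np.maximum(x, 0.0))
value = forward(critic, obs, np.tanh)
```

In a full PPO setup these outputs would parameterize the policy distribution and the state-value estimate, respectively; the optimizer (Adam) and PPO hyperparameters are those the paper reports in Tables 4 and 6.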