Leveraging Constraint Violation Signals for Action Constrained Reinforcement Learning
Authors: Janaka Chathuranga Brahmanage, Jiajing Ling, Akshat Kumar
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our approach has significantly fewer constraint violations while achieving similar or better quality in several control tasks than previous best methods. Section 4 Experimental Results: We evaluate our approach on four MuJoCo (Todorov, Erez, and Tassa 2012) continuous control environments... Reward comparisons: Evaluation returns are computed by running five episodes per random seed every 5k training steps. Figure 3 shows that our approach SAC+CVFlow achieves comparable results... Table 1: The percentage of constraint violations during RL training. |
| Researcher Affiliation | Academia | School of Computing and Information Systems, Singapore Management University. EMAIL, EMAIL |
| Pseudocode | Yes | The pseudo-code of our proposed approach to training the CV-Flows is provided in Algorithm 1. Algorithm 1: CV-Flows Pretraining Algorithm |
| Open Source Code | Yes | Code https://github.com/rlr-smu/cv-flow |
| Open Datasets | Yes | We evaluate our approach on four MuJoCo (Todorov, Erez, and Tassa 2012) continuous control environments: Reacher (R), Hopper (H), Walker2D (W), and HalfCheetah (HC). We evaluate our approach on four continuous control tasks with state-wise constraints: Ball1D, Ball3D, Space-Corridor, and Space-Arena, as proposed in previous work (Dalal et al. 2018). |
| Dataset Splits | No | The paper does not provide specific percentages or counts for training, validation, and test dataset splits. It mentions 'running five episodes per random seed every 5k training steps' for evaluation, which describes experiment execution rather than data partitioning. |
| Hardware Specification | No | Runtime: ... Results for timesteps per second on other tasks can be found in Figure 8 of the supplementary material, along with computing infrastructure details. The main text itself does not specify hardware details. |
| Software Dependencies | No | The paper mentions several software components like PyTorch, SAC, DDPG, and various environments, but it does not specify any version numbers for these software dependencies in the main text. |
| Experiment Setup | No | Each algorithm is trained with 10 random seeds, capped at 48 hours per run, using hyperparameters and architectures from (Kasaura et al. 2023) (details in supplementary material). The specific hyperparameter values are referred to an external paper and supplementary material, not detailed in the main text. |
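The evaluation protocol quoted above (five episodes per random seed, aggregated across 10 seeds, with constraint-violation percentages as in Table 1) can be sketched as follows. This is a minimal illustrative aggregation, not the paper's code; the function names and the toy numbers are assumptions.

```python
def aggregate_returns(returns_by_seed):
    """Mean and std of evaluation return across seeds,
    where each seed's value is the mean over its (e.g. five)
    evaluation episodes."""
    seed_means = [sum(eps) / len(eps) for eps in returns_by_seed.values()]
    mean = sum(seed_means) / len(seed_means)
    var = sum((m - mean) ** 2 for m in seed_means) / len(seed_means)
    return mean, var ** 0.5

def violation_rate(num_violations, num_steps):
    """Percentage of training steps with a constraint violation,
    as reported in Table 1."""
    return 100.0 * num_violations / num_steps

# Illustrative: two seeds, five evaluation episodes each.
mean, std = aggregate_returns({0: [1.0, 2.0, 3.0, 4.0, 5.0],
                               1: [3.0, 3.0, 3.0, 3.0, 3.0]})
```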