Leveraging Constraint Violation Signals for Action Constrained Reinforcement Learning

Authors: Janaka Chathuranga Brahmanage, Jiajing Ling, Akshat Kumar

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, our approach has significantly fewer constraint violations while achieving similar or better quality in several control tasks than previous best methods." Section 4, Experimental Results: "We evaluate our approach on four MuJoCo (Todorov, Erez, and Tassa 2012) continuous control environments..." Reward comparisons: evaluation returns are computed by running five episodes per random seed every 5k training steps. Figure 3 shows that the proposed approach, SAC+CVFlow, achieves comparable results. Table 1 reports the percentage of constraint violations during RL training.
Researcher Affiliation | Academia | School of Computing and Information Systems, Singapore Management University. EMAIL, EMAIL
Pseudocode | Yes | "The pseudo-code of our proposed approach to training the CV-Flows is provided in Algorithm 1." (Algorithm 1: CV-Flows Pretraining Algorithm)
Open Source Code | Yes | Code: https://github.com/rlr-smu/cv-flow
Open Datasets | Yes | "We evaluate our approach on four MuJoCo (Todorov, Erez, and Tassa 2012) continuous control environments: Reacher (R), Hopper (H), Walker2D (W), and HalfCheetah (HC)." The paper also evaluates on four continuous control tasks with state-wise constraints: Ball1D, Ball3D, Space-Corridor, and Space-Arena, as proposed in previous work (Dalal et al. 2018).
Dataset Splits | No | The paper does not provide specific percentages or counts for training, validation, and test splits. It mentions "running five episodes per random seed every 5k training steps" for evaluation, which describes experiment execution rather than data partitioning.
Hardware Specification | No | Runtime: "...Results for timesteps per second on other tasks can be found in Figure 8 of the supplementary material, along with computing infrastructure details." The main text itself does not specify hardware details.
Software Dependencies | No | The paper mentions several software components (PyTorch, SAC, DDPG, and various environments) but does not specify version numbers for any of these dependencies in the main text.
Experiment Setup | No | "Each algorithm is trained with 10 random seeds, capped at 48 hours per run, using hyperparameters and architectures from (Kasaura et al. 2023) (details in supplementary material)." The specific hyperparameter values are deferred to an external paper and the supplementary material rather than detailed in the main text.
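The reported protocol (10 random seeds, a 48-hour wall-clock cap per run, and evaluation returns averaged over five episodes every 5k training steps) can be sketched as a minimal harness. This is an illustrative reconstruction, not the authors' code: `run_episode` is a hypothetical stand-in for a real environment rollout, and the training update is elided.

```python
import random
import time

EVAL_INTERVAL = 5_000          # evaluate every 5k training steps (as reported)
EPISODES_PER_EVAL = 5          # five evaluation episodes per random seed
TIME_CAP_SECONDS = 48 * 3600   # 48-hour wall-clock cap per run

def run_episode(rng):
    """Hypothetical stand-in for one evaluation rollout; returns a scalar return."""
    return rng.random()

def train_one_seed(seed, total_steps=20_000):
    """Run one seed, logging the mean evaluation return every EVAL_INTERVAL steps."""
    rng = random.Random(seed)
    start = time.monotonic()
    eval_log = {}
    for step in range(1, total_steps + 1):
        if time.monotonic() - start > TIME_CAP_SECONDS:
            break  # enforce the per-run wall-clock cap
        # ... one RL training update would go here ...
        if step % EVAL_INTERVAL == 0:
            returns = [run_episode(rng) for _ in range(EPISODES_PER_EVAL)]
            eval_log[step] = sum(returns) / len(returns)
    return eval_log

# Aggregate over 10 random seeds, matching the reported setup.
logs = {seed: train_one_seed(seed) for seed in range(10)}
```

Logging per-seed evaluation curves this way is what makes the mean-and-spread plots (as in the paper's Figure 3) reproducible across seeds.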