Efficient Action-Constrained Reinforcement Learning via Acceptance-Rejection Method and Augmented MDPs
Authors: Wei Hung, Shao-Hua Sun, Ping-Chun Hsieh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments in both robot control and resource allocation domains, we demonstrate that the proposed framework enjoys faster training progress, better constraint satisfaction, and a lower action inference time simultaneously than the state-of-the-art ACRL methods. We evaluate the proposed ARAM in various ACRL benchmarks, including the MuJoCo locomotion tasks and resource allocation of communication networks and bike sharing systems. The experimental results show that: (i) ARAM enjoys faster learning progress than the state-of-the-art ACRL methods... |
| Researcher Affiliation | Academia | 1National Yang Ming Chiao Tung University, Hsinchu, Taiwan; 2National Taiwan University, Taipei, Taiwan |
| Pseudocode | Yes | Algorithm 1: Practical Implementation of ARAM |
| Open Source Code | Yes | We have made the source code publicly available* to encourage further research in this direction. *https://github.com/NYCU-RL-Bandits-Lab/ARAM |
| Open Datasets | Yes | We evaluate the algorithms in various benchmark domains widely used in the ACRL literature (Lin et al., 2021; Kasaura et al., 2023; Brahmanage et al., 2023): (i) MuJoCo locomotion tasks (Todorov et al., 2012): These tasks involve training robots to achieve specified goals... (ii) Resource allocation for networked systems: These tasks involve properly allocating resource under capacity constraints, including NSFnet and Bike Sharing System (BSS) (Ghosh & Varakantham, 2017). For NSFnet, ... we use the open-source network simulator from PCCRL (Jay et al., 2019). |
| Dataset Splits | No | The paper describes the configuration of various environments (MuJoCo, NSFnet, BSS3z, BSS5z) used for reinforcement learning training and evaluation. For instance, for BSS3z, it specifies 'n = 3 and m = 90 and each station has a capacity of 40 bikes'. However, it does not provide explicit details about how any dataset (if conceptually treated as such for RL) is split into distinct training, validation, or testing sets in the traditional supervised learning sense. |
| Hardware Specification | No | To ensure fair measurements of wall clock time, we run each algorithm independently using the same computing device. We also thank the National Center for High-performance Computing (NCHC) for providing computational and storage resources. The paper mentions using a 'computing device' and 'computational and storage resources' from NCHC but does not specify any particular GPU or CPU models, or detailed computer specifications used for the experiments. |
| Software Dependencies | No | Our implementation is based on Q-Pensieve (Hung et al., 2023). This work adopts SAC (Haarnoja et al., 2018) as the backbone to showcase how to integrate the proposed modifications into an existing deep RL algorithm. We use the experimental settings and objectives provided by OpenAI Gym V3 to control the agents in these environments. For NSFnet, ... we use the open-source network simulator from PCCRL (Jay et al., 2019). Table 7 lists 'Optimizer Adam'. While specific software components and libraries are mentioned, none of them include explicit version numbers, which are required for a reproducible description of software dependencies. |
| Experiment Setup | Yes | Table 7 provides an overview of the hyperparameters used in ARAM: Optimizer: Adam; Learning Rate: 0.0003; Discount Factor: 0.99; Replay Buffer Size: 1,000,000; Number of Hidden Units per Layer: [256, 256]; Number of Samples per Minibatch: 256; Nonlinearity: ReLU; Target Smoothing Coefficient: 0.005; Target Update Interval: 1; Gradient Steps: 1; Sample Ratio for Augmented Replay Buffer (η): 0.2; Decay Interval for η: 10,000; Decay Factor for η: 0.9 |
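For anyone re-running the experiments, the Table 7 hyperparameters can be collected into a single configuration mapping. This is a minimal sketch: the key names and the `eta_at_step` helper are illustrative (the multiplicative decay-per-interval schedule for η is inferred from the listed decay interval and decay factor, not confirmed against the ARAM codebase):

```python
# Hyperparameters quoted from Table 7 of the paper, gathered into one dict.
# Key names are illustrative and may not match the ARAM source code.
ARAM_HYPERPARAMS = {
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "discount_factor": 0.99,
    "replay_buffer_size": 1_000_000,
    "hidden_units_per_layer": [256, 256],
    "minibatch_size": 256,
    "nonlinearity": "ReLU",
    "target_smoothing_coefficient": 0.005,
    "target_update_interval": 1,
    "gradient_steps": 1,
    "augmented_buffer_sample_ratio_eta": 0.2,
    "eta_decay_interval": 10_000,
    "eta_decay_factor": 0.9,
}

def eta_at_step(step: int,
                init_eta: float = 0.2,
                decay_interval: int = 10_000,
                decay_factor: float = 0.9) -> float:
    """Sample ratio eta for the augmented replay buffer at a given step,
    assuming eta is multiplied by the decay factor once per decay interval
    (an interpretation of Table 7's 'Decay Interval' and 'Decay Factor')."""
    return init_eta * decay_factor ** (step // decay_interval)
```

Under this reading, η starts at 0.2 and shrinks geometrically, e.g. `eta_at_step(20_000)` gives 0.2 × 0.9² = 0.162.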