Efficient Action-Constrained Reinforcement Learning via Acceptance-Rejection Method and Augmented MDPs
Authors: Wei Hung, Shao-Hua Sun, Ping-Chun Hsieh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments in both robot control and resource allocation domains, we demonstrate that the proposed framework enjoys faster training progress, better constraint satisfaction, and a lower action inference time simultaneously than the state-of-the-art ACRL methods. We evaluate the proposed ARAM in various ACRL benchmarks, including the MuJoCo locomotion tasks and resource allocation of communication networks and bike sharing systems. The experimental results show that: (i) ARAM enjoys faster learning progress than the state-of-the-art ACRL methods... |
| Researcher Affiliation | Academia | 1National Yang Ming Chiao Tung University, Hsinchu, Taiwan; 2National Taiwan University, Taipei, Taiwan |
| Pseudocode | Yes | Algorithm 1: Practical Implementation of ARAM |
| Open Source Code | Yes | We have made the source code publicly available* to encourage further research in this direction. *https://github.com/NYCU-RL-Bandits-Lab/ARAM |
| Open Datasets | Yes | We evaluate the algorithms in various benchmark domains widely used in the ACRL literature (Lin et al., 2021; Kasaura et al., 2023; Brahmanage et al., 2023): (i) MuJoCo locomotion tasks (Todorov et al., 2012): These tasks involve training robots to achieve specified goals... (ii) Resource allocation for networked systems: These tasks involve properly allocating resource under capacity constraints, including NSFnet and Bike Sharing System (BSS) (Ghosh & Varakantham, 2017). For NSFnet, ... we use the open-source network simulator from PCCRL (Jay et al., 2019). |
| Dataset Splits | No | The paper describes the configuration of various environments (MuJoCo, NSFnet, BSS3z, BSS5z) used for reinforcement learning training and evaluation. For instance, for BSS3z, it specifies 'n = 3 and m = 90 and each station has a capacity of 40 bikes'. However, it does not provide explicit details about how any dataset (if conceptually treated as such for RL) is split into distinct training, validation, or testing sets in the traditional supervised learning sense. |
| Hardware Specification | No | To ensure fair measurements of wall clock time, we run each algorithm independently using the same computing device. We also thank the National Center for High-performance Computing (NCHC) for providing computational and storage resources. The paper mentions using a 'computing device' and 'computational and storage resources' from NCHC but does not specify any particular GPU or CPU models, or detailed computer specifications used for the experiments. |
| Software Dependencies | No | Our implementation is based on Q-Pensieve (Hung et al., 2023). This work adopts SAC (Haarnoja et al., 2018) as the backbone to showcase how to integrate the proposed modifications into an existing deep RL algorithm. We use the experimental settings and objectives provided by OpenAI Gym V3 to control the agents in these environments. For NSFnet, ... we use the open-source network simulator from PCCRL (Jay et al., 2019). Table 7 lists 'Optimizer Adam'. While specific software components and libraries are mentioned, none of them include explicit version numbers, which are required for a reproducible description of software dependencies. |
| Experiment Setup | Yes | Table 7 provides an overview of the hyperparameters used in ARAM: Optimizer: Adam; Learning Rate: 0.0003; Discount Factor: 0.99; Replay Buffer Size: 1,000,000; Number of Hidden Units per Layer: [256, 256]; Number of Samples per Minibatch: 256; Nonlinearity: ReLU; Target Smoothing Coefficient: 0.005; Target Update Interval: 1; Gradient Steps: 1; Sample Ratio for Augmented Replay Buffer (η): 0.2; Decay Interval for η: 10,000; Decay Factor for η: 0.9 |
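For anyone re-running the experiments, the Table 7 hyperparameters can be collected into a single configuration mapping. This is a minimal sketch: the key names and the `eta_at_step` helper are illustrative (the multiplicative decay-per-interval schedule for η is inferred from the listed decay interval and decay factor, not confirmed against the ARAM codebase):

```python
# Hyperparameters quoted from Table 7 of the paper, gathered into one dict.
# Key names are illustrative and may not match the ARAM source code.
ARAM_HYPERPARAMS = {
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "discount_factor": 0.99,
    "replay_buffer_size": 1_000_000,
    "hidden_units_per_layer": [256, 256],
    "minibatch_size": 256,
    "nonlinearity": "ReLU",
    "target_smoothing_coefficient": 0.005,
    "target_update_interval": 1,
    "gradient_steps": 1,
    "augmented_buffer_sample_ratio_eta": 0.2,
    "eta_decay_interval": 10_000,
    "eta_decay_factor": 0.9,
}

def eta_at_step(step: int,
                init_eta: float = 0.2,
                decay_interval: int = 10_000,
                decay_factor: float = 0.9) -> float:
    """Sample ratio eta for the augmented replay buffer at a given step,
    assuming eta is multiplied by the decay factor once per decay interval
    (an interpretation of Table 7's 'Decay Interval' and 'Decay Factor')."""
    return init_eta * decay_factor ** (step // decay_interval)
```

Under this reading, η starts at 0.2 and shrinks geometrically, e.g. `eta_at_step(20_000)` gives 0.2 × 0.9² = 0.162.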