Concave Utility Reinforcement Learning with Zero-Constraint Violations

Authors: Mridul Agarwal, Qinbo Bai, Vaneet Aggarwal

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 6 Simulation Results. To validate the performance of the UC-CURL algorithm and the PS-CURL algorithm, we run the simulation on the flow and service control in a single-serve queue, which was introduced in (Altman & Schwartz, 1991). Along with validating the performance of the proposed algorithms, we also compare the algorithms against the algorithms proposed in (Singh et al., 2020) and in (Chen et al., 2022) for model-based constrained reinforcement learning for infinite horizon MDPs... The experiments were run on a 36 core Intel-i9 CPU @3.00 GHz with 64 GB of RAM. The result is shown in the Figure 1.
Researcher Affiliation | Academia | Mridul Agarwal (Purdue University), Qinbo Bai (Purdue University), Vaneet Aggarwal (Purdue University)
Pseudocode | Yes | Algorithm 1 UC-CURL. Parameters: K. Input: S, A, r, d, c_i ∀i ∈ [d]... Algorithm 2 PS-CURL. Parameters: K. Input: S, A, r, d, c_i ∀i ∈ [d]
Open Source Code | No | The paper does not contain any explicit statements about providing open-source code for the methodology described.
Open Datasets | No | The paper uses a simulated environment for experiments, described as 'flow and service control in a single-serve queue, which was introduced in (Altman & Schwartz, 1991)'. It specifies environment parameters and reward/cost functions within the paper (e.g., 'In the simulation, the length of the buffer is set as L = 5'). It does not use or provide a publicly available dataset.
Dataset Splits | No | The paper describes experiments in a simulated environment over a 'length of horizon T = 5 * 10^5' and running '50 independent simulations'. This involves online interaction with an environment rather than using pre-defined splits of a static dataset.
Hardware Specification | Yes | The experiments were run on a 36 core Intel-i9 CPU @3.00 GHz with 64 GB of RAM.
Software Dependencies | No | The paper mentions 'coded easily in CVXPY' but does not provide a specific version number for CVXPY or any other software dependencies.
Experiment Setup | Yes | In the simulation, the length of the buffer is set as L = 5. The service action space is set as [0.2, 0.4, 0.6, 0.8] and the flow action space is set as [0.4, 0.5, 0.6, 0.7]... We use the length of horizon T = 5 * 10^5 and run 50 independent simulations of all algorithms. For our implementation, we choose the value of parameter K in Algorithm 1 as K = 1... We set the value of the learning rate θ for online mirror descent as 5 * 10^-2 with an episode length of 5 * 10^3.
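
The Experiment Setup values quoted above can be gathered into a single configuration object for anyone attempting a reproduction. The following is a minimal sketch, not the authors' code: the class name `ExperimentSetup`, its field names, and the helper methods are hypothetical, and treating each action as a (service, flow) pair is an assumption about how the two action spaces combine. Only the numeric values come from the quoted text.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ExperimentSetup:
    """Hypothetical container for the values quoted in the Experiment Setup row."""

    buffer_length: int = 5                         # L = 5
    service_actions: tuple = (0.2, 0.4, 0.6, 0.8)  # service action space
    flow_actions: tuple = (0.4, 0.5, 0.6, 0.7)     # flow action space
    horizon: int = 5 * 10**5                       # T = 5 * 10^5
    num_runs: int = 50                             # independent simulations
    K: int = 1                                     # parameter K in Algorithm 1
    omd_learning_rate: float = 5e-2                # theta for online mirror descent
    omd_episode_length: int = 5 * 10**3            # episode length for online mirror descent

    def joint_actions(self):
        # Assumption: an action is a (service, flow) pair drawn from the two spaces.
        return list(product(self.service_actions, self.flow_actions))

    def num_omd_episodes(self):
        # Number of fixed-length episodes that fit in the horizon.
        return self.horizon // self.omd_episode_length


setup = ExperimentSetup()
print(len(setup.joint_actions()))  # 4 * 4 = 16 joint actions under the pairing assumption
print(setup.num_omd_episodes())    # 500000 // 5000 = 100 episodes
```

Under these values, the online mirror descent setting corresponds to 100 episodes of 5 * 10^3 steps each over the horizon T = 5 * 10^5.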