reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Concave Utility Reinforcement Learning with Zero-Constraint Violations

Authors: Mridul Agarwal, Qinbo Bai, Vaneet Aggarwal

TMLR 2022

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	6 Simulation Results To validate the performance of the UC-CURL algorithm and the PS-CURL algorithm, we run the simulation on the flow and service control in a single-serve queue, which was introduced in (Altman & Schwartz, 1991). Along with validating the performance of the proposed algorithms, we also compare the algorithms against the algorithms proposed in (Singh et al., 2020) and in (Chen et al., 2022) for model-based constrained reinforcement learning for infinite horizon MDPs... The experiments were run on a 36 core Intel-i9 CPU @3.00 GHz with 64 GB of RAM. The result is shown in the Figure 1.
Researcher Affiliation	Academia	Mridul Agarwal, Purdue University; Qinbo Bai, Purdue University; Vaneet Aggarwal, Purdue University
Pseudocode	Yes	Algorithm 1 UC-CURL Parameters: K Input: S, A, r, d, c_i ∀ i ∈ [d]... Algorithm 2 PS-CURL Parameters: K Input: S, A, r, d, c_i ∀ i ∈ [d]
Open Source Code	No	The paper does not contain any explicit statements about providing open-source code for the methodology described.
Open Datasets	No	The paper uses a simulated environment for experiments, described as 'flow and service control in a single-serve queue, which was introduced in (Altman & Schwartz, 1991)'. It specifies environment parameters and reward/cost functions within the paper (e.g., 'In the simulation, the length of the buffer is set as L = 5'). It does not use or provide a publicly available dataset.
Dataset Splits	No	The paper describes experiments in a simulated environment over a 'length of horizon T = 5 * 10^5' and running '50 independent simulations'. This involves online interaction with an environment rather than using pre-defined splits of a static dataset.
Hardware Specification	Yes	The experiments were run on a 36 core Intel-i9 CPU @3.00 GHz with 64 GB of RAM.
Software Dependencies	No	The paper mentions 'coded easily in CVXPY' but does not provide a specific version number for CVXPY or any other software dependencies.
Experiment Setup	Yes	In the simulation, the length of the buffer is set as L = 5. The service action space is set as [0.2, 0.4, 0.6, 0.8] and the flow action space is set as [0.4, 0.5, 0.6, 0.7]... We use the length of horizon T = 5 * 10^5 and run 50 independent simulations of all algorithms. For our implementation, we choose the value of parameter K in Algorithm 1 as K = 1... We set the value of the learning rate θ for online mirror descent as 5 * 10^-2 with an episode length of 5 * 10^3.
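The parameters quoted in the Experiment Setup row are enough to sketch the shape of the simulated environment, though not its exact definition. The sketch below is a minimal, hypothetical illustration and is not the authors' code. The buffer length, action sets, horizon, run count, and episode length come from the row above. The transition rule is an assumption: a packet arrives with the chosen flow probability and departs with the chosen service probability, independently, and the queue length is clipped to [0, L]. The function names `step`, `rollout`, and `num_episodes` are also hypothetical. The reward and cost functions are omitted because this listing does not reproduce them, and the uniformly random policy is only a placeholder for UC-CURL or PS-CURL. If the 5 * 10^3 episode length spans the whole horizon, T = 5 * 10^5 divides into 100 episodes.

```python
import random

# Values quoted in the Experiment Setup row.
BUFFER_LENGTH = 5                        # L = 5, so queue lengths 0..5 (6 states)
SERVICE_ACTIONS = [0.2, 0.4, 0.6, 0.8]   # service action space
FLOW_ACTIONS = [0.4, 0.5, 0.6, 0.7]      # flow action space
HORIZON = 5 * 10**5                      # T
EPISODE_LENGTH = 5 * 10**3               # episode length paired with the OMD learning rate
NUM_RUNS = 50                            # independent simulations


def num_episodes(horizon: int, episode_length: int) -> int:
    """Number of complete episodes of the given length that fit in the horizon."""
    return horizon // episode_length


def step(queue_len: int, service_prob: float, flow_prob: float,
         rng: random.Random) -> int:
    """One transition of a HYPOTHETICAL discrete-time single-server queue.

    ASSUMPTION (not taken from the paper): an arrival occurs w.p. flow_prob and a
    departure w.p. service_prob, independently; an empty queue cannot serve, and
    an arrival to a full buffer is dropped (the length is clipped to L).
    """
    arrival = rng.random() < flow_prob
    departure = queue_len > 0 and rng.random() < service_prob
    next_len = queue_len + int(arrival) - int(departure)
    return max(0, min(BUFFER_LENGTH, next_len))


def rollout(steps: int, seed: int = 0) -> list[int]:
    """Simulate the queue under a uniformly random placeholder policy."""
    rng = random.Random(seed)
    state = 0
    trajectory = [state]
    for _ in range(steps):
        service = rng.choice(SERVICE_ACTIONS)
        flow = rng.choice(FLOW_ACTIONS)
        state = step(state, service, flow, rng)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    print("episodes per run:", num_episodes(HORIZON, EPISODE_LENGTH))
    print("first states:", rollout(10, seed=1))
```

Replacing the random choices in `rollout` with a learned policy, and adding the paper's reward and cost functions, would be needed before this could stand in for the reported experiments.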