Policy-based Primal-Dual Methods for Concave CMDP with Variance Reduction
Authors: Donghao Ying, Mengzi Amy Guo, Hyunin Lee, Yuhao Ding, Javad Lavaei, Zuo-Jun Max Shen
JAIR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we validate our methods through numerical experiments. ... The experiments are conducted in two grid world environments: 8 × 8 and 20 × 20, where the agent starts at the upper left corner (marked as S in Figure 1a) and aims to reach a goal located at the bottom right corner (marked as G in Figure 1a). ... The results shown in Figures 1 and 2 demonstrate the significant effectiveness of VR-PDPG. |
| Researcher Affiliation | Collaboration | DONGHAO YING, University of California, Berkeley, USA MENGZI AMY GUO, University of California, Berkeley, USA HYUNIN LEE, University of California, Berkeley, USA YUHAO DING, Cubist Systematic Strategies, USA JAVAD LAVAEI, University of California, Berkeley, USA ZUO-JUN MAX SHEN, University of California, Berkeley, USA |
| Pseudocode | Yes | Algorithm 1 Variance-Reduced Primal-Dual Policy Gradient Algorithm (VR-PDPG) |
| Open Source Code | Yes | Our experiment codes are available at https://github.com/hyunin-lee/VR-PDPG |
| Open Datasets | No | We evaluate Algorithm 1 on a feasibility-constrained MDP problem, as introduced in Example 4. ... The experiments are conducted in two grid world environments: 8 × 8 and 20 × 20, where the agent starts at the upper left corner (marked as S in Figure 1a) and aims to reach a goal located at the bottom right corner (marked as G in Figure 1a). ... where the occupancy measure of the reference trajectory is computed prior to conducting the experiments. |
| Dataset Splits | No | The paper does not mention specific training/test/validation dataset splits. The experiments are conducted in grid world environments where agents generate trajectories through interaction, which is a common setup in reinforcement learning that does not typically involve predefined static dataset splits. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. The Reproducibility Checklist in Section G states for item (8): 'The execution environment for experiments... is described, including GPU/CPU makes and models... [no]'. |
| Software Dependencies | No | The paper mentions policy parameterization using a 'neural network with a single hidden layer' and refers to the 'STORM algorithm' as a concept, but it does not list any specific software libraries or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow). The Reproducibility Checklist in Section G, item (8) explicitly states '[no]' for describing the computing infrastructure, which includes software. |
| Experiment Setup | Yes | For the 8 × 8 grid world experiment (Figure 1), we use 12 different combinations of the hyperparameters, including the constraint violation b₀ ∈ {0.001, 0.005, 0.01, 0.05} and the initial step size for the primal parameter η_θ ∈ {0.23, 0.24, 0.25}... For the 20 × 20 grid world experiment (Figure 2), we use 60 different combinations of the hyperparameters, including the initial step size for the dual parameter η_λ ∈ {0.01, 0.02, 0.03, 0.04, 0.05}, the momentum update parameter α ∈ {0.03, 0.05, 0.07, 0.09}, and the constraint violation b₀ ∈ {0.001, 0.005, 0.01}. ... Iteration number T = 10000. Trajectory length H = 14. Discount factor γ = 0.9. Initial dual variable λ₀ = 1. Dual variable interval upper bound C₀ = 10 (Line 14 of Algorithm 1). |
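The momentum update parameter α in the setup above comes from the STORM-style variance-reduced gradient estimator that VR-PDPG builds on. The sketch below is a generic illustration of that recursion together with the reported hyperparameters, not the authors' implementation; the function name and argument names are illustrative assumptions.

```python
def storm_update(d_prev, grad_curr, grad_prev_at_curr_sample, alpha):
    """STORM-style momentum estimator:
    d_t = g(x_t; xi_t) + (1 - alpha) * (d_{t-1} - g(x_{t-1}; xi_t)),
    where both gradients are evaluated on the same fresh sample xi_t.
    alpha = 1 recovers the plain stochastic gradient."""
    return grad_curr + (1.0 - alpha) * (d_prev - grad_prev_at_curr_sample)

# Hyperparameters reported for the 20 x 20 grid world (see the table above).
config = {
    "T": 10000,        # number of iterations
    "H": 14,           # trajectory length
    "gamma": 0.9,      # discount factor
    "lambda0": 1.0,    # initial dual variable
    "C0": 10.0,        # upper bound of the dual-variable interval
    "alpha_grid": [0.03, 0.05, 0.07, 0.09],  # momentum parameter sweep
}
```

With a small α, the estimator keeps most of the previous direction `d_prev` and corrects it with the difference of two gradients on the same sample, which is what drives the variance reduction.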