Policy-based Primal-Dual Methods for Concave CMDP with Variance Reduction
Authors: Donghao Ying, Mengzi Amy Guo, Hyunin Lee, Yuhao Ding, Javad Lavaei, Zuo-Jun Max Shen
JAIR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we validate our methods through numerical experiments. ... The experiments are conducted in two grid world environments: 8 × 8 and 20 × 20, where the agent starts at the upper left corner (marked as S in Figure 1a) and aims to reach a goal located at the bottom right corner (marked as G in Figure 1a). ... The results shown in Figures 1 and 2 demonstrate the significant effectiveness of VR-PDPG. |
| Researcher Affiliation | Collaboration | DONGHAO YING, University of California, Berkeley, USA MENGZI AMY GUO, University of California, Berkeley, USA HYUNIN LEE, University of California, Berkeley, USA YUHAO DING, Cubist Systematic Strategies, USA JAVAD LAVAEI, University of California, Berkeley, USA ZUO-JUN MAX SHEN, University of California, Berkeley, USA |
| Pseudocode | Yes | Algorithm 1 Variance-Reduced Primal-Dual Policy Gradient Algorithm (VR-PDPG) |
| Open Source Code | Yes | Our experiment codes are available at https://github.com/hyunin-lee/VR-PDPG |
| Open Datasets | No | We evaluate Algorithm 1 on a feasibility-constrained MDP problem, as introduced in Example 4. ... The experiments are conducted in two grid world environments: 8 × 8 and 20 × 20, where the agent starts at the upper left corner (marked as S in Figure 1a) and aims to reach a goal located at the bottom right corner (marked as G in Figure 1a). ... where the occupancy measure of the reference trajectory is computed prior to conducting the experiments. |
| Dataset Splits | No | The paper does not mention specific training/test/validation dataset splits. The experiments are conducted in grid world environments where agents generate trajectories through interaction, which is a common setup in reinforcement learning that does not typically involve predefined static dataset splits. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. The Reproducibility Checklist in Section G states for item (8): 'The execution environment for experiments... is described, including GPU/CPU makes and models... [no]'. |
| Software Dependencies | No | The paper mentions policy parameterization using a 'neural network with a single hidden layer' and refers to the 'STORM algorithm' as a concept, but it does not list any specific software libraries or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow). The Reproducibility Checklist in Section G, item (8) explicitly states '[no]' for describing the computing infrastructure, which includes software. |
| Experiment Setup | Yes | For the 8 × 8 grid world experiment (Figure 1), we use 12 different combinations of the hyperparameters, including the constraint violation b₀ ∈ {0.001, 0.005, 0.01, 0.05} and the initial step size for the primal parameter η_θ ∈ {0.23, 0.24, 0.25}... For the 20 × 20 grid world experiment (Figure 2), we use 60 different combinations of the hyperparameters, including the initial step size for the dual parameter η_λ ∈ {0.01, 0.02, 0.03, 0.04, 0.05}, the momentum update parameter α ∈ {0.03, 0.05, 0.07, 0.09}, and the constraint violation b₀ ∈ {0.001, 0.005, 0.01}. ... Iteration number T = 10000. Trajectory length H = 14. Discount factor γ = 0.9. Initial dual variable λ₀ = 1. Dual variable interval upper bound C₀ = 10 (Line 14 of Algorithm 1). |
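The momentum update parameter α in the setup above comes from the STORM-style variance-reduced gradient estimator that VR-PDPG builds on. The sketch below is a generic illustration of that recursion together with the reported hyperparameters, not the authors' implementation; the function name and argument names are illustrative assumptions.

```python
def storm_update(d_prev, grad_curr, grad_prev_at_curr_sample, alpha):
    """STORM-style momentum estimator:
    d_t = g(x_t; xi_t) + (1 - alpha) * (d_{t-1} - g(x_{t-1}; xi_t)),
    where both gradients are evaluated on the same fresh sample xi_t.
    alpha = 1 recovers the plain stochastic gradient."""
    return grad_curr + (1.0 - alpha) * (d_prev - grad_prev_at_curr_sample)

# Hyperparameters reported for the 20 x 20 grid world (see the table above).
config = {
    "T": 10000,        # number of iterations
    "H": 14,           # trajectory length
    "gamma": 0.9,      # discount factor
    "lambda0": 1.0,    # initial dual variable
    "C0": 10.0,        # upper bound of the dual-variable interval
    "alpha_grid": [0.03, 0.05, 0.07, 0.09],  # momentum parameter sweep
}
```

With a small α, the estimator keeps most of the previous direction `d_prev` and corrects it with the difference of two gradients on the same sample, which is what drives the variance reduction.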