Policy-based Primal-Dual Methods for Concave CMDP with Variance Reduction

Authors: Donghao Ying, Mengzi Amy Guo, Hyunin Lee, Yuhao Ding, Javad Lavaei, Zuo-Jun Max Shen

JAIR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we validate our methods through numerical experiments. ... The experiments are conducted in two grid world environments: 8 × 8 and 20 × 20, where the agent starts at the upper left corner (marked as S in Figure 1a) and aims to reach a goal located at the bottom right corner (marked as G in Figure 1a). ... The results shown in Figures 1 and 2 demonstrate the significant effectiveness of VR-PDPG.
Researcher Affiliation | Collaboration | DONGHAO YING, University of California, Berkeley, USA; MENGZI AMY GUO, University of California, Berkeley, USA; HYUNIN LEE, University of California, Berkeley, USA; YUHAO DING, Cubist Systematic Strategies, USA; JAVAD LAVAEI, University of California, Berkeley, USA; ZUO-JUN MAX SHEN, University of California, Berkeley, USA
Pseudocode | Yes | Algorithm 1: Variance-Reduced Primal-Dual Policy Gradient Algorithm (VR-PDPG)
Open Source Code | Yes | Our experiment codes are available at https://github.com/hyunin-lee/VR-PDPG
Open Datasets | No | We evaluate Algorithm 1 on a feasibility-constrained MDP problem, as introduced in Example 4. ... The experiments are conducted in two grid world environments: 8 × 8 and 20 × 20, where the agent starts at the upper left corner (marked as S in Figure 1a) and aims to reach a goal located at the bottom right corner (marked as G in Figure 1a). ... where λ_e is the occupancy measure of the reference trajectory, computed prior to conducting the experiments.
Dataset Splits | No | The paper does not mention specific training/validation/test dataset splits. The experiments are conducted in grid world environments where agents generate trajectories through interaction, a common reinforcement learning setup that does not typically involve predefined static dataset splits.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. The Reproducibility Checklist in Section G states for item (8): 'The execution environment for experiments... is described, including GPU/CPU makes and models... [no]'.
Software Dependencies | No | The paper mentions policy parameterization using a 'neural network with a single hidden layer' and refers to the 'STORM algorithm' as a concept, but it does not list any specific software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow). The Reproducibility Checklist in Section G, item (8), explicitly states '[no]' for describing the computing infrastructure, which includes software.
Experiment Setup | Yes | For the 8 × 8 grid world experiment (Figure 1), we use 12 different combinations of the hyperparameters, including the constraint violation d_0 ∈ {0.001, 0.005, 0.01, 0.05} and the initial step size for the primal parameter η_θ ∈ {0.23, 0.24, 0.25}... For the 20 × 20 grid world experiment (Figure 2), we use 60 different combinations of the hyperparameters, including the initial step size for the dual parameter η_μ ∈ {0.01, 0.02, 0.03, 0.04, 0.05}, the momentum update parameter α ∈ {0.03, 0.05, 0.07, 0.09}, and the constraint violation d_0 ∈ {0.001, 0.005, 0.01}. ... Iteration number T = 10000. Trajectory length H = 14. Gamma γ = 0.9. Initial dual variable μ_0 = 1. Dual variable interval upper bound C_0 = 10 (Line 14 of Algorithm 1).
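The hyperparameter sweeps quoted above can be reproduced as a Cartesian product over the stated grids. The sketch below is illustrative only: the variable names (d0, eta_theta, etc.) are assumptions, and the released code at https://github.com/hyunin-lee/VR-PDPG may organize its sweeps differently. It simply confirms that the grids yield the 12 and 60 combinations the paper reports.

```python
from itertools import product

# Hyperparameter grids as quoted from the paper's experiment setup.
# Names are illustrative, not taken from the authors' code.
grid_8x8 = {
    "d0": [0.001, 0.005, 0.01, 0.05],          # constraint violation d_0
    "eta_theta": [0.23, 0.24, 0.25],           # initial primal step size
}
grid_20x20 = {
    "eta_mu": [0.01, 0.02, 0.03, 0.04, 0.05],  # initial dual step size
    "alpha": [0.03, 0.05, 0.07, 0.09],         # momentum update parameter
    "d0": [0.001, 0.005, 0.01],                # constraint violation d_0
}

def combos(grid):
    """Enumerate every hyperparameter combination as a dict."""
    keys = list(grid)
    return [dict(zip(keys, vals)) for vals in product(*grid.values())]

print(len(combos(grid_8x8)))    # 12 combinations (4 x 3)
print(len(combos(grid_20x20)))  # 60 combinations (5 x 4 x 3)
```

Each dict returned by `combos` would then be paired with the fixed settings quoted above (T = 10000, H = 14, γ = 0.9, μ_0 = 1, C_0 = 10) to define one run.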