On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method
Authors: Junyu Zhang, Chengzhuo Ni, Zheng Yu, Csaba Szepesvári, Mengdi Wang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this experiment, we aim to evaluate the performance of the TSIVR-PG algorithm for maximizing the cumulative sum of reward. As the benchmarks, we also implement the SVRPG [49], the SRVRPG [48], the HSPGA [33], and the REINFORCE [47] algorithms. Our experiment is performed on benchmark RL environments including the Frozen Lake, Acrobot and Cartpole that are available from Open AI gym [8]. |
| Researcher Affiliation | Academia | Junyu Zhang, Department of Industrial Systems Engineering and Management, National University of Singapore, Singapore, 119077; Chengzhuo Ni, Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ, 08544; Zheng Yu, Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ, 08544; Csaba Szepesvári, Department of Computer Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8; Mengdi Wang, Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ, 08544 |
| Pseudocode | Yes | Algorithm 1: The TSIVR-PG Algorithm |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Our experiment is performed on benchmark RL environments including the Frozen Lake, Acrobot and Cartpole that are available from Open AI gym [8], which is a well-known toolkit for developing and comparing reinforcement learning algorithms. |
| Dataset Splits | No | The paper describes using standard RL environments but does not provide specific details on train/validation/test dataset splits, percentages, or methodologies for partitioning data to reproduce the experiment's data partitioning. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using 'Open AI gym [8]' but does not provide specific version numbers for this or any other software dependencies, which would be necessary for reproducibility. |
| Experiment Setup | Yes | For all the algorithms, their batch sizes are chosen according to their theory. In detail, let ϵ be any target accuracy. For both TSIVR-PG and SRVR-PG, we set N = Θ(ϵ^{-2}), B = m = Θ(ϵ^{-1}). For SVRPG, we set N = Θ(ϵ^{-2}), B = Θ(ϵ^{-4/3}) and m = Θ(ϵ^{-2/3}). For HSPGA, we set B = Θ(ϵ^{-1}); other parameters are calculated according to formulas in [33] given B. For REINFORCE, we set the batch size to be N = Θ(ϵ^{-2}). The parameter ε and the stepsize/learning rate are tuned for each individual algorithm using a grid search. For both environments, we use a neural network with two hidden layers of width 64 to model the policy. We choose σ = 0.125 in our experiment. |
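The batch-size orders quoted in the Experiment Setup row can be sketched as a small helper. Note the caveat: the paper only reports Θ(·) orders, so the unit constants inside each Θ(·) (and the helper names themselves) are assumptions for illustration, not the authors' actual values.

```python
def tsivr_pg_batches(eps: float) -> tuple[int, int, int]:
    """Hypothetical batch-size schedule for TSIVR-PG / SRVR-PG from the
    reported orders N = Θ(ε^-2), B = m = Θ(ε^-1).
    Θ-constants are assumed to be 1 (not specified in the paper)."""
    N = max(1, round(eps ** -2))  # outer (reference-gradient) batch size
    B = max(1, round(eps ** -1))  # inner mini-batch size
    m = B                         # epoch length, same order as B
    return N, B, m


def svrpg_batches(eps: float) -> tuple[int, int, int]:
    """Hypothetical schedule for SVRPG: N = Θ(ε^-2), B = Θ(ε^-4/3),
    m = Θ(ε^-2/3), again with Θ-constants assumed to be 1."""
    N = max(1, round(eps ** -2))
    B = max(1, round(eps ** (-4.0 / 3.0)))
    m = max(1, round(eps ** (-2.0 / 3.0)))
    return N, B, m
```

For example, at target accuracy ε = 0.1 this sketch gives N = 100 and B = m = 10 for TSIVR-PG/SRVR-PG, illustrating why the variance-reduced methods query far fewer trajectories per inner step than the N = Θ(ϵ^{-2}) batches REINFORCE draws at every update.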