Variance Reduced Smoothed Functional REINFORCE Policy Gradient Algorithms
Authors: Shalabh Bhatnagar, Deepak H R
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove the asymptotic convergence of all algorithms and show the results of several experiments on various MuJoCo locomotion tasks wherein we compare the performance of our algorithms with the recently proposed ARS algorithms in the literature as well as other well known algorithms, namely A2C, PPO and TRPO. Our algorithms are seen to be competitive against all algorithms and in fact show the best results on a majority of experiments. (Section 6, Numerical Results:) We show the results of experiments on various settings. The first set of experiments are on a simple 2D gridworld environment with different state sizes. Subsequently, we show the results of experiments on four different continuous control MuJoCo environments. |
| Researcher Affiliation | Academia | Shalabh Bhatnagar, Department of Computer Science and Automation, Indian Institute of Science, Bengaluru 560012, India; Deepak Ramachandra, Department of Computer Science and Automation, Indian Institute of Science, Bengaluru 560012, India |
| Pseudocode | Yes | Algorithm 1 Augmented Random Search (ARS): four versions V1, V1-t, V2 and V2-t |
| Open Source Code | Yes | https://github.com/deepakhr1999/smooth-functional-reinforce and https://github.com/deepakhr1999/ARS-SFR (forked from modestyachts/ARS to implement SFR and compare with ARS) |
| Open Datasets | Yes | We show the results of experiments on various MuJoCo locomotion tasks wherein we compare the performance of our algorithms with the recently proposed ARS algorithms in the literature as well as other well known algorithms, namely A2C, PPO and TRPO. We empirically study the performance of our algorithms along with their clipped and signed variants with the ARS algorithms on four different MuJoCo locomotion tasks, namely, Swimmer, Hopper, Half Cheetah and Walker2d, respectively. The gridworld environment consists of an L × L grid where the agent starts in the top-left corner and aims to reach the terminal state at the bottom-right. |
| Dataset Splits | No | The paper uses simulated environments (Gridworld and MuJoCo locomotion tasks) where data is generated dynamically through episodes rather than being split from a fixed, pre-existing dataset. The concepts of 'max episode length' and 'max interactions limit' refer to training budgets and termination conditions for episodes, not specific train/test/validation splits of a static dataset. For instance, for Gridworld, 'averaging outcomes over 10 random seeds and running each algorithm for a fixed number of interactions' describes evaluation methodology, not dataset splitting. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as CPU or GPU models, memory, or cloud computing specifications. It only refers to the experimental environments like 'MuJoCo locomotion tasks' and a 'Stochastic Gridworld Environment' without mentioning the underlying computational resources. |
| Software Dependencies | No | The paper mentions using implementations from 'Ji et al. (2024)' for baseline algorithms and refers to 'Raffin et al. (2021)' for PPO, TRPO, and A2C algorithms, as well as an 'ADAM optimizer'. However, it does not specify concrete version numbers for any of these software components (e.g., Python, PyTorch, Stable-Baselines3, or specific library versions), which is necessary for a reproducible description of software dependencies. |
| Experiment Setup | Yes | C.1.2 Algorithm Parameters: For the SF-REINFORCE algorithm we chose the parameters δ(n) = δ0 · (1/(50000 + n))^d and α(n) = α0 · 50000/(50000 + n), where n is the episode number. Here, we set α(0) = α0 = 2 × 10⁻⁶. In this setting, d < 0.5 is required for convergence. We experiment with two different schemes: a decay scheme where we vary d and set δ(0) = δ0 = 1, and a constant scheme with d = 0 and with varying δ0. To reduce the variance, we measure Gn as the average over 10 trials. C.2.1 Optimal Hyperparameters: [Tables 7-18 list detailed hyperparameters for ARS and SFR variants across different MuJoCo tasks, including delta_std, deltas_used, n_directions, step_size, and transform.] |
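The step-size and perturbation schedules quoted in the experiment-setup cell can be sketched as follows. This is a minimal illustration assuming the literal reading of the quoted formulas; the function names and the example value of d are placeholders, not taken from the authors' code:

```python
def delta(n, delta0=1.0, d=0.25):
    # Perturbation size: delta(n) = delta0 * (1 / (50000 + n))**d.
    # The paper requires d < 0.5 for convergence; d = 0 recovers the
    # constant scheme (delta(n) == delta0 for all n).
    return delta0 * (1.0 / (50000 + n)) ** d

def alpha(n, alpha0=2e-6):
    # Learning rate: alpha(n) = alpha0 * 50000 / (50000 + n),
    # so alpha(0) = alpha0 = 2e-6 as reported in C.1.2.
    return alpha0 * 50000.0 / (50000.0 + n)
```

Both schedules decay smoothly with the episode number n, and in the constant scheme (d = 0) only δ0 is tuned while the learning rate still anneals.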