Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity
Authors: Alessandro Montenegro, Marco Mussi, Matteo Papini, Alberto Maria Metelli
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 9, we numerically validate the proposed algorithm. Here, we analyze the behavior of PES and SL-PG in both AB and PB explorations, comparing them with their static stochasticity counterparts (GPOMDP and PGPE). We conduct the evaluations in the Swimmer-v5 environment, part of the MuJoCo (Todorov et al., 2012) control suite, using a horizon of T = 200. |
| Researcher Affiliation | Academia | 1Politecnico di Milano, Piazza Leonardo Da Vinci 32, 20133, Milan, Italy. Correspondence to: Alessandro Montenegro <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 PES. Input: number of phases P, iterations per phase (K_i)_{i=1}^P, initial parameter θ, stochasticity schedule (σ_i)_{i=1}^P, learning rate schedule (ζ_i)_{i=1}^P, batch size N. Initialize θ_0 ← θ; for p ∈ ⟦P⟧ do: θ_p ← run for K_p iterations a PB or AB PG from θ_{p−1}, with fixed stochasticity σ_p, learning rate ζ_p, batch size N; end; return θ_P. |
| Open Source Code | Yes | The code is available at https://github.com/MontenegroAlessandro/MagicRL. |
| Open Datasets | Yes | We conduct the evaluations in the Swimmer-v5 environment, part of the MuJoCo (Todorov et al., 2012) control suite, using a horizon of T = 200. |
| Dataset Splits | No | The paper specifies experimental parameters such as a batch size of N = 100 and a horizon of T = 200 for the reinforcement learning environments. However, it does not describe traditional training, validation, or test dataset splits in the sense common to supervised learning. In reinforcement learning, data is generated through interaction with the environment rather than being pre-split from a fixed dataset. |
| Hardware Specification | Yes | All the experiments were run on a 2019 16-inch MacBook Pro, equipped as follows: CPU: Intel Core i7 (6 cores, 2.6 GHz); RAM: 16 GB 2667 MHz DDR4; GPU: Intel UHD Graphics 630 (1536 MB). |
| Software Dependencies | No | The paper mentions that "All learning rates are managed by the Adam (Kingma & Ba, 2014) optimizer." While Adam is a specific optimizer, the paper does not provide version numbers for any software libraries, frameworks (e.g., Python, PyTorch, TensorFlow), or other dependencies used in the implementation. |
| Experiment Setup | Yes | For both PB and AB, we present PES with two different schedules, both starting with σ = 1. The first (A) schedule consists of P = 25 phases, each lasting K_p = 200 iterations, with a schedule exponent of y = 1. The second (B) schedule includes P = 5000 phases, each lasting K_p = 1 iteration, with a schedule exponent of y = 0.5. SL-PG is executed for K = 5000 iterations, using the common exponential parameterization for σ (i.e., σ = e^ξ). The static stochasticity counterparts are also run for K = 5000 iterations, employing stochasticity levels σ ∈ {1, 0.5, 0.04, 0.014}. |
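The phased loop in Algorithm 1 and the schedules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `run_pg_phase` is a hypothetical callback standing in for one PB or AB policy-gradient phase, and the polynomial decay `σ_p = 1 / (p + 1)^y` is an assumed form of the stochasticity schedule (the paper specifies only the starting value σ = 1 and the exponent y).

```python
def pes(run_pg_phase, theta0, iters_per_phase, sigmas, lrs, batch_size=100):
    """Sketch of the PES loop (Algorithm 1): run one PG phase per entry of
    the schedules, each with fixed stochasticity, learning rate, and batch
    size, chaining the parameters from phase to phase."""
    theta = theta0
    for K_p, sigma_p, zeta_p in zip(iters_per_phase, sigmas, lrs):
        # Hypothetical callback: runs a PB or AB policy gradient for K_p
        # iterations from the current parameters and returns the new ones.
        theta = run_pg_phase(theta, K_p, sigma_p, zeta_p, batch_size)
    return theta

# Schedule A from the experiment setup: P = 25 phases of K_p = 200 iterations,
# exponent y = 1 (decay form assumed for illustration).
P, y = 25, 1.0
sigmas = [1.0 / (p + 1) ** y for p in range(P)]
iters = [200] * P
```

Schedule B would instead use `P, y = 5000, 0.5` with `iters = [1] * P`, matching the per-iteration stochasticity updates described above.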