Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity

Authors: Alessandro Montenegro, Marco Mussi, Matteo Papini, Alberto Maria Metelli

ICML 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental — "In Section 9, we numerically validate the proposed algorithm. Here, we analyze the behavior of PES and SL-PG in both AB and PB explorations, comparing them with their static stochasticity counterparts (GPOMDP and PGPE). We conduct the evaluations in the Swimmer-v5 environment, part of the MuJoCo (Todorov et al., 2012) control suite, using a horizon of T = 200."
Researcher Affiliation: Academia — "1Politecnico di Milano, Piazza Leonardo Da Vinci 32, 20133, Milan, Italy. Correspondence to: Alessandro Montenegro <EMAIL>."
Pseudocode: Yes — "Algorithm 1 PES. Input: number of phases P, iterations per phase (K_i)_{i=1}^P, initial parameter θ, stochasticity schedule (σ_i)_{i=1}^P, learning rate schedule (ζ_i)_{i=1}^P, batch size N. Initialize θ_0 ← θ. For p ∈ ⟦P⟧: θ_p ← run a PB or AB PG from θ_{p-1} for K_p iterations, with fixed stochasticity σ_p, learning rate ζ_p, batch size N. Return θ_P."
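The phased structure of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: `run_pg_phase` is a hypothetical placeholder for the PB (PGPE-style) or AB (GPOMDP-style) inner update, which here applies a dummy zero gradient.

```python
# Sketch of the PES phase loop from Algorithm 1 (illustrative only).
# `run_pg_phase` is a hypothetical placeholder for a parameter-based or
# action-based policy-gradient update run with the phase's fixed stochasticity.
import numpy as np

def run_pg_phase(theta, sigma, lr, batch_size, iterations):
    """Placeholder PG inner loop: returns an updated parameter vector."""
    for _ in range(iterations):
        # A real implementation would estimate a policy gradient from
        # `batch_size` trajectories sampled with stochasticity `sigma`.
        grad = np.zeros_like(theta)  # dummy gradient for the sketch
        theta = theta + lr * grad
    return theta

def pes(theta0, iters_per_phase, sigmas, lrs, batch_size):
    """PES: run P phases of a PG method, each with fixed per-phase
    stochasticity sigma_p and learning rate zeta_p (Algorithm 1)."""
    theta = np.asarray(theta0, dtype=float)
    for K_p, sigma_p, zeta_p in zip(iters_per_phase, sigmas, lrs):
        theta = run_pg_phase(theta, sigma_p, zeta_p, batch_size, K_p)
    return theta
```

The point of the structure is that stochasticity and learning rate change only between phases, never within one, matching the "fixed stochasticity σ_p, learning rate ζ_p" wording of the pseudocode.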
Open Source Code: Yes — "The code is available at https://github.com/MontenegroAlessandro/MagicRL."
Open Datasets: Yes — "We conduct the evaluations in the Swimmer-v5 environment, part of the MuJoCo (Todorov et al., 2012) control suite, using a horizon of T = 200."
Dataset Splits: No — The paper specifies experimental parameters such as "Batch size N = 100" and "a horizon of T = 200" for the reinforcement learning environments. However, it does not explicitly describe traditional training, validation, or test dataset splits in the context of static datasets, as is common in supervised learning. In reinforcement learning, data is generated through interaction with the environment rather than being pre-split from a fixed dataset.
Hardware Specification: Yes — "All the experiments were run on a 2019 16-inch MacBook Pro. The machine was equipped as follows: CPU: Intel Core i7 (6 cores, 2.6 GHz); RAM: 16 GB 2667 MHz DDR4; GPU: Intel UHD Graphics 630 (1536 MB)."
Software Dependencies: No — The paper mentions that "All learning rates are managed by the Adam (Kingma & Ba, 2014) optimizer." While Adam is a specific optimizer, the paper does not provide version numbers for any software libraries, frameworks (e.g., Python, PyTorch, TensorFlow), or other dependencies used in the implementation.
Experiment Setup: Yes — "For both PB and AB, we present PES with two different schedules, both starting with σ = 1. The first (A) schedule consists of P = 25 phases, each lasting K_p = 200 iterations, with a schedule exponent of y = 1. The second (B) schedule includes P = 5000 phases, each lasting K_p = 1 iteration, with a schedule exponent of y = 0.5. SL-PG is executed for K = 5000 iterations, using the common exponential parameterization for σ (i.e., σ = e^ξ). The static stochasticity counterparts are also run for K = 5000 iterations, employing stochasticity levels σ ∈ {1, 0.5, 0.04, 0.014}."
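The two schedules above can be made concrete with a short sketch. The exact decay law is an assumption on our part: we take a polynomial schedule σ_p = σ_1 · p^(−y), which is one natural reading of "schedule exponent y" with σ_1 = 1; the paper's precise formula may differ.

```python
# Sketch of the two PES stochasticity schedules (A and B) described above.
# ASSUMPTION: polynomial decay sigma_p = sigma_1 * p**(-y); the paper's
# exact schedule formula may differ.
import math

def sigma_schedule(num_phases, y, sigma_1=1.0):
    """Per-phase stochasticity levels for phases p = 1, ..., P."""
    return [sigma_1 * p ** (-y) for p in range(1, num_phases + 1)]

# Schedule A: P = 25 phases of K_p = 200 iterations each, exponent y = 1.
schedule_a = sigma_schedule(25, y=1.0)
# Schedule B: P = 5000 phases of K_p = 1 iteration each, exponent y = 0.5.
schedule_b = sigma_schedule(5000, y=0.5)

# SL-PG's exponential parameterization: sigma = e^xi keeps sigma positive
# while xi can be optimized without constraints.
sigma_from_xi = math.exp(-1.0)
```

Note that both schedules total the same budget (25 × 200 = 5000 × 1 = 5000 iterations), which matches the K = 5000 iterations used for SL-PG and the static-stochasticity baselines.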