Convergence Analysis of Policy Gradient Methods with Dynamic Stochasticity

Authors: Alessandro Montenegro, Marco Mussi, Matteo Papini, Alberto Maria Metelli

ICML 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental — "In Section 9, we numerically validate the proposed algorithm. Here, we analyze the behavior of PES and SL-PG in both AB and PB explorations, comparing them with their static stochasticity counterparts (GPOMDP and PGPE). We conduct the evaluations in the Swimmer-v5 environment, part of the MuJoCo (Todorov et al., 2012) control suite, using a horizon of T = 200."
Researcher Affiliation: Academia — "1Politecnico di Milano, Piazza Leonardo Da Vinci 32, 20133, Milan, Italy. Correspondence to: Alessandro Montenegro <EMAIL>."
Pseudocode: Yes — "Algorithm 1 PES. Input: number of phases P, iterations per phase (K_i)_{i=1}^P, initial parameter θ, stochasticity schedule (σ_i)_{i=1}^P, learning rate schedule (ζ_i)_{i=1}^P, batch size N. Initialize θ_0 ← θ. For p ∈ ⟦P⟧: θ_p ← run a PB or AB PG from θ_{p-1} for K_p iterations, with fixed stochasticity σ_p, learning rate ζ_p, batch size N. Return θ_P."
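The phased structure of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: `run_pg_phase` is a hypothetical placeholder for the PB (PGPE-style) or AB (GPOMDP-style) inner update, which here applies a dummy zero gradient.

```python
# Sketch of the PES phase loop from Algorithm 1 (illustrative only).
# `run_pg_phase` is a hypothetical placeholder for a parameter-based or
# action-based policy-gradient update run with the phase's fixed stochasticity.
import numpy as np

def run_pg_phase(theta, sigma, lr, batch_size, iterations):
    """Placeholder PG inner loop: returns an updated parameter vector."""
    for _ in range(iterations):
        # A real implementation would estimate a policy gradient from
        # `batch_size` trajectories sampled with stochasticity `sigma`.
        grad = np.zeros_like(theta)  # dummy gradient for the sketch
        theta = theta + lr * grad
    return theta

def pes(theta0, iters_per_phase, sigmas, lrs, batch_size):
    """PES: run P phases of a PG method, each with fixed per-phase
    stochasticity sigma_p and learning rate zeta_p (Algorithm 1)."""
    theta = np.asarray(theta0, dtype=float)
    for K_p, sigma_p, zeta_p in zip(iters_per_phase, sigmas, lrs):
        theta = run_pg_phase(theta, sigma_p, zeta_p, batch_size, K_p)
    return theta
```

The point of the structure is that stochasticity and learning rate change only between phases, never within one, matching the "fixed stochasticity σ_p, learning rate ζ_p" wording of the pseudocode.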
Open Source Code: Yes — "The code is available at https://github.com/MontenegroAlessandro/MagicRL."
Open Datasets: Yes — "We conduct the evaluations in the Swimmer-v5 environment, part of the MuJoCo (Todorov et al., 2012) control suite, using a horizon of T = 200."
Dataset Splits: No — The paper specifies experimental parameters such as "Batch size N = 100" and "a horizon of T = 200" for the reinforcement learning environments. However, it does not explicitly describe traditional training, validation, or test dataset splits in the context of static datasets, as is common in supervised learning. In reinforcement learning, data is generated through interaction with the environment rather than being pre-split from a fixed dataset.
Hardware Specification: Yes — "All the experiments were run on a 2019 16-inch MacBook Pro. The machine was equipped as follows: CPU: Intel Core i7 (6 cores, 2.6 GHz); RAM: 16 GB 2667 MHz DDR4; GPU: Intel UHD Graphics 630 (1536 MB)."
Software Dependencies: No — The paper mentions that "All learning rates are managed by the Adam (Kingma & Ba, 2014) optimizer." While Adam is a specific optimizer, the paper does not provide version numbers for any software libraries, frameworks (e.g., Python, PyTorch, TensorFlow), or other dependencies used in the implementation.
Experiment Setup: Yes — "For both PB and AB, we present PES with two different schedules, both starting with σ = 1. The first (A) schedule consists of P = 25 phases, each lasting K_p = 200 iterations, with a schedule exponent of y = 1. The second (B) schedule includes P = 5000 phases, each lasting K_p = 1 iteration, with a schedule exponent of y = 0.5. SL-PG is executed for K = 5000 iterations, using the common exponential parameterization for σ (i.e., σ = e^ξ). The static stochasticity counterparts are also run for K = 5000 iterations, employing stochasticity levels σ ∈ {1, 0.5, 0.04, 0.014}."
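The two schedules above can be made concrete with a short sketch. The exact decay law is an assumption on our part: we take a polynomial schedule σ_p = σ_1 · p^(−y), which is one natural reading of "schedule exponent y" with σ_1 = 1; the paper's precise formula may differ.

```python
# Sketch of the two PES stochasticity schedules (A and B) described above.
# ASSUMPTION: polynomial decay sigma_p = sigma_1 * p**(-y); the paper's
# exact schedule formula may differ.
import math

def sigma_schedule(num_phases, y, sigma_1=1.0):
    """Per-phase stochasticity levels for phases p = 1, ..., P."""
    return [sigma_1 * p ** (-y) for p in range(1, num_phases + 1)]

# Schedule A: P = 25 phases of K_p = 200 iterations each, exponent y = 1.
schedule_a = sigma_schedule(25, y=1.0)
# Schedule B: P = 5000 phases of K_p = 1 iteration each, exponent y = 0.5.
schedule_b = sigma_schedule(5000, y=0.5)

# SL-PG's exponential parameterization: sigma = e^xi keeps sigma positive
# while xi can be optimized without constraints.
sigma_from_xi = math.exp(-1.0)
```

Note that both schedules total the same budget (25 × 200 = 5000 × 1 = 5000 iterations), which matches the K = 5000 iterations used for SL-PG and the static-stochasticity baselines.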