Reinforcement Learning with Random Time Horizons

Authors: Enric Ribera Borrell, Lorenz Richter, Christof Schütte

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our numerical experiments demonstrate that using the proposed formulas can significantly improve optimization convergence compared to traditional approaches. In multiple numerical experiments, we systematically investigate the effect of incorporating the randomness of the time horizon in the gradient computation. For most cases we can see significant performance improvements of our gradient formulas compared to the standard ones, in particular in terms of convergence speed.
Researcher Affiliation | Collaboration | (1) Zuse Institute Berlin, 14195 Berlin, Germany; (2) Institute of Mathematics, Free University Berlin, 14195 Berlin, Germany; (3) dida Datenschmiede GmbH, 10827 Berlin, Germany.
Pseudocode | Yes | We refer to Algorithms 1-4 in Appendix D for further computational details. Algorithm 1: Trajectory Policy Gradient (REINFORCE with random time horizon); Algorithm 2: State-space Policy Gradient; Algorithm 3: Trajectory and model-based Deterministic Policy Gradient; Algorithm 4: State-space and model-based Deterministic Policy Gradient.
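To make the structure of Algorithm 1 concrete, here is a minimal numpy sketch of a trajectory policy gradient (REINFORCE) where each trajectory runs until a random hitting time rather than a fixed horizon. The 1-D environment, dynamics, constants, and function names are illustrative assumptions for this sketch, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, sigma=0.5, target=1.0, max_steps=1000):
    """One trajectory of a 1-D controlled diffusion, run until the random
    hitting time of `target` (with `max_steps` only as a safeguard).
    Returns per-step score-function terms and rewards."""
    s, scores, rewards = 0.0, [], []
    for _ in range(max_steps):
        a = theta + sigma * rng.standard_normal()   # sample Gaussian policy
        scores.append((a - theta) / sigma**2)       # d/dtheta log N(a | theta, sigma^2)
        s += 0.2 * a + 0.1 * rng.standard_normal()  # noisy dynamics
        rewards.append(-1.0)                        # running cost: -1 per step
        if s >= target:                             # random time horizon: hitting time
            break
    return np.array(scores), np.array(rewards)

def trajectory_pg_step(theta, K=100, lr=1e-3):
    """One gradient-ascent step of trajectory policy gradient (REINFORCE):
    average the full-trajectory score times the return over K rollouts,
    where each rollout has its own random length."""
    estimate = 0.0
    for _ in range(K):
        scores, rewards = rollout(theta)
        estimate += scores.sum() * rewards.sum() / K
    return theta + lr * estimate
```

Since the running cost is -1 per step, the return is the negative trajectory length, so ascent steps push the policy toward reaching the target set sooner.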
Open Source Code | Yes | The code can be found at https://github.com/riberaborrell/rl-random-times.
Open Datasets | Yes | The mountain car problem is a classical benchmark in reinforcement learning... The reacher environment contains a two-joint robot arm... see Towers et al. (2024) for details.
Dataset Splits | No | The paper describes methods for sampling trajectories and experiences for training, but it does not specify explicit training/validation/test splits as would be typical for static datasets. Evaluation is instead conducted by monitoring performance metrics over training iterations: Algorithms 1-4 mention simulating "K samples of trajectories" and sampling "M experiences from memory" but do not describe separate validation or test sets.
Hardware Specification | Yes | We also note that each experiment requires only one CPU core, and the maximum value of allocated memory is set to 64 GB.
Software Dependencies | No | The paper describes the neural-network architecture and provides algorithms, but it does not name specific software libraries or the version numbers needed for replication.
Experiment Setup | Yes | For the experiment described in Section 3.1 we consider a Gaussian stochastic policy for which µ and σ are represented by a two-head neural network (see details in Appendix E.1) with L = 3 layers and d1 = d2 = 32 units. We compare the three different policy gradient formulas by implementing Algorithm 1 (trajectory PG), Algorithm 2 (state-space PG) and Algorithm 2 without estimating the Zπ-factor (state-space PG unbiased) for a batch of K = 100 trajectories, a batch of experiences containing all the information in the memory (M = 100% of the memory size), and we stop the optimization algorithm after I = 5 × 10^4 gradient iterations. The best performing learning rates for each gradient approach are λ_traj = λ_state = 10^-4 and λ_state^biased = 5 × 10^-2, respectively.
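The paper specifies the policy architecture (a two-head network with L = 3 layers and d1 = d2 = 32 hidden units) but, as noted above, not the software framework. The following plain-numpy sketch shows one plausible forward pass for such a two-head Gaussian policy; the tanh activation and the exponential parameterization of σ are assumptions for illustration, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def init_two_head_mlp(state_dim, action_dim, hidden=(32, 32)):
    """Initialize a shared trunk (two hidden layers of 32 units, matching
    d1 = d2 = 32) plus two output heads, one for mu and one for sigma."""
    sizes = (state_dim,) + hidden
    params = {"trunk": [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
                        for m, n in zip(sizes[:-1], sizes[1:])]}
    params["mu"] = (rng.normal(scale=0.1, size=(hidden[-1], action_dim)),
                    np.zeros(action_dim))
    params["log_sigma"] = (rng.normal(scale=0.1, size=(hidden[-1], action_dim)),
                           np.zeros(action_dim))
    return params

def policy_forward(params, s):
    """Map a state to the mean and standard deviation of a Gaussian policy."""
    h = s
    for W, b in params["trunk"]:
        h = np.tanh(h @ W + b)          # shared trunk
    W, b = params["mu"]
    mu = h @ W + b                      # mean head
    W, b = params["log_sigma"]
    sigma = np.exp(h @ W + b)           # positivity of sigma via exp
    return mu, sigma
```

Sharing the trunk and splitting only at the final layer is one common design for two-head Gaussian policies; the paper's Appendix E.1 would determine the exact split point.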