Reinforcement Learning with Random Time Horizons

Authors: Enric Ribera Borrell, Lorenz Richter, Christof Schütte

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our numerical experiments demonstrate that using the proposed formulas can significantly improve optimization convergence compared to traditional approaches. In multiple numerical experiments, we systematically investigate the effect of incorporating the randomness of the time horizon in the gradient computation. For most cases we can see significant performance improvements of our gradient formulas compared to the standard ones, in particular in terms of convergence speed.
Researcher Affiliation | Collaboration | (1) Zuse Institute Berlin, 14195 Berlin, Germany; (2) Institute of Mathematics, Free University Berlin, 14195 Berlin, Germany; (3) dida Datenschmiede GmbH, 10827 Berlin, Germany.
Pseudocode | Yes | We refer to Algorithms 1-4 in Appendix D for further computational details. Algorithm 1: Trajectory Policy Gradient (REINFORCE with random time horizon); Algorithm 2: State-space Policy Gradient; Algorithm 3: Trajectory and model-based Deterministic Policy Gradient; Algorithm 4: State-space and model-based Deterministic Policy Gradient.
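To make the structure of Algorithm 1 concrete, here is a minimal numpy sketch of a trajectory policy gradient (REINFORCE) where each trajectory runs until a random hitting time rather than a fixed horizon. The 1-D environment, dynamics, constants, and function names are illustrative assumptions for this sketch, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, sigma=0.5, target=1.0, max_steps=1000):
    """One trajectory of a 1-D controlled diffusion, run until the random
    hitting time of `target` (with `max_steps` only as a safeguard).
    Returns per-step score-function terms and rewards."""
    s, scores, rewards = 0.0, [], []
    for _ in range(max_steps):
        a = theta + sigma * rng.standard_normal()   # sample Gaussian policy
        scores.append((a - theta) / sigma**2)       # d/dtheta log N(a | theta, sigma^2)
        s += 0.2 * a + 0.1 * rng.standard_normal()  # noisy dynamics
        rewards.append(-1.0)                        # running cost: -1 per step
        if s >= target:                             # random time horizon: hitting time
            break
    return np.array(scores), np.array(rewards)

def trajectory_pg_step(theta, K=100, lr=1e-3):
    """One gradient-ascent step of trajectory policy gradient (REINFORCE):
    average the full-trajectory score times the return over K rollouts,
    where each rollout has its own random length."""
    estimate = 0.0
    for _ in range(K):
        scores, rewards = rollout(theta)
        estimate += scores.sum() * rewards.sum() / K
    return theta + lr * estimate
```

Since the running cost is -1 per step, the return is the negative trajectory length, so ascent steps push the policy toward reaching the target set sooner.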
Open Source Code | Yes | The code can be found at https://github.com/riberaborrell/rl-random-times.
Open Datasets | Yes | The mountain car problem is a classical benchmark in reinforcement learning... The reacher environment contains a two-joint robot arm... see Towers et al. (2024) for details.
Dataset Splits | No | The paper describes methods for sampling trajectories and experiences for training, but it does not specify explicit training/validation/test splits as would be typical for static datasets. Evaluation is instead conducted by monitoring performance metrics over training iterations: Algorithms 1-4 mention simulating "K samples of trajectories" and sampling "M experiences from memory" but do not describe separate validation or test sets.
Hardware Specification | Yes | We also note that each experiment requires only one CPU core, and the maximum value of allocated memory is set to 64 GB.
Software Dependencies | No | The paper describes the neural-network architecture and provides algorithms, but it does not name specific software libraries or the version numbers needed for replication.
Experiment Setup | Yes | For the experiment described in Section 3.1 we consider a Gaussian stochastic policy for which µ and σ are represented by a two-head neural network (see details in Appendix E.1) with L = 3 layers and d1 = d2 = 32 units. We compare the three different policy gradient formulas by implementing Algorithm 1 (trajectory PG), Algorithm 2 (state-space PG) and Algorithm 2 without estimating the Zπ-factor (state-space PG unbiased) for a batch of K = 100 trajectories, a batch of experiences containing all the information in the memory (M = 100% of the memory size), and we stop the optimization algorithm after I = 5 × 10^4 gradient iterations. The best performing learning rates for each gradient approach are λ_traj = λ_state = 10^-4 and λ_state^biased = 5 × 10^-2, respectively.
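The paper specifies the policy architecture (a two-head network with L = 3 layers and d1 = d2 = 32 hidden units) but, as noted above, not the software framework. The following plain-numpy sketch shows one plausible forward pass for such a two-head Gaussian policy; the tanh activation and the exponential parameterization of σ are assumptions for illustration, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def init_two_head_mlp(state_dim, action_dim, hidden=(32, 32)):
    """Initialize a shared trunk (two hidden layers of 32 units, matching
    d1 = d2 = 32) plus two output heads, one for mu and one for sigma."""
    sizes = (state_dim,) + hidden
    params = {"trunk": [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
                        for m, n in zip(sizes[:-1], sizes[1:])]}
    params["mu"] = (rng.normal(scale=0.1, size=(hidden[-1], action_dim)),
                    np.zeros(action_dim))
    params["log_sigma"] = (rng.normal(scale=0.1, size=(hidden[-1], action_dim)),
                           np.zeros(action_dim))
    return params

def policy_forward(params, s):
    """Map a state to the mean and standard deviation of a Gaussian policy."""
    h = s
    for W, b in params["trunk"]:
        h = np.tanh(h @ W + b)          # shared trunk
    W, b = params["mu"]
    mu = h @ W + b                      # mean head
    W, b = params["log_sigma"]
    sigma = np.exp(h @ W + b)           # positivity of sigma via exp
    return mu, sigma
```

Sharing the trunk and splitting only at the final layer is one common design for two-head Gaussian policies; the paper's Appendix E.1 would determine the exact split point.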