A limitation on black-box dynamics approaches to Reinforcement Learning
Authors: Brieuc Pinon, Raphael Jungers, Jean-Charles Delvenne
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform numerical experiments and show that a planning algorithm leveraging a learned model of the dynamics also efficiently solves the problems in the family. Our theoretical and numerical results suggest that some ideas present in them could help solve problems otherwise intractable for a large class of classical RL methods. We illustrate and confirm numerically how practical deep RL methods perform on the family of RL problems constructed in the proof of Theorem 4.1. The results are presented in Figure 3. |
| Researcher Affiliation | Academia | Brieuc Pinon EMAIL, Department of Mathematical Engineering, UCLouvain. Raphaël Jungers EMAIL, Department of Mathematical Engineering, UCLouvain. Jean-Charles Delvenne EMAIL, Department of Mathematical Engineering, UCLouvain. |
| Pseudocode | Yes | Algorithm 1 Encoder and decoder of states. Algorithm 2 Env: an interface linked to an RL problem defined by the transition operator P. Algorithm 3 A goal-conditioned algorithm. Algorithm 4 Env: an interface linked to an RL problem defined by the transition operator P and a G-profile ϕG (Definition B.1). Algorithm 5 A second goal-conditioned algorithm. Algorithm 6 Fitted Q-iteration (with ϵ-greedy exploration). Algorithm 7 Fitted Q-iteration implemented with the interface defined in Algorithm 2. Algorithm 8 Deep Policy Gradient with Value function learning (Actor-Critic). Algorithm 9 Deep Policy Gradient with Value function learning (Actor-Critic) implemented with the interface defined in Algorithm 2. Algorithm 10 Tree-search implementation using the interface defined in Algorithm 2. Algorithm 11 Alpha Zero planning functions. Algorithm 12 Alpha Zero. Algorithm 13 Alpha Zero planning functions implemented with the interface defined in Algorithm 2, fitting Definition 3.2. Algorithm 14 Alpha Zero implemented with the interface defined in Algorithm 2, fitting Definition 3.2. Algorithm 15 Neural network training for a neural network with parameters θ = (W ∈ ℝ^{m×n}, θ′ ∈ ℝ^k) for some n, m, k ∈ ℕ, interpreted as nn_θ(x) = q_{θ′}(Wx), where q outputs a real and is smooth w.r.t. θ′ and its input. Algorithm 16 A planning algorithm. Algorithm 17 Neural goal-conditioned algorithm. Algorithm 18 Fitted Q-iteration. Algorithm 19 Proximal Policy Optimization. |
| Open Source Code | No | The paper does not provide concrete access to source code. Although it describes implementation details in Appendix G, it gives no repository or other means of obtaining the code. |
| Open Datasets | No | The paper does not provide concrete access information for a publicly available or open dataset. Instead, the authors explicitly state: 'We explicitly construct a family of RL problems satisfying the requirements in Theorem 4.1.' |
| Dataset Splits | No | The paper describes sampling strategies for trajectories within its constructed environments, rather than traditional dataset splits from an existing dataset. For example: 'For the fitted Q-iteration and PPO methods, we sample 50 times 1000 trajectories during one training run.' and 'We sample a dataset of 1000 trajectories for the goal-conditioned method.' This is not specific dataset split information. |
| Hardware Specification | No | We measured that it takes less than two days with a single GPU to reproduce the results provided in the numerical section with only one test by experiment (pair of horizon and method). |
| Software Dependencies | No | The paper mentions 'Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023. URL https://www.gurobi.com.' as a solver, but does not specify a version number for Gurobi itself, nor does it list versions for other key software components like deep learning frameworks or Python. |
| Experiment Setup | Yes | The goal-conditioned algorithm was run with 2 hidden layers of 256 units each, parameter I = 1000, 30,000 steps of AdamW with learning rate 2×10^-2, weight decay 10^-3, and 100 batches. The fitted Q-iteration algorithm was run with 2 hidden layers of 512 units each, for K = 50 iterations, with I = 1000 trajectories sampled per iteration, ϵ = 0.2, 1000 steps of AdamW per iteration, learning rate 1×10^-3, weight decay 10^-2, and each sample of 1000 trajectories divided into 20 batches. The PPO algorithm was run with 2 hidden layers of 512 units each, for K = 50 iterations, with I = 1000 trajectories sampled per iteration, clipping parameter ϵ = 0.1, β = 10^-3, 500 steps of AdamW per iteration, learning rate 2×10^-4, weight decay 10^-8, and 10 batches of the sampled trajectories. The Alpha Zero algorithm was run with 2 layers of 256 neurons each, AdamW with learning rate 2×10^-3 and weight decay 1×10^-3. The algorithm performed the following operation three times: sampling a dataset of 1000 trajectories, then optimizing an actor-critic for 30,000 steps with AdamW. For the local search procedure, we fixed a budget of 50 search nodes per decision and tuned the c and τ parameters (see Algorithm 12). |
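For readers attempting a reproduction, the hyperparameters reported in the Experiment Setup row can be consolidated into a single configuration table. The sketch below is hypothetical: the dictionary keys, the `CONFIGS` name, and the helper function are our own illustration (the authors' code is not released), but every numeric value is taken from the reported setup.

```python
# Hypothetical consolidation of the hyperparameters reported in the paper's
# experiment setup. Key names are illustrative, not from the authors' code.
CONFIGS = {
    "goal_conditioned": {
        "hidden_layers": [256, 256],
        "I": 1000,                    # parameter I as reported
        "optimizer": "AdamW",
        "steps": 30_000,
        "learning_rate": 2e-2,
        "weight_decay": 1e-3,
        "batches": 100,
    },
    "fitted_q_iteration": {
        "hidden_layers": [512, 512],
        "K": 50,                      # outer iterations
        "I": 1000,                    # trajectories sampled per iteration
        "epsilon_greedy": 0.2,
        "optimizer": "AdamW",
        "steps_per_iteration": 1000,
        "learning_rate": 1e-3,
        "weight_decay": 1e-2,
        "batches": 20,
    },
    "ppo": {
        "hidden_layers": [512, 512],
        "K": 50,
        "I": 1000,
        "clip_epsilon": 0.1,
        "beta": 1e-3,                 # β parameter as reported
        "optimizer": "AdamW",
        "steps_per_iteration": 500,
        "learning_rate": 2e-4,
        "weight_decay": 1e-8,
        "batches": 10,
    },
    "alpha_zero": {
        "hidden_layers": [256, 256],
        "optimizer": "AdamW",
        "learning_rate": 2e-3,
        "weight_decay": 1e-3,
        "outer_rounds": 3,            # sample 1000 trajectories, then train
        "steps_per_round": 30_000,
        "search_budget": 50,          # search nodes per decision
    },
}

def total_gradient_steps(name: str) -> int:
    """Optimizer steps implied by the reported schedule for one training run."""
    c = CONFIGS[name]
    if "steps" in c:
        return c["steps"]
    if "steps_per_iteration" in c:
        return c["K"] * c["steps_per_iteration"]
    return c["outer_rounds"] * c["steps_per_round"]
```

For example, `total_gradient_steps("fitted_q_iteration")` gives 50 × 1000 = 50,000 AdamW steps per run, which is useful when budgeting the under-two-days single-GPU reproduction time mentioned above.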