Stealing That Free Lunch: Exposing the Limits of Dyna-Style Reinforcement Learning
Authors: Brett Barkley, David Fridovich-Keil
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that, while DMBRL algorithms perform well in control tasks in OpenAI Gym, their performance can drop significantly in the DeepMind Control Suite (DMC), even though these settings offer similar tasks and identical physics backends. Modern techniques designed to address several key issues that arise in these settings do not provide a consistent improvement across all environments, and overall our results show that adding synthetic rollouts to the training process, the backbone of Dyna-style algorithms, significantly degrades performance across most DMC environments. Our findings contribute to a deeper understanding of several fundamental challenges in model-based RL and show that, like many optimization fields, there is no free lunch when evaluating performance across diverse benchmarks in RL. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Texas at Austin, Austin, TX, USA 2Department of Aerospace Engineering and Engineering Mechanics, University of Texas at Austin, Austin, TX, USA. |
| Pseudocode | No | The paper describes the methodology in prose, focusing on the concepts and empirical results, without providing explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | We have released our code here: https://github.com/CLeARoboticsLab/STFL. |
| Open Datasets | Yes | This paper begins with a simple, yet novel and unexpected observation: Model-Based Policy Optimization (MBPO) (Janner et al., 2019), a popular Dyna-style (Sutton, 1991) model-based reinforcement learning (DMBRL) algorithm, demonstrates strong performance across tasks in OpenAI Gym (Brockman et al., 2016), but performs significantly worse than its base off-policy algorithm, Soft Actor Critic (SAC) (Haarnoja et al., 2019a), when trained in the DeepMind Control Suite (DMC) (Tassa et al., 2020); cf. Figure 1. |
| Dataset Splits | No | The paper describes training methodologies within reinforcement learning environments (OpenAI Gym and the DeepMind Control Suite) where data is generated dynamically through agent-environment interaction and stored in a replay buffer. It does not provide specific fixed training/test/validation dataset splits with percentages or sample counts, which are typical for static supervised learning datasets. |
| Hardware Specification | Yes | The total runtime for six seeds across six environments and 500k steps came out to approximately 4 days on an NVIDIA RTX A5000 GPU. |
| Software Dependencies | No | The paper mentions several software components like JAX, PyTorch, SAC, DDPG, the Adam optimizer, ReLU, Swish, and Layer Normalization. However, it does not provide specific version numbers for these software dependencies, which would be required for a reproducible description of the ancillary software environment. |
| Experiment Setup | Yes | Appendix E provides 'Nominal Hyperparameters for SAC, MBPO, and ALM' in three detailed tables (Table 1, Table 2, and Table 3). These tables list specific values for a wide range of hyperparameters including Discount (γ), Warmup steps, Minibatch size, Optimizer, Learning rate, Network activation functions, Number of hidden layers, Hidden units per layer, Replay buffer size, Updates per step, Target network update period, Soft update rate (τ), Ensemble retrain interval, Synthetic ratio, Model rollouts per environment step, Number of ensemble layers, Number of elite models, Number of models in ensemble, Model horizon, Max grad norm, Latent dimension, Coefficient of classifier rewards, and Exploration stddev. schedule. |