Stealing That Free Lunch: Exposing the Limits of Dyna-Style Reinforcement Learning

Authors: Brett Barkley, David Fridovich-Keil

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We show that, while DMBRL algorithms perform well in control tasks in OpenAI Gym, their performance can drop significantly in DeepMind Control Suite (DMC), even though these settings offer similar tasks and identical physics backends. Modern techniques designed to address several key issues that arise in these settings do not provide a consistent improvement across all environments, and overall our results show that adding synthetic rollouts to the training process, the backbone of Dyna-style algorithms, significantly degrades performance across most DMC environments. Our findings contribute to a deeper understanding of several fundamental challenges in model-based RL and show that, as in many optimization fields, there is no free lunch when evaluating performance across diverse benchmarks in RL.
Researcher Affiliation Academia 1: Department of Computer Science, University of Texas at Austin, Austin, TX, USA; 2: Department of Aerospace Engineering and Engineering Mechanics, University of Texas at Austin, Austin, TX, USA.
Pseudocode No The paper describes the methodology in prose, focusing on the concepts and empirical results, without providing explicit pseudocode or algorithm blocks.
Open Source Code Yes We have released our code here: https://github.com/CLeARoboticsLab/STFL.
Open Datasets Yes This paper begins with a simple, yet novel and unexpected observation: Model-Based Policy Optimization (MBPO) (Janner et al., 2019), a popular Dyna-style (Sutton, 1991) model-based reinforcement learning (DMBRL) algorithm, demonstrates strong performance across tasks in OpenAI Gym (Brockman et al., 2016), but performs significantly worse than its base off-policy algorithm, Soft Actor-Critic (SAC) (Haarnoja et al., 2019a), when trained in DeepMind Control Suite (DMC) (Tassa et al., 2020); cf. Figure 1.
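For context, the Dyna-style mechanism MBPO builds on can be sketched as follows. This is a minimal, illustrative loop with toy stand-in classes (all names hypothetical, not the authors' implementation): real transitions fill a replay buffer, a learned model branches short synthetic rollouts from previously visited real states, and the agent updates on a mix of synthetic and real data.

```python
import random

# Toy stand-ins (hypothetical, not the authors' code): a 1-D environment,
# a learned dynamics model that imitates it with small error, and an agent
# with act/update hooks standing in for an off-policy learner such as SAC.
class ToyEnv:
    def reset(self):
        self.s = 0.0
        return self.s

    def step(self, a):
        self.s += a
        return self.s, -abs(self.s), abs(self.s) > 10.0  # state, reward, done

class ToyModel:
    def step(self, s, a):
        s_next = s + a + random.gauss(0.0, 0.01)  # imperfect learned dynamics
        return s_next, -abs(s_next)

class ToyAgent:
    def act(self, s):
        return -0.1 if s > 0 else 0.1  # crude proportional controller

    def update(self, batch):
        pass  # gradient step omitted in this sketch

class DynaStyleTrainer:
    """Dyna-style loop: store real transitions, branch short synthetic
    rollouts from real states, and train on a mix of both buffers."""

    def __init__(self, env, model, agent, synthetic_ratio=0.95, horizon=1):
        self.env, self.model, self.agent = env, model, agent
        self.synthetic_ratio = synthetic_ratio  # fraction of synthetic samples
        self.horizon = horizon                  # length of model rollouts
        self.real_buffer, self.synthetic_buffer = [], []

    def rollout_model(self):
        # Branch a short synthetic trajectory from a seen real state.
        s = random.choice(self.real_buffer)[0]
        for _ in range(self.horizon):
            a = self.agent.act(s)
            s_next, r = self.model.step(s, a)
            self.synthetic_buffer.append((s, a, r, s_next))
            s = s_next

    def train_step(self, s):
        # 1. Collect one real transition.
        a = self.agent.act(s)
        s_next, r, done = self.env.step(a)
        self.real_buffer.append((s, a, r, s_next))
        # 2. Generate synthetic data with the learned model.
        self.rollout_model()
        # 3. Update the agent, sampling mostly synthetic transitions.
        if self.synthetic_buffer and random.random() < self.synthetic_ratio:
            batch = random.sample(self.synthetic_buffer, 1)
        else:
            batch = random.sample(self.real_buffer, 1)
        self.agent.update(batch)
        return s_next, done
```

The paper's central finding is that step 2, the backbone of this family of algorithms, is precisely what degrades performance in most DMC environments.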
Dataset Splits No The paper describes training methodologies within reinforcement learning environments (OpenAI Gym and DeepMind Control Suite) where data is generated dynamically through agent-environment interaction and stored in a replay buffer. It does not provide fixed training/test/validation dataset splits with percentages or sample counts, as would be typical for static supervised learning datasets.
Hardware Specification Yes The total runtime for six seeds across six environments and 500k steps came out to approximately 4 days on an NVIDIA RTX A5000 GPU.
Software Dependencies No The paper mentions several software components, including JAX, PyTorch, SAC, DDPG, the Adam optimizer, ReLU, Swish, and Layer Normalization. However, it does not provide version numbers for these dependencies, which would be required for a reproducible description of the software environment.
Experiment Setup Yes Appendix E provides 'Nominal Hyperparameters for SAC, MBPO, and ALM' in three detailed tables (Table 1, Table 2, and Table 3). These tables list specific values for a wide range of hyperparameters including Discount (γ), Warmup steps, Minibatch size, Optimizer, Learning rate, Network activation functions, Number of hidden layers, Hidden units per layer, Replay buffer size, Updates per step, Target network update period, Soft update rate (τ), Ensemble retrain interval, Synthetic ratio, Model rollouts per environment step, Number of ensemble layers, Number of elite models, Number of models in ensemble, Model horizon, Max grad norm, Latent dimension, Coefficient of classifier rewards, and Exploration stddev. schedule.
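The hyperparameter categories listed in those tables could be captured, for example, in a single config object. The sketch below uses a hypothetical dataclass with placeholder values chosen for illustration only; the actual settings are in the paper's Appendix E (Tables 1-3).

```python
from dataclasses import dataclass

@dataclass
class MBPOConfig:
    """Illustrative grouping of MBPO-style hyperparameters.
    All default values are placeholders, NOT the paper's settings."""
    # Base off-policy (SAC) hyperparameters
    discount: float = 0.99            # gamma
    warmup_steps: int = 5_000
    minibatch_size: int = 256
    learning_rate: float = 3e-4
    replay_buffer_size: int = 1_000_000
    updates_per_step: int = 20
    soft_update_rate: float = 0.005   # tau
    # Model-ensemble hyperparameters
    ensemble_size: int = 7            # number of models in ensemble
    num_elite_models: int = 5
    ensemble_retrain_interval: int = 250
    # Dyna-style rollout hyperparameters
    model_horizon: int = 1            # length of synthetic rollouts
    synthetic_ratio: float = 0.95     # fraction of synthetic samples per batch
    model_rollouts_per_env_step: int = 400
```

Grouping the settings this way makes the three roles in the tables explicit: the base learner, the dynamics-model ensemble, and the synthetic-rollout schedule.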