Stealing That Free Lunch: Exposing the Limits of Dyna-Style Reinforcement Learning
Authors: Brett Barkley, David Fridovich-Keil
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that, while DMBRL algorithms perform well in control tasks in OpenAI Gym, their performance can drop significantly in the DeepMind Control Suite (DMC), even though these settings offer similar tasks and identical physics backends. Modern techniques designed to address several key issues that arise in these settings do not provide a consistent improvement across all environments, and overall our results show that adding synthetic rollouts to the training process, the backbone of Dyna-style algorithms, significantly degrades performance across most DMC environments. Our findings contribute to a deeper understanding of several fundamental challenges in model-based RL and show that, like many optimization fields, there is no free lunch when evaluating performance across diverse benchmarks in RL. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Texas at Austin, Austin, TX, USA 2Department of Aerospace Engineering and Engineering Mechanics, University of Texas at Austin, Austin, TX, USA. |
| Pseudocode | No | The paper describes the methodology in prose, focusing on the concepts and empirical results, without providing explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | We have released our code here: https://github.com/CLeARoboticsLab/STFL. |
| Open Datasets | Yes | This paper begins with a simple, yet novel and unexpected observation: Model-Based Policy Optimization (MBPO) (Janner et al., 2019), a popular Dyna-style (Sutton, 1991) model-based reinforcement learning (DMBRL) algorithm, demonstrates strong performance across tasks in OpenAI Gym (Brockman et al., 2016), but performs significantly worse than its base off-policy algorithm, Soft Actor Critic (SAC) (Haarnoja et al., 2019a), when trained in the DeepMind Control Suite (DMC) (Tassa et al., 2020); cf. Figure 1. |
| Dataset Splits | No | The paper describes training methodologies within reinforcement learning environments (OpenAI Gym and the DeepMind Control Suite) where data is generated dynamically through agent-environment interaction and stored in a replay buffer. It does not provide specific fixed training/test/validation dataset splits with percentages or sample counts, which are typical for static supervised learning datasets. |
| Hardware Specification | Yes | The total runtime for six seeds across six environments and 500k steps came out to approximately 4 days on an NVIDIA RTX A5000 GPU. |
| Software Dependencies | No | The paper mentions several software components like JAX, PyTorch, SAC, DDPG, the Adam optimizer, ReLU, Swish, and Layer Normalization. However, it does not provide specific version numbers for these software dependencies, which would be required for a reproducible description of the ancillary software environment. |
| Experiment Setup | Yes | Appendix E provides 'Nominal Hyperparameters for SAC, MBPO, and ALM' in three detailed tables (Table 1, Table 2, and Table 3). These tables list specific values for a wide range of hyperparameters including Discount (γ), Warmup steps, Minibatch size, Optimizer, Learning rate, Network activation functions, Number of hidden layers, Hidden units per layer, Replay buffer size, Updates per step, Target network update period, Soft update rate (τ), Ensemble retrain interval, Synthetic ratio, Model rollouts per environment step, Number of ensemble layers, Number of elite models, Number of models in ensemble, Model horizon, Max grad norm, Latent dimension, Coefficient of classifier rewards, and Exploration stddev. schedule. |