Value-Based Deep RL Scales Predictably
Authors: Oleh Rybkin, Michal Nauman, Preston Fu, Charlie Victor Snell, Pieter Abbeel, Sergey Levine, Aviral Kumar
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI Gym, and Isaac Gym, when extrapolating to higher levels of data, compute, budget, or performance. [...] We run several experiments and estimate scaling trends from the results. [...] Experimental Details |
| Researcher Affiliation | Academia | 1 UC Berkeley, 2 University of Warsaw, 3 CMU. Correspondence to: Oleh Rybkin <EMAIL>, Aviral Kumar <EMAIL>. [...] Pieter Abbeel holds concurrent appointments as a Professor at UC Berkeley and as an Amazon Scholar. This work was done at UC Berkeley and CMU, and is not associated with Amazon. |
| Pseudocode | No | The paper describes methods and equations but does not contain any clearly labeled pseudocode or algorithm blocks formatted as such. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology, nor does it include a direct link to a code repository. |
| Open Datasets | Yes | Our findings apply to algorithms such as SAC, BRO, and PQL, and domains such as the DeepMind Control Suite (DMC), OpenAI Gym, and Isaac Gym. [...] On OpenAI Gym (Brockman et al., 2016), we use Soft Actor Critic, a commonly used TD-learning algorithm (Haarnoja et al., 2018). We use DMC (Tassa et al., 2018), where we utilize the state-of-the-art Bigger, Regularized, Optimistic (BRO) algorithm (Nauman et al., 2024b). [...] Finally, we test our approach with more data on Isaac Gym (Makoviychuk et al., 2021), where we use the Parallel Q-Learning (PQL) algorithm (Li et al., 2023b). |
| Dataset Splits | No | The paper focuses on reinforcement learning where agents collect data by interacting with environments (DeepMind Control, OpenAI Gym, Isaac Gym) rather than using predefined, static dataset splits for training, validation, and testing as in supervised learning. Therefore, explicit dataset splits are not applicable or provided in the conventional sense. |
| Hardware Specification | No | The paper mentions "compute support from the Berkeley Research Compute, Polish high-performance computing infrastructure, PLGrid (HPC Center: ACK Cyfronet AGH)", but it does not specify any exact GPU/CPU models, processor types, or detailed computer specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions using specific algorithms and frameworks like "Soft Actor Critic", "BRO", "PQL", and the "SciPy package" for analysis. However, it does not provide specific version numbers for these software components or other ancillary software dependencies, which would be necessary for reproduction. |
| Experiment Setup | Yes | To understand relationships between batch size B, learning rate η, and the UTD ratio σ, we ran an extensive grid search. [...] We first run a sweep on 5 values of η, then a grid of runs with 4 values of σ and 3 values of B, and then use hyperparameter fits to run 2 more values of σ with 8 seeds per task. [...] We first run 5 values of B, 4 values of η, and 4 σ; and then use hyperparameter fits to run 2 more values of σ, with 10 seeds per task. [...] We first run 4 values of σ, 3 values of η, as well as 5 values of B, with 5 seeds per task, after which we run a second round of grid search with 7 values of σ. Further details are in Appendices B and D and Table 3. Table 3: Tested configurations (lists specific values for Updates-to-data σ, Batch size B, Learning rate η for different domains). |
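The paper estimates scaling trends from grid-search results using the SciPy package. A minimal, hypothetical sketch of that kind of analysis, fitting a power law to synthetic (compute, performance) points rather than the paper's actual data, might look like:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical power-law model J(C) = a * C^(-b), as commonly used
# when extrapolating scaling trends. Names and data are illustrative,
# not taken from the paper.
def power_law(c, a, b):
    return a * np.power(c, -b)

# Synthetic data: generated from a known power law with small noise.
rng = np.random.default_rng(0)
compute = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
true_a, true_b = 50.0, 0.3
values = power_law(compute, true_a, true_b) * rng.normal(1.0, 0.01, compute.shape)

# Fit the model parameters with SciPy's curve_fit.
params, _ = curve_fit(power_law, compute, values, p0=(1.0, 0.1))
a_hat, b_hat = params
print(f"a ≈ {a_hat:.1f}, b ≈ {b_hat:.3f}")
```

Because the data here is synthetic, the fitted parameters recover the known generating values; on real runs, the same fit would be applied to measured returns across the swept (B, η, σ) configurations.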