Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning

Authors: Guozheng Ma, Lu Li, Zilin Wang, Li Shen, Pierre-Luc Bacon, Dacheng Tao

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Through a series of ablation and scaling studies we reveal a clear and definitive answer: YES! As summarized in Figure 1, sparse networks continue to show performance gains well beyond the point where their dense counterparts hit scaling limits, demonstrating superior parameter efficiency and enhanced scalability at larger model sizes. Subsequently, Section 4 delves into why introducing sparsity can break through current scaling barriers by leveraging a range of empirical metrics as diagnostic tools. Our analysis reveals that while larger model sizes tend to induce more severe optimization pathologies, appropriate network sparsity effectively counteracts these negative effects by preventing capacity and plasticity loss (Klein et al., 2024), constraining parameter growth (Lyle et al., 2024b), enhancing simplicity bias (Lee et al., 2024), and mitigating gradient interference (Lyle et al., 2023).Furthermore, in Section 5, we extend our empirical evaluation to visual RL and streaming RL, demonstrating that the benefits of network sparsity consistently generalize across diverse RL setups.
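To make concrete what "introducing network sparsity" means in this context, the sketch below applies a static random sparsity mask to a weight matrix. This is a minimal illustration of one common sparsification scheme, not necessarily the exact scheme the paper uses (which may involve structured or dynamically updated masks; see the paper for details):

```python
import numpy as np

def random_sparse_mask(shape, sparsity, rng):
    """Static random mask: a `sparsity` fraction of weights is zeroed
    once and stays zero for the whole run. Training then reuses the
    same mask in every forward and backward pass."""
    n = int(np.prod(shape))
    k = int(round(sparsity * n))
    mask = np.ones(n)
    mask[rng.choice(n, size=k, replace=False)] = 0.0
    return mask.reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))          # dense layer weights
mask = random_sparse_mask(w.shape, sparsity=0.9, rng=rng)
w_sparse = w * mask                      # ~90% of weights are zero
print((w_sparse == 0).mean())
```

The key property being tested in the paper is that, at a fixed parameter budget, a wide-but-sparse network like this scales better than a smaller dense one.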
Researcher Affiliation | Academia | 1 Nanyang Technological University, 2 Mila - Quebec AI Institute, 3 Université de Montréal, 4 University of Oxford. Correspondence to: Li Shen <EMAIL>, Dacheng Tao <EMAIL>.
Pseudocode | No | The paper describes methods and calculations (e.g., Srank, dormant ratio, gradient covariance matrices) using mathematical formulas, but it does not contain any structured pseudocode or algorithm blocks.
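One of the diagnostic metrics named above, the dormant ratio, can be sketched in a few lines. The sketch follows the standard definition from the dormant-neuron literature (a neuron is dormant when its mean absolute activation, normalized by the layer-wide mean, falls below a threshold τ); the threshold value here is an illustrative assumption, not taken from the paper:

```python
import numpy as np

def dormant_ratio(activations, tau=0.025):
    """Fraction of dormant neurons in one layer.

    activations: array of shape (batch, n_neurons), post-activation outputs.
    A neuron counts as dormant when its mean |activation|, normalized by
    the layer-wide mean, is at or below the threshold tau.
    """
    mean_abs = np.abs(activations).mean(axis=0)       # per-neuron score
    scores = mean_abs / (mean_abs.mean() + 1e-8)      # normalize by layer mean
    return float((scores <= tau).mean())

# Example: a layer where half the neurons barely fire
acts = np.concatenate([np.zeros((64, 8)), np.ones((64, 8))], axis=1)
print(dormant_ratio(acts))  # -> 0.5
```

A rising dormant ratio during training is commonly read as a sign of plasticity loss, which is why it serves as a diagnostic when comparing sparse and dense networks.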
Open Source Code | Yes | Our code is publicly available at GitHub.
Open Datasets | Yes | We conducted extensive experiments on several of the most challenging DeepMind Control (DMC) (Tassa et al., 2018) tasks... We conducted visual RL experiments on DMC using image input as the observation... We conducted streaming RL experiments on two MuJoCo robot locomotion tasks (Todorov et al., 2012)... We conducted Atari experiments on the Atari-100k benchmark (Kaiser et al., 2020).
Dataset Splits | No | The paper uses standard benchmarks such as DeepMind Control (DMC) and Atari-100k, but it does not explicitly describe training, validation, or test splits (percentages, sample counts, or references to predefined splits) for its experimental setup.
Hardware Specification | No | The paper mentions "This research is enabled in part by compute resources, software and technical help provided by Mila (mila.quebec)" in the Acknowledgment section. However, it does not specify any concrete hardware details such as GPU models, CPU types, or memory amounts used for the experiments.
Software Dependencies | No | The paper mentions various algorithms and frameworks used, such as Soft Actor-Critic (SAC), Deep Deterministic Policy Gradient (DDPG), the SimBa architecture, DrQ-v2, the Stream AC(λ) algorithm, Dopamine, Data-Efficient Rainbow (DER), and the Impala-CNN architecture. However, it does not provide specific version numbers for any of these software components or underlying libraries (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | Experimental Setup. Introducing network sparsity when scaling up model size requires careful control of multiple variables in our comparative experiments, including the width and depth of both actor and critic networks, as well as their respective sparsity levels... Complete experimental details are provided in Appendix B.1. Table 2. SAC hyperparameters. The hyperparameters listed below are used consistently across all experiments in Section 3 and Section 4. For the discount factor, we follow Lee et al. (2024) using heuristics used by TD-MPC2 (Hansen et al., 2023). Table 3. DDPG hyperparameters. The hyperparameters listed below are used consistently across all experiments in Section 3 and Section 4. For the discount factor, we follow Lee et al. (2024) using heuristics used by TD-MPC2 (Hansen et al., 2023).
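The discount-factor heuristic attributed above to TD-MPC2 scales the discount with episode length. The sketch below follows the form used in TD-MPC2's public code; the constants (denominator 5, clip range [0.95, 0.995]) are assumptions recalled from that codebase, not values confirmed by this paper, so the authors' Appendix B.1 should be treated as authoritative:

```python
def heuristic_discount(episode_length, denom=5, lo=0.95, hi=0.995):
    """Episode-length-based discount heuristic in the style of TD-MPC2.

    Longer episodes get a discount closer to 1 so that rewards far in the
    future still matter; the result is clipped to [lo, hi].
    """
    horizon = episode_length / denom
    return min(max((horizon - 1) / horizon, lo), hi)

print(heuristic_discount(1000))  # long DMC episodes -> capped at 0.995
print(heuristic_discount(100))   # short episodes -> floored at 0.95
```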