Hyperspherical Normalization for Scalable Deep Reinforcement Learning
Authors: Hojoon Lee, Youngdo Lee, Takuma Seno, Donghu Kim, Peter Stone, Jaegul Choo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using the soft actor-critic as a base algorithm, SimbaV2 scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks across 4 domains. The code is available at dojeon-ai.github.io/SimbaV2. 5. Experiments We now present a series of experiments designed to evaluate SimbaV2. Our investigation centers on four main setups: Optimization Analysis (Section 5.2): investigate whether SimbaV2 stabilizes the optimization process. Scaling Analysis (Section 5.3): investigate whether SimbaV2 allows scaling model capacity and computation. Comparisons (Section 5.4): compare SimbaV2 against state-of-the-art RL algorithms. Design Study (Section 5.5): conduct ablation studies on individual architectural components of SimbaV2. |
| Researcher Affiliation | Collaboration | 1KAIST 2Sony AI 3UT Austin. Correspondence to: Hojoon Lee <EMAIL>. |
| Pseudocode | Yes | Listings 1, 2 and 3 provide the Google JAX implementation of scaling vector (Section 4.4), input embedding (Section 4.1), and MLP block (Section 4.2), respectively. Listing 1. A JAX implementation of Scaler (Section 4.4) Listing 2. A JAX implementation of Input Embedding (Section 4.1). Listing 3. A JAX implementation of MLP block (Section 4.2). |
| Open Source Code | Yes | The code is available at dojeon-ai.github.io/SimbaV2. |
| Open Datasets | Yes | We evaluated SimbaV2 on four standard online RL benchmarks: MuJoCo (Todorov et al., 2012), DMC Suite (Tassa et al., 2018), MyoSuite (Caggiano et al., 2022), and HumanoidBench (Sferrazza et al., 2024); as well as the D4RL MuJoCo benchmark (Fu et al., 2020) for offline RL. |
| Dataset Splits | Yes | Results are averaged over 57 continuous control tasks from MuJoCo, DMC, MyoSuite, and HumanoidBench, each trained on 1 million samples. For offline RL, we simply add a behavioral cloning loss during training, using configurations identical to the online RL setup. Despite minimal changes, SimbaV2 performs competitively with existing baselines (Appendix D). |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It discusses 'compute' generally but lacks specific models or configurations. |
| Software Dependencies | No | Appendix B provides JAX implementations of components, but does not specify a version for JAX. Appendix C mentions 'Adam' as an optimizer but does not specify a version for Adam or any other software dependencies. |
| Experiment Setup | Yes | For all experiments, we use consistent hyperparameters across benchmarks. The default settings are listed in Table 3. Table 3. Hyperparameters Table. The hyperparameters listed below are used consistently across all tasks using SimbaV2, unless stated otherwise. For the discount factor γ, we set it automatically using heuristics used by TD-MPC2 (Hansen et al., 2023). Input: shift constant c_shift = 3.0; Output: number of return bins n_atoms = 101; ... (numerous other hyperparameters listed in Appendix C) |
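The paper's Listing 1 (referenced in the Pseudocode row above) gives a JAX implementation of a learnable scaling vector ("Scaler", Section 4.4). As a rough illustration of the idea, the sketch below shows a common parameterization for such scalers in NumPy: the parameter is stored at one magnitude (`scale`) while its effective value starts at another (`init`), decoupling the initial scale from the parameter magnitude the optimizer sees. The class and argument names here are assumptions for illustration, not the paper's actual code.

```python
import numpy as np

class Scaler:
    """Hypothetical NumPy sketch of a learnable per-dimension scaling vector.

    The paper's Listing 1 is in JAX and may parameterize this differently;
    `init` and `scale` are assumed names. The trick sketched here: store
    the learnable parameter at magnitude `scale`, then rescale by
    `init / scale` on the forward pass so the effective scale starts at
    `init` regardless of the stored parameter's magnitude.
    """

    def __init__(self, dim: int, init: float = 1.0, scale: float = 1.0):
        self.forward_scaler = init / scale    # maps stored param to effective value
        self.param = np.full(dim, scale)      # learnable parameter (trained by the optimizer)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Elementwise scaling; at initialization this multiplies x by `init`.
        return x * (self.param * self.forward_scaler)

# Usage: with init=0.5 and scale=2.0, the effective scale at init is 0.5.
scaler = Scaler(dim=4, init=0.5, scale=2.0)
out = scaler(np.ones(4))
```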