Hyper: Hyperparameter Robust Efficient Exploration in Reinforcement Learning
Authors: Yiran Wang, Chenshu Liu, Yunfan Li, Sanae Amani, Bolei Zhou, Lin Yang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance and robustness of Hyper in this section, comparing against baselines with different strategies for balancing exploration and exploitation. We implement all algorithms with TD3 (Fujimoto et al., 2018) as the reinforcement learning algorithm and with Disagreement (Pathak et al., 2019) as the intrinsic reward when curiosity-driven exploration is used. Specifically, we consider the following baselines: TD3 (Fujimoto et al., 2018), Curiosity-Driven Exploration (Curiosity), and Decoupled Reinforcement Learning (Schäfer et al., 2021) (Decouple). All baselines are evaluated on environments that differ in exploration difficulty, exploitation difficulty, and function-approximation difficulty. Figure 5 depicts the performance of agents in the continuous goal-searching tasks (Fu et al., 2020) and locomotion tasks (Todorov et al., 2012), averaged over five trials; the shaded area represents the empirical standard deviation. The full results of the performance comparison and the environment setup are deferred to the appendix. We then present the performance of Hyper and Curiosity under different intrinsic coefficients β, evaluating the final performance of the Curiosity, Decouple, and Hyper agents over multiple trials for each choice of β on five environments. As shown in Figure 6, the Curiosity agent peaks at different values of β across environments. |
| Researcher Affiliation | Academia | 1Department of Electrical & Computer Engineering, University of California, Los Angeles, California, USA 2Terasaki Institute for Biomedical Innovation, Los Angeles, California, USA 3Department of Computer Science, University of California, Los Angeles, California, USA. Correspondence to: Yiran Wang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Empirically Efficient Hyper; Algorithm 2: Provably Efficient Linear-UCB-Hyper |
| Open Source Code | No | The paper does not explicitly state that the source code for their methodology is released, nor does it provide a direct link to a code repository. It mentions implementing their work based on other official implementations, but not releasing their own. |
| Open Datasets | Yes | In the goal-searching tasks (Figure 5), the agent is spawned following some initial distribution and receives zero reward until finding the fixed goal location. The goal of the agent in this series of tasks is therefore to first explore the environment and find the goal location, and then learn to exploit the task by consistently revisiting it. Two mazes of different sizes are used in the experiment: Medium Maze and Large Maze, where an optimal policy takes approximately 150 steps to reach the goal location in Medium Maze and 250 steps in Large Maze. Our experiments on continuous navigation tasks are conducted in the Point Maze domain (Todorov et al., 2012; Fu et al., 2020). For the locomotion environments (Todorov et al., 2012), the agent starts idle and the task is to control the robot to move forward as fast as possible within 1000 steps; the episode ends if the robot falls down. |
| Dataset Splits | No | The paper describes how the environments are set up, including initial spawn locations and task horizons for different difficulties (e.g., "Medium Maze-Easy", "Medium Maze-Medium"). However, it does not specify explicit training/validation/test splits for any static dataset, as is common in reinforcement learning environments where agents interact directly with the environment. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions the use of algorithms and frameworks like "TD3 (Fujimoto et al., 2018)", "Disagreement (Pathak et al., 2019)", "DQN (Mnih et al., 2013)", and "RND (Burda et al., 2018)", but does not specify software versions (e.g., Python version, specific library versions like PyTorch or TensorFlow versions). |
| Experiment Setup | Yes | In the locomotion experiments, we set the truncation probability p to 0.01 initially and decay it to 0.001, as discussed in Section 5. Table A.6.1 (hyperparameters for TD3-based algorithms): learning rate 3e-4; intrinsic-reward learning rate 1e-4; batch size 256; policy update delay 2; optimizer Adam; Q-network architecture (256, 256); actor-network architecture (256, 256); activation function ReLU. Table A.6.2 (hyperparameters for curiosity-driven exploration): β 1.0; p (0.01, 0.001); Disagreement model learning rate 1e-4; Disagreement ensemble size 5. |
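The table above reports an intrinsic coefficient β = 1.0 and a Disagreement ensemble of size 5. As a minimal sketch of how these two hyperparameters interact, the snippet below implements Disagreement-style intrinsic rewards (Pathak et al., 2019): an ensemble of forward models predicts the next state, the variance of their predictions is the intrinsic reward, and it is added to the extrinsic reward scaled by β. The ensemble members here are untrained random linear maps standing in for learned networks, and all dimensions are illustrative assumptions, not values from the paper.

```python
import numpy as np

ENSEMBLE_SIZE = 5  # Disagreement ensemble size from Table A.6.2
BETA = 1.0         # intrinsic coefficient β from Table A.6.2

rng = np.random.default_rng(0)
state_dim, action_dim = 4, 2  # illustrative toy dimensions

# Stand-in forward models: each "predicts" next_state from (state, action).
# In the actual method these would be trained neural networks.
ensemble = [rng.normal(size=(state_dim + action_dim, state_dim))
            for _ in range(ENSEMBLE_SIZE)]

def intrinsic_reward(state, action):
    """Disagreement = variance of ensemble predictions, averaged over state dims."""
    x = np.concatenate([state, action])
    preds = np.stack([x @ W for W in ensemble])  # shape: (ensemble, state_dim)
    return float(preds.var(axis=0).mean())

def shaped_reward(extrinsic, state, action, beta=BETA):
    """Curiosity-style reward mixing: r = r_ext + β · r_int."""
    return extrinsic + beta * intrinsic_reward(state, action)

s = rng.normal(size=state_dim)
a = rng.normal(size=action_dim)
r = shaped_reward(1.0, s, a)
```

The robustness question studied in Figure 6 amounts to sweeping `beta` in `shaped_reward` and observing how sensitive each agent's final return is to that choice.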
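The setup also states that the truncation probability p starts at 0.01 and decays to 0.001. The excerpt does not specify the decay shape, so the sketch below assumes a simple linear interpolation over training steps; at each environment step the episode is truncated with probability p, giving a random effective horizon of roughly 1/p steps. Function names and the schedule form are illustrative assumptions.

```python
import random

P_INIT, P_FINAL = 0.01, 0.001  # initial and final truncation probabilities from the setup

def truncation_prob(step, total_steps):
    """Assumed linear decay of p from P_INIT to P_FINAL over training."""
    frac = min(step / total_steps, 1.0)
    return P_INIT + frac * (P_FINAL - P_INIT)

def maybe_truncate(step, total_steps, rng):
    """Bernoulli truncation check applied at each environment step."""
    return rng.random() < truncation_prob(step, total_steps)

rng = random.Random(0)
p_start = truncation_prob(0, 1_000_000)    # 0.01 → expected horizon ~100 steps
p_end = truncation_prob(1_000_000, 1_000_000)  # 0.001 → expected horizon ~1000 steps
```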