Hyper: Hyperparameter Robust Efficient Exploration in Reinforcement Learning

Authors: Yiran Wang, Chenshu Liu, Yunfan Li, Sanae Amani, Bolei Zhou, Lin Yang

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate the performance and robustness of Hyper in this section, comparing against baselines with different strategies for balancing exploration and exploitation. We implement all algorithms with TD3 (Fujimoto et al., 2018) as the reinforcement learning algorithm and with Disagreement (Pathak et al., 2019) as the intrinsic reward when curiosity-driven exploration is used. Specifically, we consider the following baselines: TD3 (Fujimoto et al., 2018), Curiosity-Driven Exploration (Curiosity), and Decoupled Reinforcement Learning (Schäfer et al., 2021) (Decouple). All baselines are evaluated on environments that differ in exploration difficulty, exploitation difficulty, and function approximation difficulty. Figure 5 depicts the performance of agents in the continuous goal-searching tasks (Fu et al., 2020) and locomotion tasks (Todorov et al., 2012), averaged over five trials; the shaded area represents the empirical standard deviation. The full performance comparison and environment setup are deferred to the appendix. We now present the performance of Hyper and Curiosity with different intrinsic coefficients β. We evaluate the final performance of the Curiosity, Decouple, and Hyper agents over multiple trials for each choice of β on five environments. As shown in Figure 6, the Curiosity agent attains its peak performance at different values of β depending on the environment.
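The passage above describes training TD3 on a reward that adds a Disagreement-style intrinsic bonus scaled by the coefficient β. A minimal sketch of that combination follows; the ensemble of linear "dynamics models", its size, and all dimensions are illustrative stand-ins, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def disagreement_intrinsic_reward(ensemble, state, action):
    """Intrinsic reward as the prediction variance of a dynamics-model
    ensemble (Pathak et al., 2019): high disagreement ~ unfamiliar region."""
    preds = np.stack([model(state, action) for model in ensemble])  # (E, state_dim)
    return preds.var(axis=0).mean()

# Illustrative "models": random linear next-state predictors.
state_dim, action_dim, ensemble_size = 4, 2, 5
ensemble = [
    (lambda W: (lambda s, a: W @ np.concatenate([s, a])))(
        rng.normal(size=(state_dim, state_dim + action_dim))
    )
    for _ in range(ensemble_size)
]

beta = 1.0  # intrinsic coefficient; the quantity swept in Figure 6
s, a = rng.normal(size=state_dim), rng.normal(size=action_dim)
r_ext = 0.0  # sparse goal-searching tasks give zero reward before the goal
r_int = disagreement_intrinsic_reward(ensemble, s, a)
r_total = r_ext + beta * r_int  # reward the TD3 critic would be trained on
```

The sensitivity shown in Figure 6 comes from this single scalar β: too small and the sparse extrinsic signal is never discovered, too large and the agent keeps chasing model disagreement after the goal is found.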
Researcher Affiliation Academia (1) Department of Electrical & Computer Engineering, University of California, Los Angeles, California, USA; (2) Terasaki Institute for Biomedical Innovation, Los Angeles, California, USA; (3) Department of Computer Science, University of California, Los Angeles, California, USA. Correspondence to: Yiran Wang <EMAIL>.
Pseudocode Yes Algorithm 1 (Empirically Efficient Hyper); Algorithm 2 (Provably Efficient Linear-UCB-Hyper)
Open Source Code No The paper does not state that the source code for its methodology is released, nor does it provide a link to a code repository. It mentions building on the official implementations of prior methods, but not releasing its own code.
Open Datasets Yes In the goal-searching tasks (Figure 5), the agent is spawned from some initial distribution and receives zero reward until it finds the fixed goal location. The goal of the agent in this series of tasks is therefore to first explore the environment and find the goal location, and then learn to exploit the task by consistently revisiting it. Two mazes of different sizes are used in the experiment: Medium Maze and Large Maze, where an optimal policy takes approximately 150 steps to reach the goal location in the Medium Maze and 250 steps in the Large Maze. Our experiments on continuous navigation tasks are conducted in the Point Maze domain (Todorov et al., 2012; Fu et al., 2020). In the locomotion environments (Todorov et al., 2012), the agent starts idle, and the task is to control the robot to move forward as fast as possible within 1000 steps; the episode ends if the robot falls down.
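The goal-searching setup quoted above, zero reward everywhere until the agent reaches a fixed goal, can be sketched as a minimal sparse-reward environment. This toy class only illustrates the reward structure; the goal position, radius, and horizon are made-up values, not the Point Maze specification:

```python
import numpy as np

class SparseGoalMaze:
    """Toy 2-D goal-searching task: zero reward everywhere except within
    a small radius of a fixed goal, mimicking the Medium/Large Maze setup."""

    def __init__(self, goal=(8.0, 8.0), goal_radius=0.5, horizon=300):
        self.goal = np.asarray(goal)
        self.goal_radius = goal_radius
        self.horizon = horizon

    def reset(self, seed=None):
        self.rng = np.random.default_rng(seed)
        self.pos = self.rng.uniform(0.0, 1.0, size=2)  # initial spawn distribution
        self.t = 0
        return self.pos.copy()

    def step(self, action):
        self.pos = self.pos + np.clip(action, -1.0, 1.0)
        self.t += 1
        reached = np.linalg.norm(self.pos - self.goal) < self.goal_radius
        reward = 1.0 if reached else 0.0  # sparse: zero until the goal is found
        done = bool(reached) or self.t >= self.horizon
        return self.pos.copy(), reward, done

env = SparseGoalMaze()
obs = env.reset(seed=0)
obs, reward, done = env.step(np.array([1.0, 1.0]))
```

Because every pre-goal transition returns zero reward, a purely extrinsic learner gets no gradient signal here, which is why the intrinsic-reward baselines above matter in these tasks.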
Dataset Splits No The paper describes how the environments are set up, including initial spawn locations and task horizons for different difficulties (e.g., "Medium Maze-Easy", "Medium Maze-Medium"). However, it does not specify explicit training/validation/test splits for any static dataset; such splits are typically not applicable in reinforcement learning, where agents interact directly with the environment.
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies No The paper mentions the use of algorithms and frameworks like "TD3 (Fujimoto et al., 2018)", "Disagreement (Pathak et al., 2019)", "DQN (Mnih et al., 2013)", and "RND (Burda et al., 2018)", but does not specify software versions (e.g., Python version, specific library versions like PyTorch or TensorFlow versions).
Experiment Setup Yes In the locomotion experiments, we set the truncation probability p to 0.01 initially and decay it to 0.001, as discussed in Section 5. Table A.6.1 (Hyperparameters for TD3-based algorithms): Learning Rate 3e-4; Intrinsic Reward Learning Rate 1e-4; Batch Size 256; Policy Update Delay 2; Optimizer Adam; Q-Network Architecture (256, 256); Actor-Network Architecture (256, 256); Activation Function ReLU. Table A.6.2 (Hyperparameters for Curiosity-Driven Exploration): β 1.0; p (0.01, 0.001); Learning Rate of Disagreement Model 1e-4; Disagreement Ensemble Size 5.
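The hyperparameters quoted from Tables A.6.1 and A.6.2 can be collected into a single reproducible config. The values are from the tables; the key names are my own and not taken from the authors' code:

```python
# Values from Table A.6.1 (TD3-based algorithms); key names are illustrative.
td3_config = {
    "learning_rate": 3e-4,
    "intrinsic_reward_learning_rate": 1e-4,
    "batch_size": 256,
    "policy_update_delay": 2,
    "optimizer": "Adam",
    "q_network_architecture": (256, 256),
    "actor_network_architecture": (256, 256),
    "activation": "ReLU",
}

# Values from Table A.6.2 (curiosity-driven exploration).
curiosity_config = {
    "beta": 1.0,                     # intrinsic reward coefficient
    "truncation_p": (0.01, 0.001),   # initial value, decayed value (Section 5)
    "disagreement_learning_rate": 1e-4,
    "disagreement_ensemble_size": 5,
}
```

A config like this would partly substitute for the missing software-dependency and code-release information noted above, since it pins every reported training knob in one place.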