HyPoGen: Optimization-Biased Hypernetworks for Generalizable Policy Generation

Authors: Hanxiang Ren, Li Sun, Xulong Wang, Pei Zhou, Zewen Wu, Siyan Dong, Difan Zou, Youyi Zheng, Yanchao Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on locomotion and manipulation benchmarks show that HyPoGen significantly outperforms state-of-the-art methods in generating policies for unseen target tasks without any demonstrations, achieving higher success rates and underscoring the potential of optimization-biased hypernetworks in advancing generalizable policy generation.
Researcher Affiliation | Collaboration | Hanxiang Ren (ZJU & HKU), Li Sun (HKU), Xulong Wang (ZJU), Pei Zhou (HKU), Zewen Wu (HKU & Transcengram), Siyan Dong (HKU), Difan Zou (HKU), Youyi Zheng (ZJU), Yanchao Yang (HKU). Co-first authors as marked in the paper (EMAIL, EMAIL).
Pseudocode | No | The paper describes the methodology using mathematical equations and prose but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and data are available at: https://github.com/ReNginx/HyPoGen. We have made significant efforts to ensure the reproducibility of our work. The code, datasets, and instructions necessary to replicate our experiments are publicly available at https://github.com/ReNginx/HyPoGen.
Open Datasets | Yes | We leverage two sets of tasks for evaluation. The first set is derived from MuJoCo environments... The second set of tasks is sourced from the ManiSkill environment... For MuJoCo environments, we pre-train a TD3 (Fujimoto et al., 2018) agent on each specification separately for 1 x 10^6 steps as the expert. For each specification, 10 trajectories of length 1000 are collected as demonstrations. For ManiSkill environments, we pre-train a PPO (Schulman et al., 2017) agent for 4 x 10^6 steps and collect 1000 successful trajectories for each specification.
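The demonstration-collection protocol above (rolling out a pretrained expert, e.g. TD3, for a fixed number of trajectories per specification) can be sketched as follows. This is a minimal illustration, not the authors' code: `collect_trajectories`, the Gymnasium-style `env`, and the `policy` callable are all hypothetical names assumed here.

```python
def collect_trajectories(env, policy, n_trajs, max_len=1000):
    """Roll out a pretrained expert policy to collect demonstrations.

    Sketch of the paper's setup: e.g. 10 trajectories of length 1000
    per MuJoCo specification from a TD3 expert. `env` is assumed to
    follow the Gymnasium API; `policy` maps observation -> action.
    """
    trajs = []
    for _ in range(n_trajs):
        obs, _ = env.reset()
        traj = []
        for _ in range(max_len):
            action = policy(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            traj.append((obs, action, reward))
            if terminated or truncated:
                break  # episode ended before max_len steps
        trajs.append(traj)
    return trajs
```

For the ManiSkill setting, the same loop would additionally filter for successful episodes until 1000 successes are collected per specification.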
Dataset Splits | Yes | For MuJoCo environments... We randomly split the specifications into training and test sets using a 20% to 80% ratio. At test time, we evaluate performance by calculating the average reward of rolled-out policies. The training/test splitting process is repeated five times, and the average is calculated to reduce the effects of randomness. For ManiSkill environments... We uniformly sample 30% of the data for training and use the remaining 70% for testing.
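The repeated-split protocol described above (20% of specifications for training, 80% for testing, repeated five times) can be sketched as below. This is an illustrative reconstruction under stated assumptions; `split_specs` and its parameters are hypothetical names, not the authors' implementation.

```python
import random

def split_specs(specs, train_ratio=0.2, n_repeats=5, seed=0):
    """Repeat a random train/test split of task specifications.

    Mirrors the reported MuJoCo protocol: 20% train / 80% test,
    repeated five times so results can be averaged over splits.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        shuffled = specs[:]          # copy so the input stays untouched
        rng.shuffle(shuffled)
        n_train = max(1, int(len(shuffled) * train_ratio))
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits

splits = split_specs([f"spec_{i}" for i in range(20)])
```

The ManiSkill setting would use `train_ratio=0.3` with a single split, per the quoted text.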
Hardware Specification | Yes | We train our model on an NVIDIA 4090 GPU with a batch size of 512 for 2000 epochs (approximately 11 hours); we use the same hyperparameters in ManiSkill except for training 1000 epochs (approximately 10 hours).
Software Dependencies | No | We implemented our method in PyTorch (Paszke et al., 2019), and the hyperparameters are reported in Tab. 10, Tab. 11, and Tab. 12.
Experiment Setup | Yes | Implementation details of HyPoGen. ... We use K = 8 hypernet blocks and apply the Adam (Kingma & Ba, 2015) optimizer with learning rate 1e-4. We train our model on an NVIDIA 4090 GPU with a batch size of 512 for 2000 epochs (approximately 11 hours); we use the same hyperparameters in ManiSkill except for training 1000 epochs (approximately 10 hours). We detail our hyperparameters in Sec. B.2.
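The reported training setup can be summarized in one configuration object. Only the values quoted above come from the paper; the `TrainConfig` class and its field names are illustrative, not taken from the authors' repository.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Hyperparameters as reported in the paper; field names are ours."""
    num_hypernet_blocks: int = 8      # K = 8 hypernet blocks
    optimizer: str = "adam"           # Adam (Kingma & Ba, 2015)
    learning_rate: float = 1e-4
    batch_size: int = 512
    epochs_mujoco: int = 2000         # ~11 h on an NVIDIA 4090
    epochs_maniskill: int = 1000      # ~10 h, otherwise same settings

cfg = TrainConfig()
```

Every other hyperparameter is deferred to Sec. B.2 and Tabs. 10-12 of the paper, so nothing beyond the quoted values is filled in here.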