Diversifying Policy Behaviors with Extrinsic Behavioral Curiosity

Authors: Zhenglin Wan, Xingrui Yu, David Mark Bossens, Yueming Lyu, Qing Guo, Flint Xiaofeng Fan, Yew-Soon Ong, Ivor Tsang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate the effectiveness of EBC in exploring diverse behaviors, we evaluate our method on multiple robot locomotion tasks. EBC improves the performance of QD-IRL instances with GAIL, VAIL, and Diff AIL across all included environments by up to 185%, 42%, and 150% respectively, even surpassing expert performance by 20% in Humanoid. Furthermore, we demonstrate that EBC is applicable to Gradient Arborescence-based Quality Diversity Reinforcement Learning (QD-RL) algorithms, where it substantially improves performance and provides a generic technique for learning behaviorally diverse policies.
Researcher Affiliation | Academia | 1. School of Data Science, The Chinese University of Hong Kong, Shenzhen, China; 2. CFAR, Agency for Science, Technology and Research, Singapore; 3. IHPC, Agency for Science, Technology and Research, Singapore; 4. College of Computing and Data Science, Nanyang Technological University (NTU), Singapore. Correspondence to: Xingrui Yu <EMAIL>.
Pseudocode | Yes | We provide the pseudo-code of the general QD-IRL procedure with PPGA in Algorithm 1; the different IRL methods differ only in the reward-model update step and in the other steps that use the reward model to compute the learned reward (highlighted in red). Please refer to Appendix B for the algorithms for updating the archive (Algorithm 2), updating the reward model, calculating the rewards with the EBC bonus, and computing the gradients for the objective and measures (Algorithm 3).
Open Source Code | Yes | The source code of this work is provided at https://github.com/vanzll/EBC.
Open Datasets | No | We use a policy archive obtained by PPGA to generate expert demonstrations. In line with a real-world scenario with limited demonstrations, we first sample the top 500 high-performance elites from the archive as a candidate pool, and then select a few demonstrations such that they are as diverse as possible. This process results in 4 diverse demonstrations (episodes) per environment. Appendix D shows the statistical properties for selected demonstrations.
Dataset Splits | No | We use a policy archive obtained by PPGA to generate expert demonstrations. In line with a real-world scenario with limited demonstrations, we first sample the top 500 high-performance elites from the archive as a candidate pool, and then select a few demonstrations such that they are as diverse as possible. This process results in 4 diverse demonstrations (episodes) per environment. Appendix D shows the statistical properties for selected demonstrations. The paper does not specify how these demonstrations or any other data were split into training, validation, or test sets.
Hardware Specification | Yes | All experiments are conducted on a system with four A40 48G GPUs, an AMD EPYC 7543P 32-core CPU, and a Linux OS. Each single experiment requires only one A40 48G GPU and takes roughly two days.
Software Dependencies | No | Our experiments are based on the PPGA implementation using the Brax simulator (Freeman et al., 2021), enhanced with QDax wrappers for measure calculation (Lim et al., 2022). We leverage pyribs (Tjanaka et al., 2023) and CleanRL's PPO (Huang et al., 2020) for implementing the PPGA algorithm. The paper mentions these software components but does not provide specific version numbers for them.
Experiment Setup | Yes | Appendix E. Hyperparameter Setting. Table 3: List of relevant hyperparameters for PPGA shared across all environments. Table 4: List of relevant hyperparameters for AIRL and GAIL shared across all environments. Table 5: List of relevant hyperparameters for VAIL shared across all environments. Table 6: List of relevant hyperparameters for GIRIL shared across all environments. Table 7: List of relevant hyperparameters for Diff AIL and Diff AIL-EBC.
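The demonstration-selection procedure quoted in the Open Datasets and Dataset Splits rows (sample the top 500 high-performance elites as a candidate pool, then pick a few maximally diverse episodes) can be sketched as a greedy farthest-point selection in behavior-measure space. This is a minimal sketch for illustration only: the function name, array shapes, and the farthest-point heuristic are assumptions, not the paper's exact selection routine.

```python
import numpy as np

def select_diverse_demos(measures, scores, pool_size=500, n_demos=4):
    """Pick `n_demos` diverse elites from the top-`pool_size` candidates.

    measures : (N, d) array of behavior measures, one row per archive elite.
    scores   : (N,) array of elite performance scores.
    Returns indices into the original archive of the selected elites.
    """
    # Candidate pool: indices of the highest-scoring elites.
    pool = np.argsort(scores)[::-1][:pool_size]
    pts = measures[pool]

    # Greedy farthest-point selection: start from the single best elite,
    # then repeatedly add the candidate whose nearest selected neighbor
    # (in measure space) is farthest away.
    chosen = [0]
    for _ in range(n_demos - 1):
        dists = np.linalg.norm(pts[:, None, :] - pts[chosen][None, :, :], axis=-1)
        nearest = dists.min(axis=1)   # distance to closest already-selected point
        nearest[chosen] = -np.inf     # never re-pick a selected candidate
        chosen.append(int(nearest.argmax()))
    return pool[chosen]
```

Farthest-point selection is one common way to operationalize "as diverse as possible" over a fixed candidate pool; the paper does not state which diversity criterion it actually used.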