Behavioral Entropy-Guided Dataset Generation for Offline Reinforcement Learning

Authors: Wesley Suttle, Aamodh Suresh, Carlos Nieto-Granda

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally compare the performance of offline RL algorithms for a variety of downstream tasks on datasets generated using BE, Rényi, and Shannon entropy-maximizing policies, as well as the SMM and RND algorithms. We find that offline RL algorithms trained on datasets collected using BE outperform those trained on datasets collected using Shannon entropy, SMM, and RND on all tasks considered, and on 80% of the tasks compared to datasets collected using Rényi entropy.
Researcher Affiliation | Academia | Wesley A. Suttle, Aamodh Suresh, Carlos Nieto-Granda, U.S. Army Research Laboratory, Adelphi, MD 20783, USA. EMAIL, EMAIL, EMAIL
Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the main text of the paper. The methodology is described through mathematical derivations and textual explanations.
Open Source Code | No | The paper states it used the "Unsupervised Reinforcement Learning Benchmark (URLB) framework (Laskin et al., 2021)" and the "Exploratory Data for Offline RL (ExORL) framework (Yarats et al., 2022)" (with footnotes linking to GitHub repositories). However, it does not explicitly state that the authors' own implementation of the behavioral entropy methodology described in this paper is open-sourced or available.
Open Datasets | Yes | "Using standard MuJoCo environments, we experimentally compare the performance of offline RL algorithms for a variety of downstream tasks on datasets generated using BE, Rényi, and Shannon entropy-maximizing policies, as well as the SMM and RND algorithms."
Dataset Splits | No | The paper mentions generating datasets of specific sizes, such as "500K elements", comparing to "10M-element datasets", and states "we performed just 100K offline training steps". However, it does not provide explicit training, validation, or test splits for these datasets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. It mentions using "standard MuJoCo environments" but not the hardware on which these environments were simulated or the experiments conducted.
Software Dependencies | No | The paper mentions using the "Unsupervised Reinforcement Learning Benchmark (URLB) framework (Laskin et al., 2021)" and the "Exploratory Data for Offline RL (ExORL) framework (Yarats et al., 2022)". However, it does not specify version numbers for these frameworks or any other software libraries (e.g., Python, PyTorch, TensorFlow, CUDA) used in the experiments.
Experiment Setup | Yes | "For our experiments, we generated BE, RE, SE, RND, and SMM datasets for the Walker and Quadruped environments using the Unsupervised Reinforcement Learning Benchmark (URLB) framework (Laskin et al., 2021). We subsequently generated t-SNE plots (Hinton & Roweis, 2002) and PHATE plots (Moon et al., 2019) from the BE, RE, SE, RND, and SMM datasets to visualize their varying state space coverage. Finally, we performed offline RL training on all datasets using the Exploratory Data for Offline RL (ExORL) framework (Yarats et al., 2022)." See Table 2 (data generation hyperparameters) and Table 3 (offline RL hyperparameters).
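For readers unfamiliar with the entropy objectives named above, the following is a minimal sketch of the Shannon and Rényi entropies of an empirical state-visitation distribution. This is illustrative only: the paper's behavioral entropy (BE) generalizes these via probability-weighting functions and is not reproduced here, and the function names and the toy distribution are our own.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H(p) = -sum_i p_i * log(p_i), natural log."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # convention: 0 * log(0) contributes 0
    return float(-np.sum(p * np.log(p)))

def renyi_entropy(p, alpha):
    """Renyi entropy H_a(p) = log(sum_i p_i**a) / (1 - a), for a > 0, a != 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

# Toy empirical visitation distribution over 4 discretized states
p = np.array([0.4, 0.3, 0.2, 0.1])
print(shannon_entropy(p))      # below log(4), since p is non-uniform
print(renyi_entropy(p, 0.5))   # alpha < 1 weights rarely visited states more
```

Both quantities are maximized by the uniform distribution (value log n over n states), and the Rényi entropy recovers the Shannon entropy in the limit alpha → 1, which is why these objectives all reward broad state-space coverage when maximized by an exploration policy.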