Marginal Benefit Driven RL Teacher for Unsupervised Environment Design
Authors: Dexun Li, Wenjun Li, Pradeep Varakantham
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we provide detailed experimental results and ablation analysis to showcase the effectiveness of our methods. We obtain SOTA results among RL-based environment generation methods. We demonstrate the effectiveness of our proposed methods MBeED and MBeDED through extensive experiments on a wide range of benchmark problems from the literature. We compare our approach against existing UED methods: Domain Randomization (DR) (Tobin et al. 2017), Minimax (Wang et al. 2019), PAIRED (Dennis et al. 2020), REPAIRED (Jiang et al. 2021). |
| Researcher Affiliation | Academia | Singapore Management University EMAIL |
| Pseudocode | Yes | Algorithm 1 provides the pseudocode of these algorithms, and Figure 1 illustrates the overall framework. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of its own source code, nor does it provide a direct link to a code repository for the methodology described. It cites 'Kostrikov, I. 2018. PyTorch Implementations of Reinforcement Learning Algorithms. https://github.com/ikostrikov/ pytorch-a2c-ppo-acktr-gail.' but this refers to a third-party tool, not the authors' specific implementation. |
| Open Datasets | Yes | We first evaluate our approach on the Bipedal Walker environment. This environment entails continuous control with dense rewards. Similar to (Wang et al. 2019), we use a modified version of Bipedal Walker Hardcore from OpenAI Gym (Brockman et al. 2016). In Bipedal Walker, there are 8 parameters that indirectly represent the intensity of four kinds of terrain-based obstacles for a two-legged robot: the minimum/maximum roughness of the ground, the minimum/maximum height of stump obstacles, the minimum/maximum width of pit gap obstacles, and the minimum/maximum size of ascending and descending flights of stairs. We investigate the maze navigation environment, which is based on Minigrid (Chevalier-Boisvert, Willems, and Pal 2018). We train the environment generator to learn how to build maze environments by choosing the location of the obstacles, the goals, and the starting location of the agent. In the Car Racing environment, we design each track as a closed loop for the student agent to drive around, with the goal of completing a full lap. To enhance the expressiveness of the original Car Racing environment, we reparametrize the tracks using Bézier curves. |
| Dataset Splits | No | The paper describes using specific environments for training and others for testing (e.g., 'zero-shot OOD test performance', 'OOD F1 tracks'), and implicitly mentions 'validation environments' in the context of marginal benefit. For the Bipedal Walker, it lists 'a vanilla Bipedal Walker, a challenging Bipedal Walker-Hardcore environment, and four specific levels in the context of isolated challenges'. However, it does not provide explicit dataset split percentages (e.g., 80/10/10) or specific sample counts for training, validation, and test sets in the traditional sense of splitting a single dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. It only mentions that 'All agents are trained using Proximal Policy Optimization (PPO) (Schulman et al. 2017).' |
| Software Dependencies | No | The paper mentions 'Proximal Policy Optimization (PPO) (Schulman et al. 2017)' as the algorithm used for training agents and references 'PyTorch Implementations of Reinforcement Learning Algorithms' by Kostrikov (2018). However, it does not specify version numbers for PyTorch, Python, CUDA, or any other critical software libraries or environments used in their own implementation. |
| Experiment Setup | No | The paper describes the general methodology and training process, such as using PPO and a Recurrent Neural Network structure for partially observable settings. It mentions a hyperparameter 'ρ ∈ [0, 1]' but does not provide its specific value. While it explains the logic for teacher and student training, it lacks concrete values for key hyperparameters like learning rates, batch sizes, number of epochs/timesteps for training, or other system-level configuration settings in the main text. |
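The Open Datasets row above describes the Bipedal Walker level space as 8 parameters: min/max pairs for ground roughness, stump height, pit-gap width, and stair size. As a hedged illustration of how a Domain Randomization teacher (one of the baselines listed in the Research Type row) might sample a level from such a space, here is a minimal sketch. The parameter names and ranges are assumptions for illustration only; the paper does not specify them.

```python
import random

# Hypothetical ranges for the four obstacle types, each contributing a
# (min, max) pair — 8 parameters total. Actual ranges are NOT given in
# the paper; these placeholders are assumptions for illustration.
PARAM_RANGES = {
    "roughness": (0.0, 5.0),
    "stump_height": (0.0, 3.0),
    "pit_gap_width": (0.0, 5.0),
    "stair_size": (0.0, 5.0),
}

def sample_dr_level(rng=random):
    """Sample one level the way a Domain Randomization teacher would:
    draw two values per obstacle type and sort them so min <= max."""
    level = {}
    for name, (lo, hi) in PARAM_RANGES.items():
        a, b = rng.uniform(lo, hi), rng.uniform(lo, hi)
        level[f"min_{name}"], level[f"max_{name}"] = sorted((a, b))
    return level

level = sample_dr_level()
assert len(level) == 8  # 4 obstacle types x (min, max)
```

A DR baseline would pass such a sampled parameter dictionary to the level constructor each episode; teacher methods like MBeED instead learn which regions of this space to generate.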