Marginal Benefit Driven RL Teacher for Unsupervised Environment Design
Authors: Dexun Li, Wenjun Li, Pradeep Varakantham
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we provide detailed experimental results and ablation analysis to showcase the effectiveness of our methods. We obtain SOTA results among RL-based environment generation methods. We demonstrate the effectiveness of our proposed methods MBeED and MBeDED through extensive experiments on a wide range of benchmark problems from the literature. We compare our approach against existing UED methods: Domain Randomization (DR) (Tobin et al. 2017), Minimax (Wang et al. 2019), PAIRED (Dennis et al. 2020), REPAIRED (Jiang et al. 2021). |
| Researcher Affiliation | Academia | Singapore Management University EMAIL |
| Pseudocode | Yes | Algorithm 1 provides the pseudocode of these algorithms, and Figure 1 illustrates the overall framework. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of its own source code, nor does it provide a direct link to a code repository for the methodology described. It cites 'Kostrikov, I. 2018. PyTorch Implementations of Reinforcement Learning Algorithms. https://github.com/ikostrikov/ pytorch-a2c-ppo-acktr-gail.' but this refers to a third-party tool, not the authors' specific implementation. |
| Open Datasets | Yes | We first evaluate our approach on the Bipedal Walker environment. This environment entails continuous control with dense rewards. Similar to (Wang et al. 2019), we use a modified version of Bipedal Walker Hardcore from OpenAI Gym (Brockman et al. 2016). In Bipedal Walker, there are 8 parameters that indirectly represent the intensity of four kinds of terrain-based obstacles for a two-legged robot: the minimum/maximum roughness of the ground, the minimum/maximum height of stump obstacles, the minimum/maximum width of pit gap obstacles, and the minimum/maximum size of ascending and descending flights of stairs. We investigate the maze navigation environment, which is based on Minigrid (Chevalier-Boisvert, Willems, and Pal 2018). We train the environment generator to learn how to build maze environments by choosing the location of the obstacles, the goals, and the starting location of the agent. In the Car Racing environment, we design each track as a closed loop for the student agent to drive around, with the goal of completing a full lap. To enhance the expressiveness of the original Car Racing environment, we reparametrize the tracks using Bézier curves. |
| Dataset Splits | No | The paper describes using specific environments for training and others for testing (e.g., 'zero-shot OOD test performance', 'OOD F1 tracks'), and implicitly mentions 'validation environments' in the context of marginal benefit. For the Bipedal Walker, it lists 'a vanilla Bipedal Walker, a challenging Bipedal Walker-Hardcore environment, and four specific levels in the context of isolated challenges'. However, it does not provide explicit dataset split percentages (e.g., 80/10/10) or specific sample counts for training, validation, and test sets in the traditional sense of splitting a single dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. It only mentions that 'All agents are trained using Proximal Policy Optimization (PPO) (Schulman et al. 2017).' |
| Software Dependencies | No | The paper mentions 'Proximal Policy Optimization (PPO) (Schulman et al. 2017)' as the algorithm used for training agents and references 'PyTorch Implementations of Reinforcement Learning Algorithms' by Kostrikov (2018). However, it does not specify version numbers for PyTorch, Python, CUDA, or any other critical software libraries or environments used in their own implementation. |
| Experiment Setup | No | The paper describes the general methodology and training process, such as using PPO and a Recurrent Neural Network structure for partially observable settings. It mentions a hyperparameter 'ρ ∈ [0, 1]' but does not provide its specific value. While it explains the logic for teacher and student training, it lacks concrete values for key hyperparameters like learning rates, batch sizes, number of epochs/timesteps for training, or other system-level configuration settings in the main text. |
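The Open Datasets row above describes the Bipedal Walker level space as 8 parameters: min/max pairs for ground roughness, stump height, pit-gap width, and stair size. As a hedged illustration of how a Domain Randomization teacher (one of the baselines listed in the Research Type row) might sample a level from such a space, here is a minimal sketch. The parameter names and ranges are assumptions for illustration only; the paper does not specify them.

```python
import random

# Hypothetical ranges for the four obstacle types, each contributing a
# (min, max) pair — 8 parameters total. Actual ranges are NOT given in
# the paper; these placeholders are assumptions for illustration.
PARAM_RANGES = {
    "roughness": (0.0, 5.0),
    "stump_height": (0.0, 3.0),
    "pit_gap_width": (0.0, 5.0),
    "stair_size": (0.0, 5.0),
}

def sample_dr_level(rng=random):
    """Sample one level the way a Domain Randomization teacher would:
    draw two values per obstacle type and sort them so min <= max."""
    level = {}
    for name, (lo, hi) in PARAM_RANGES.items():
        a, b = rng.uniform(lo, hi), rng.uniform(lo, hi)
        level[f"min_{name}"], level[f"max_{name}"] = sorted((a, b))
    return level

level = sample_dr_level()
assert len(level) == 8  # 4 obstacle types x (min, max)
```

A DR baseline would pass such a sampled parameter dictionary to the level constructor each episode; teacher methods like MBeED instead learn which regions of this space to generate.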