Robust Gymnasium: A Unified Modular Benchmark for Robust Reinforcement Learning
Authors: Shangding Gu, Laixi Shi, Muning Wen, Ming Jin, Eric Mazumdar, Yuejie Chi, Adam Wierman, Costas Spanos
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a comprehensive evaluation of several state-of-the-art (SOTA) baselines from standard RL, robust RL, safe RL, and multi-agent RL using representative tasks in Robust-Gymnasium. Our findings reveal that current algorithms often fall short of expectations in challenging tasks, even under single-stage disruptions, highlighting the need for new robust RL approaches. Furthermore, our experiments demonstrate the flexibility of Robust-Gymnasium by encompassing tasks with disruptions across all stages and four disturbance modes, including an adversarial model using a large language model (LLM). |
| Researcher Affiliation | Academia | 1 University of California, Berkeley 2 California Institute of Technology 3 Shanghai Jiao Tong University 4 Virginia Tech 5 Carnegie Mellon University |
| Pseudocode | Yes | The pseudocode is shown in Listing 3. Furthermore, Equation (3) is for initial noise and Equation (4) is for noise during training; we use these equations to incorporate stochastic disturbances into the Ant robot model, again including factors like gravity fluctuations and wind speed variations, with pseudocode shown in Listing 4. Apart from wind and gravity disturbances, we also investigate robot shape disturbances during policy learning, as shown in Equations (5)-(8); an example of the pseudocode is shown in Listing 5. |
| Open Source Code | Yes | The code is available at this website. ... Website with the introduction, code, and examples: https://robust-gym.github.io/ |
| Open Datasets | Yes | We introduce Robust-Gymnasium, a unified modular benchmark designed for robust RL that supports a wide variety of disruptions across all key RL components: the agent's observed state and reward, the agent's actions, and the environment. Offering over sixty diverse task environments spanning control and robotics, safe RL, and multi-agent RL, it provides an open-source and user-friendly tool for the community to assess current methods and foster the development of robust RL algorithms. ... Gymnasium-Box2D (three relatively simple control tasks in games). These tasks are from Gymnasium (Towers et al., 2024)... Gymnasium-MuJoCo (eleven control tasks). It includes various robot models... Robosuite (twelve tasks for various modular robot platforms). |
| Dataset Splits | No | We mainly focus on two evaluation settings: 1) In-training: the disruptor simultaneously affects the agent and environment during both training and testing at each time step. This process is typically used in robotics to address sim-to-real gaps by introducing potential noise during training; 2) Post-training: the disruptor only impacts the agent and environment during testing, mimicking scenarios where learning algorithms are unaware of testing variability. The paper describes evaluation settings related to when disruptions occur (in-training vs. post-training) but does not provide specific dataset split percentages, sample counts, or citations to predefined splits for reproducibility in terms of data partitioning. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, processor types, or memory amounts) are provided in the paper. |
| Software Dependencies | No | No specific software versions (e.g., Python, PyTorch, TensorFlow, CUDA, scikit-learn, etc.) are explicitly mentioned in the paper, beyond general framework names like 'Gymnasium'. |
| Experiment Setup | Yes | We deploy several SOTA baselines in our benchmark to evaluate their robustness across various challenging scenarios. The implementation parameters associated with these methods are provided in Tables 9-13. |
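The in-training vs. post-training disruption settings quoted in the table can be sketched with a toy wrapper. This is a hypothetical illustration, not the benchmark's actual API: `ToyEnv`, `NoisyObservationWrapper`, and the `sigma` parameter are invented here for the example; Robust-Gymnasium's own disruptor interface differs.

```python
# Hypothetical sketch of a per-step observation disruptor, assuming a
# minimal env interface (reset/step). Setting active=True at all times
# models the "in-training" setting; enabling it only at evaluation time
# models the "post-training" setting.
import random


class ToyEnv:
    """Minimal stand-in environment with a 1-D state."""

    def __init__(self):
        self.state = 0.0

    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        self.state += action
        reward = -abs(self.state)  # reward staying near the origin
        return self.state, reward


class NoisyObservationWrapper:
    """Disruptor that perturbs the observed state with Gaussian noise."""

    def __init__(self, env, sigma, active=True):
        self.env = env
        self.sigma = sigma
        self.active = active  # toggle to switch in-/post-training modes

    def reset(self):
        return self._disturb(self.env.reset())

    def step(self, action):
        obs, reward = self.env.step(action)
        return self._disturb(obs), reward

    def _disturb(self, obs):
        # Only the agent's *observation* is noised; the true state evolves
        # untouched, matching an observation-disruption mode.
        return obs + random.gauss(0.0, self.sigma) if self.active else obs


if __name__ == "__main__":
    random.seed(0)
    env = NoisyObservationWrapper(ToyEnv(), sigma=0.1, active=True)
    obs = env.reset()
    for _ in range(5):
        obs, reward = env.step(-0.5 * obs)  # naive proportional "policy"
```

The same wrapper pattern extends to action or reward disruptions by noising the value passed into, or returned from, `step` instead of the observation.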