Robust Reinforcement Learning in a Sample-Efficient Setting

Authors: Siemen Herremans, Ali Anwar, Siegfried Mercelis

TMLR 2025

Each entry below gives a reproducibility variable, its assessed result, and the supporting LLM response.
Research Type: Experimental
LLM response: "Our experimental results indicate a notable improvement in policy robustness on high-dimensional control tasks, with the auxiliary model enhancing the performance of the learned policy in distorted MDPs, while maintaining the data-efficiency of the base algorithm. Our methodology is also compared against various other robust RL approaches. We further examine how pessimism is achieved by exploring the learned deviation between the proposed auxiliary world model and the nominal model."
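The last sentence describes probing pessimism through the gap between the auxiliary world model and the nominal model. Below is a minimal sketch of one way such a deviation could be quantified, assuming both models output diagonal-Gaussian next-state distributions; the `predict` API, the model names, and the KL probe itself are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def diag_gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0,
        axis=-1,
    )

def deviation_probe(nominal_model, aux_model, states, actions):
    """Per-transition KL between the pessimistic auxiliary model and the
    nominal model; large values mark where pessimism concentrates."""
    mu_n, var_n = nominal_model.predict(states, actions)  # hypothetical API
    mu_a, var_a = aux_model.predict(states, actions)      # hypothetical API
    return diag_gaussian_kl(mu_a, var_a, mu_n, var_n)
```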
Researcher Affiliation: Academia
LLM response: All three authors (Siemen Herremans, Ali Anwar, Siegfried Mercelis) share the same affiliation: IDLab, Department of Electronics and ICT, Faculty of Applied Engineering, University of Antwerp - imec.
Pseudocode: Yes
LLM response: "Algorithm 1: RMBPO (additions in blue)"; "Algorithm 2: Supervised Pessimistic Distribution Learning with an Auxiliary Model".
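Algorithm 2's title suggests the auxiliary model is fit in a supervised fashion while being nudged toward pessimistic predictions. A minimal PyTorch-style sketch of one such objective follows, assuming a likelihood term on observed transitions plus a value-based pessimism term weighted by λa; the loss composition, `value_fn`, and all names here are assumptions for illustration, not the authors' exact objective:

```python
import torch

def pessimistic_model_loss(aux_model, value_fn, states, actions,
                           next_states, lam_a=1e-4):
    """Supervised fit to real transitions, plus a pessimism term that lowers
    the value of the states the auxiliary model predicts (sketch only)."""
    # Hypothetical: aux_model returns a torch.distributions.Normal over s'.
    dist = aux_model(states, actions)
    nll = -dist.log_prob(next_states).sum(-1).mean()
    # Pessimism: penalizing high value at predicted next states pushes the
    # auxiliary model to deviate from the data toward lower-value outcomes.
    pessimism = value_fn(dist.rsample()).mean()
    return nll + lam_a * pessimism
```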
Open Source Code: No
LLM response: Evaluation code and weights are available at https://github.com/rmbpo-eval/rmbpo-tmlr. "...The authors are not able to release the full source code of RMBPO at the time of submission of this paper; however, the reader is encouraged to contact the first author of this work with any related questions."
Open Datasets: Yes
LLM response: "Secondly (ii), we evaluate the empirical performance of our algorithm on high-dimensional MuJoCo and DeepMind Control Suite (DMC) benchmarks under simultaneous parameter distortions."
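A minimal sketch of how such parameter distortions are commonly applied to a MuJoCo benchmark via Gymnasium, here scaling body masses and friction; the distortion factors and the helper name are illustrative assumptions, and the paper's exact evaluation protocol may differ:

```python
import gymnasium as gym

def make_distorted_env(env_id="Hopper-v4", mass_scale=1.3, friction_scale=0.7):
    """Build a MuJoCo env whose physical parameters deviate from the
    nominal training dynamics (illustrative distortion, not the paper's)."""
    env = gym.make(env_id)
    model = env.unwrapped.model                   # raw MuJoCo model
    model.body_mass[:] *= mass_scale              # heavier links
    model.geom_friction[:, 0] *= friction_scale   # reduced sliding friction
    return env

env = make_distorted_env()
obs, info = env.reset(seed=0)
```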
Dataset Splits: No
LLM response: The paper describes an agent interacting with environments (MuJoCo and DMC), so the data are generated dynamically through these interactions. Fixed training/validation/test splits, as defined for static supervised-learning datasets, are not applicable here; evaluation is instead performed on distorted versions of the training environment.
Hardware Specification: Yes
LLM response: "Experiments were run on an Ubuntu 20.04 (Docker) machine with a single NVIDIA Quadro RTX 4000 GPU, two CPU cores, and 38 GB of memory."
Software Dependencies: No
LLM response: The paper mentions Ubuntu 20.04 and Docker as the operating system and containerization platform, but it does not specify versions for the critical software libraries and frameworks (e.g., Python, PyTorch/JAX/TensorFlow) or other dependencies needed to replicate the experiment environment beyond the OS.
Experiment Setup: Yes
LLM response: Table 2 (Hyperparameters):

Hyperparameter            Hopper-v4   Walker2d-v4   HalfCheetah-v4   DMC Walker
η                         4           0.5           0.25 / 0.5       0.25
λa                        1e-4        1e-4          1e-4             1e-4
Total environment steps   125k        300k          400k             200k
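For readability, the same settings expressed as a per-environment config mapping; the values are transcribed from Table 2 above, while the key names (`eta`, `lambda_aux`, `total_env_steps`) are hypothetical labels:

```python
# Per-environment hyperparameters transcribed from Table 2.
# Key names are hypothetical; values come from the table above.
HYPERPARAMS = {
    "Hopper-v4":      {"eta": 4.0,         "lambda_aux": 1e-4, "total_env_steps": 125_000},
    "Walker2d-v4":    {"eta": 0.5,         "lambda_aux": 1e-4, "total_env_steps": 300_000},
    "HalfCheetah-v4": {"eta": (0.25, 0.5), "lambda_aux": 1e-4, "total_env_steps": 400_000},  # two settings reported
    "DMC-Walker":     {"eta": 0.25,        "lambda_aux": 1e-4, "total_env_steps": 200_000},
}
```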