Inherently Robust Control through Maximum-Entropy Learning-Based Rollout

Authors: Felix Bok, Atanas Mirchev, Baris Kayalibay, Ole Jonas Wenzel, Patrick van der Smagt, Justin Bayer

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the robustness and transfer capabilities, experiments were conducted on an illustrative toy example, in simulation, and on a real Franka Panda robot arm. Furthermore, we assess its capabilities on a custom set of challenging environment perturbations applied to the popular DeepMind Control (DMC) suite, which are designed to probe robustness beyond simple parametric noise. Finally, we present a successful sim-to-real transfer of MELRO on the Franka Panda robot arm.
Researcher Affiliation | Collaboration | Felix Bok: Volkswagen Group & Technical University of Munich, Munich. Atanas Mirchev: Volkswagen Group, Munich.
Pseudocode | Yes | In Algorithms 1 and 2, we provide a detailed description of the offline training and online play of MELRO.
Open Source Code | No | The paper does not explicitly state that its code is open-sourced, nor does it provide a direct link to a code repository for the MELRO methodology.
Open Datasets | Yes | We find that our approach works excellently in the vast majority of cases on both the Real World Reinforcement Learning (RWRL) benchmark and on our own environment perturbations of the popular DeepMind Control (DMC) suite, which move beyond simple parametric noise. We evaluated MELRO across four simulated environments from the DMC suite (Tassa et al., 2018), paired with five distinct perturbations. To illustrate the challenges posed by traditional MaxEnt RL methods and domain randomization, and to highlight the advantages of MELRO, we use a simple 2-DoF point-maze environment modeled with the Gymnasium-Robotics suite (de Lazcano et al., 2024).
Dataset Splits | No | The paper describes training and evaluation environments and various perturbation scenarios for robustness testing, rather than explicit train/test/validation splits of a fixed dataset. For example, for the toy maze, 'the training of all components is performed without the incorporation of any blocks in the middle of the maze. The evaluation is then conducted on a maze that contains three blocks directly positioned between the starting position and the goal.' Similarly, in simulation experiments, 'Final performance metrics were obtained by assessing each methodology across all perturbation tasks through ten independent random seed evaluations, each comprising ten independent rollouts.'
Hardware Specification | Yes | Training and validation for all experiments have been conducted on a MIG 1g.10gb partition of an NVIDIA A100 GPU (compare to NVIDIA (2025) for more details).
Software Dependencies | No | The paper mentions implementing components in JAX and using JAX's jit functionality, as well as utilizing the Gymnasium-Robotics suite and MuJoCo. However, specific version numbers for these software components (e.g., 'JAX 0.4.23', 'Gymnasium-Robotics 1.0.0') are not provided in the text.
Experiment Setup | Yes | In this section, we list the hyperparameter search space and the optimal parameters found for all the methods used. Note that the entropy_weight for the base policy, both for the policy parameters and for rollout, was fixed to zero. Additionally, the exploration_noise, which describes the standard deviation of a zero-centered Gaussian noise added to the actions, enables exploration only in the non-entropy-regularized MBRL setting. (Appendix B, page 14, followed by Table 1: MBRL Hyperparameter and Table 2: Rollout Hyperparameter Search Space.)
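The exploration_noise mechanism described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `add_exploration_noise` and the use of NumPy are assumptions; the paper only specifies that zero-centered Gaussian noise with standard deviation exploration_noise is added to actions in the non-entropy-regularized MBRL setting.

```python
import numpy as np

def add_exploration_noise(action, exploration_noise, rng):
    # exploration_noise is the standard deviation of a zero-centered
    # Gaussian; per the paper, this noise is added to actions only in
    # the non-entropy-regularized MBRL setting.
    # (Function name and NumPy usage are illustrative assumptions.)
    noise = rng.normal(loc=0.0, scale=exploration_noise,
                       size=np.shape(action))
    return np.asarray(action, dtype=float) + noise

# Usage: perturb a 2-D action with std-0.1 Gaussian noise.
rng = np.random.default_rng(seed=0)
noisy_action = add_exploration_noise(np.zeros(2), 0.1, rng)
```

With exploration_noise set to 0.0 the action passes through unchanged, which matches disabling this mechanism in the entropy-regularized setting.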