Inherently Robust Control through Maximum-Entropy Learning-Based Rollout
Authors: Felix Bok, Atanas Mirchev, Baris Kayalibay, Ole Jonas Wenzel, Patrick van der Smagt, Justin Bayer
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the robustness and transfer capabilities, experiments were conducted for an illustrative toy example, in simulation, and on a real Franka Panda robot arm. Furthermore, we assess its capabilities on a custom set of challenging environment perturbations applied to the popular DeepMind Control (DMC) suite, which are designed to probe robustness beyond simple parametric noise. Finally, we present a successful sim-to-real transfer of MELRO on the Franka Panda robot arm. |
| Researcher Affiliation | Collaboration | Felix Bok, Volkswagen Group & Technical University of Munich, Munich. Atanas Mirchev, Volkswagen Group, Munich. |
| Pseudocode | Yes | In Algorithms 1 and 2, we provide a detailed description of the offline training and online play of MELRO. |
| Open Source Code | No | The paper does not explicitly provide a statement about open-sourcing its own code or a direct link to a code repository for the MELRO methodology. |
| Open Datasets | Yes | We find that our approach works excellently in the vast majority of cases on both the Real World Reinforcement Learning (RWRL) benchmark and on our own environment perturbations of the popular DeepMind Control (DMC) suite, which move beyond simple parametric noise. We evaluated MELRO across four simulated environments from the DMC suite (Tassa et al., 2018), paired with five distinct perturbations. To illustrate the challenges posed by traditional MaxEnt RL methods and domain randomization, and to highlight the advantages of MELRO, we use a simple 2-DoF point maze environment modeled with the Gymnasium-Robotics suite (de Lazcano et al., 2024). |
| Dataset Splits | No | The paper describes training and evaluation environments and various perturbation scenarios for robustness testing, rather than explicit train/test/validation splits of a fixed dataset. For example, for the toy maze, 'the training of all components is performed without the incorporation of any blocks in the middle of the maze. The evaluation is then conducted on a maze that contains three blocks directly positioned between the starting position and the goal.' Similarly, in simulation experiments, 'Final performance metrics were obtained by assessing each methodology across all perturbation tasks through ten independent random seed evaluations, each comprising ten independent rollouts.' |
| Hardware Specification | Yes | Training and validation for all experiments have been conducted on a MIG 1g.10gb partition of an NVIDIA A100 GPU (compare to NVIDIA (2025) for more details). |
| Software Dependencies | No | The paper mentions implementing components in JAX and using JAX's jit functionality, as well as utilizing the Gymnasium-Robotics suite and MuJoCo. However, specific version numbers for these software components (e.g., 'JAX 0.4.23', 'Gymnasium-Robotics 1.0.0') are not provided in the text. |
| Experiment Setup | Yes | In this section, we list the hyperparameter search space and the optimal parameters found for all the methods used. Note that the entropy_weight for the base policy, for the policy parameters as well as for rollout, was fixed to zero. Additionally, the exploration_noise, which describes the standard deviation of a zero-centered Gaussian noise added to the actions, was used to enable exploration only in the non-entropy-regularized MBRL setting. (Appendix B, Page 14, followed by Table 1: MBRL Hyperparameter and Table 2: Rollout Hyperparameter Search Space). |
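The exploration_noise hyperparameter quoted above describes zero-centered Gaussian noise added to actions. A minimal sketch of how that might look in JAX (the paper's stated framework); the function and variable names here are illustrative assumptions, not the authors' code:

```python
import jax
import jax.numpy as jnp

def add_exploration_noise(key, action, exploration_noise):
    """Perturb an action with zero-centered Gaussian noise.

    exploration_noise is the standard deviation of the noise, matching the
    hyperparameter described in the paper's Appendix B (name assumed here).
    """
    noise = exploration_noise * jax.random.normal(key, action.shape)
    return action + noise

# Illustrative usage with a placeholder deterministic action.
key = jax.random.PRNGKey(0)
action = jnp.zeros(3)
noisy_action = add_exploration_noise(key, action, exploration_noise=0.1)
print(noisy_action.shape)  # (3,)
```

With exploration_noise set to 0.0 the action passes through unchanged, which is consistent with the paper's note that this noise is enabled only in the non-entropy-regularized MBRL setting.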