Mitigating Information Loss in Tree-Based Reinforcement Learning via Direct Optimization
Authors: Sascha Marton, Tim Grams, Florian Vogt, Stefan Lüdtke, Christian Bartelt, Heiner Stuckenschmidt
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SYMPOL on a set of benchmark RL tasks, demonstrating its superiority over alternative tree-based RL approaches in terms of performance and interpretability. ... Results. Through extensive experiments on benchmark RL environments, we demonstrate that SYMPOL does not suffer from information loss and outperforms existing tree-based RL approaches in terms of interpretability and performance (Section 5.2), providing human-understandable explanations. |
| Researcher Affiliation | Academia | 1University of Mannheim 2Technical University of Clausthal 3University of Rostock |
| Pseudocode | Yes | B.2 SYMPOL ALGORITHMIC PRESENTATION — def tree_function(obs): if obs[field one to front and one to left] is empty: if obs[field one to front] is lava: if obs[field one to left] is empty: action = turn right; else: action = turn left; else: action = move forward; else: action = turn left; return action |
| Open Source Code | Yes | Our implementation is available under: https://github.com/s-marton/sympol |
| Open Datasets | Yes | Specifically, we used the control environments Cart Pole (CP), Acrobot (AB), Lunar Lander (LL), Mountain Car Continuous (MC-C) and Pendulum (PD-C), as well as the Mini Grid (Chevalier-Boisvert et al., 2023) environments Empty-Random (E-R), Door Key (DK), Lava Gap (LG) and Dist Shift (DS). |
| Dataset Splits | No | The paper does not provide specific train/test/validation dataset splits. It describes an evaluation procedure over multiple random trainings and evaluation episodes on RL environments, which generate data dynamically: "We report the average undiscounted cumulative reward over 5 random trainings with 5 random evaluation episodes each (=25 evaluations for each method)." While hyperparameters were optimized based on validation reward, explicit dataset splits are not mentioned. |
| Hardware Specification | Yes | The experiments were conducted on a single NVIDIA RTX A6000. |
| Software Dependencies | No | The paper names several software components and frameworks (JAX, Optuna, Gymnasium) but does not provide version numbers for them. For example: "We implemented SYMPOL in a highly efficient single-file JAX implementation..." and "...we optimized the hyperparameters based on the validation reward with optuna (Akiba et al., 2019)" and "...used the standard Gymnasium (Towers et al., 2024) implementation." |
| Experiment Setup | Yes | For SYMPOL, SDT and MLP, we optimized the hyperparameters based on the validation reward with optuna (Akiba et al., 2019) for 60 trials using a predefined grid. For D-SDT we discretized the SDT, and for SA-DT we distilled the MLP with the highest performance. More details on the hyperparameters can be found in Appendix C, which contains tables listing the detailed hyperparameter grids (e.g., Table 13: HPO Grid SYMPOL) and the best hyperparameters selected for each environment (e.g., Table 16: Best Hyperparameters SYMPOL (Control)). |
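The decision-tree policy excerpted in the "Pseudocode" row can be sketched as runnable Python. The observation keys (`front_left`, `front`, `left`) and action strings below are illustrative assumptions, not the paper's actual MiniGrid observation encoding:

```python
def tree_policy(obs: dict) -> str:
    """Axis-aligned decision tree over symbolic grid observations.

    Hypothetical re-implementation of the extracted SYMPOL tree
    (field names and action labels are assumptions, not the
    paper's encoding).
    """
    if obs["front_left"] == "empty":          # field one to front and one to left
        if obs["front"] == "lava":            # field one to front
            if obs["left"] == "empty":        # field one to left
                return "turn right"
            return "turn left"
        return "move forward"
    return "turn left"
```

Because the policy is a single axis-aligned tree over symbolic features, every action can be traced to a short chain of human-readable conditions, which is the interpretability claim the paper evaluates.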