Mitigating Information Loss in Tree-Based Reinforcement Learning via Direct Optimization

Authors: Sascha Marton, Tim Grams, Florian Vogt, Stefan Lüdtke, Christian Bartelt, Heiner Stuckenschmidt

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate SYMPOL on a set of benchmark RL tasks, demonstrating its superiority over alternative tree-based RL approaches in terms of performance and interpretability. ... Results. Through extensive experiments on benchmark RL environments, we demonstrate that SYMPOL does not suffer from information loss and outperforms existing tree-based RL approaches in terms of interpretability and performance (Section 5.2), providing human-understandable explanations."
Researcher Affiliation | Academia | "1 University of Mannheim, 2 Technical University of Clausthal, 3 University of Rostock"
Pseudocode | Yes | Appendix B.2 ("SYMPOL ALGORITHMIC PRESENTATION") presents the extracted tree policy as pseudocode (else branches restored from the original's line numbering, which skipped lines 6, 8, and 10):

    def tree_function(obs):
        if obs[field one to front and one to left] is empty:
            if obs[field one to front] is lava:
                if obs[field one to left] is empty:
                    action = turn right
                else:
                    action = turn left
            else:
                action = move forward
        else:
            action = turn left
        return action
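The pseudocode above can be sketched as a runnable Python function. The observation encoding (a dict mapping relative cell positions to their contents) and the string-valued actions are assumptions for illustration, not the paper's actual MiniGrid interface:

```python
def tree_policy(obs):
    """Hypothetical runnable version of the extracted SYMPOL tree policy.

    obs: dict mapping relative cell positions to their contents,
         e.g. {"front_left": "empty", "front": "lava", "left": "wall"}.
    The keys and action strings are illustrative assumptions.
    """
    if obs["front_left"] == "empty":
        if obs["front"] == "lava":
            if obs["left"] == "empty":
                return "turn right"
            return "turn left"
        return "move forward"
    return "turn left"
```

For example, `tree_policy({"front_left": "empty", "front": "empty", "left": "wall"})` returns `"move forward"`, matching the tree's third branch.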
Open Source Code | Yes | "Our implementation is available under: https://github.com/s-marton/sympol"
Open Datasets | Yes | "Specifically, we used the control environments Cart Pole (CP), Acrobot (AB), Lunar Lander (LL), Mountain Car Continuous (MC-C) and Pendulum (PD-C), as well as the Mini Grid (Chevalier-Boisvert et al., 2023) environments Empty-Random (E-R), Door Key (DK), Lava Gap (LG) and Dist Shift (DS)."
Dataset Splits | No | The paper does not provide explicit train/test/validation dataset splits. It instead describes an evaluation procedure over multiple random trainings and evaluation episodes on RL environments, which generate data dynamically: "We report the average undiscounted cumulative reward over 5 random trainings with 5 random evaluation episodes each (=25 evaluations for each method)." Hyperparameters were optimized based on validation reward, but dataset splits in the conventional sense are not mentioned.
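The quoted protocol (5 random trainings, 5 evaluation episodes each, 25 reward values per method) can be sketched as follows. `train_policy` and `run_episode` are hypothetical stand-ins for the paper's training and rollout code:

```python
import statistics

def evaluate(train_policy, run_episode, n_trainings=5, n_episodes=5):
    """Illustrative sketch of the described evaluation protocol.

    Trains n_trainings policies with different seeds, runs n_episodes
    evaluation episodes per policy, and averages the undiscounted
    cumulative rewards (5 x 5 = 25 values by default).
    """
    rewards = []
    for seed in range(n_trainings):
        policy = train_policy(seed)  # hypothetical training routine
        for ep in range(n_episodes):
            # undiscounted cumulative reward of one evaluation episode
            rewards.append(run_episode(policy, ep))
    return statistics.mean(rewards), len(rewards)
```

With stub functions plugged in, the second return value is 25, matching the "(=25 evaluations for each method)" in the quote.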
Hardware Specification Yes The experiments were conducted on a single NVIDIA RTX A6000.
Software Dependencies | No | The paper mentions several software frameworks, such as JAX, Optuna, and Gymnasium, but does not provide specific version numbers for them. For example: "We implemented SYMPOL in a highly efficient single-file JAX implementation...", "...we optimized the hyperparameters based on the validation reward with optuna (Akiba et al., 2019)", and "...used the standard Gymnasium (Towers et al., 2024) implementation."
Experiment Setup | Yes | "For SYMPOL, SDT and MLP, we optimized the hyperparameters based on the validation reward with optuna (Akiba et al., 2019) for 60 trials using a predefined grid. For D-SDT we discretized the SDT and for SA-DT, we distilled the MLP with the highest performance. More details on the hyperparameters can be found in Appendix C." Appendix C contains tables listing the detailed hyperparameter grids (e.g., Table 13: HPO Grid SYMPOL) and the best hyperparameters selected for each environment (e.g., Table 16: Best Hyperparameters SYMPOL (Control)).
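The described setup, a fixed budget of 60 trials drawn from a predefined grid, can be illustrated with a pure-Python sketch. The paper uses Optuna for this; the grid values below are hypothetical examples, not the actual HPO grid from its Appendix C:

```python
import itertools
import random

def sample_trials(grid, n_trials=60, seed=0):
    """Draw up to n_trials configurations from a predefined grid.

    Enumerates the full Cartesian product of the grid, then samples
    without replacement when the grid exceeds the trial budget.
    """
    configs = [dict(zip(grid, values))
               for values in itertools.product(*grid.values())]
    if len(configs) <= n_trials:
        return configs  # grid smaller than budget: evaluate everything
    return random.Random(seed).sample(configs, n_trials)

# Hypothetical grid for illustration only (3 x 3 x 3 = 27 configurations).
grid = {"learning_rate": [1e-4, 3e-4, 1e-3],
        "depth": [4, 6, 8],
        "entropy_coef": [0.0, 0.01, 0.1]}
trials = sample_trials(grid, n_trials=60)
```

Since this toy grid has only 27 configurations, all of them fit within the 60-trial budget; a larger grid would be subsampled.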