Mitigating Information Loss in Tree-Based Reinforcement Learning via Direct Optimization

Authors: Sascha Marton, Tim Grams, Florian Vogt, Stefan Lüdtke, Christian Bartelt, Heiner Stuckenschmidt

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate SYMPOL on a set of benchmark RL tasks, demonstrating its superiority over alternative tree-based RL approaches in terms of performance and interpretability. ... Results. Through extensive experiments on benchmark RL environments, we demonstrate that SYMPOL does not suffer from information loss and outperforms existing tree-based RL approaches in terms of interpretability and performance (Section 5.2), providing human-understandable explanations."
Researcher Affiliation | Academia | "1 University of Mannheim, 2 Technical University of Clausthal, 3 University of Rostock"
Pseudocode | Yes | Appendix B.2 ("SYMPOL ALGORITHMIC PRESENTATION") presents the extracted tree policy as pseudocode (else branches restored from the original's line numbering, which skipped lines 6, 8, and 10):

    def tree_function(obs):
        if obs[field one to front and one to left] is empty:
            if obs[field one to front] is lava:
                if obs[field one to left] is empty:
                    action = turn right
                else:
                    action = turn left
            else:
                action = move forward
        else:
            action = turn left
        return action
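The pseudocode above can be sketched as a runnable Python function. The observation encoding (a dict mapping relative cell positions to their contents) and the string-valued actions are assumptions for illustration, not the paper's actual MiniGrid interface:

```python
def tree_policy(obs):
    """Hypothetical runnable version of the extracted SYMPOL tree policy.

    obs: dict mapping relative cell positions to their contents,
         e.g. {"front_left": "empty", "front": "lava", "left": "wall"}.
    The keys and action strings are illustrative assumptions.
    """
    if obs["front_left"] == "empty":
        if obs["front"] == "lava":
            if obs["left"] == "empty":
                return "turn right"
            return "turn left"
        return "move forward"
    return "turn left"
```

For example, `tree_policy({"front_left": "empty", "front": "empty", "left": "wall"})` returns `"move forward"`, matching the tree's third branch.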
Open Source Code | Yes | "Our implementation is available under: https://github.com/s-marton/sympol"
Open Datasets | Yes | "Specifically, we used the control environments Cart Pole (CP), Acrobot (AB), Lunar Lander (LL), Mountain Car Continuous (MC-C) and Pendulum (PD-C), as well as the Mini Grid (Chevalier-Boisvert et al., 2023) environments Empty-Random (E-R), Door Key (DK), Lava Gap (LG) and Dist Shift (DS)."
Dataset Splits | No | The paper does not provide explicit train/test/validation dataset splits. It instead describes an evaluation procedure over multiple random trainings and evaluation episodes on RL environments, which generate data dynamically: "We report the average undiscounted cumulative reward over 5 random trainings with 5 random evaluation episodes each (=25 evaluations for each method)." Hyperparameters were optimized based on validation reward, but dataset splits in the conventional sense are not mentioned.
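The quoted protocol (5 random trainings, 5 evaluation episodes each, 25 reward values per method) can be sketched as follows. `train_policy` and `run_episode` are hypothetical stand-ins for the paper's training and rollout code:

```python
import statistics

def evaluate(train_policy, run_episode, n_trainings=5, n_episodes=5):
    """Illustrative sketch of the described evaluation protocol.

    Trains n_trainings policies with different seeds, runs n_episodes
    evaluation episodes per policy, and averages the undiscounted
    cumulative rewards (5 x 5 = 25 values by default).
    """
    rewards = []
    for seed in range(n_trainings):
        policy = train_policy(seed)  # hypothetical training routine
        for ep in range(n_episodes):
            # undiscounted cumulative reward of one evaluation episode
            rewards.append(run_episode(policy, ep))
    return statistics.mean(rewards), len(rewards)
```

With stub functions plugged in, the second return value is 25, matching the "(=25 evaluations for each method)" in the quote.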
Hardware Specification Yes The experiments were conducted on a single NVIDIA RTX A6000.
Software Dependencies | No | The paper mentions several software frameworks, such as JAX, Optuna, and Gymnasium, but does not provide specific version numbers for them. For example: "We implemented SYMPOL in a highly efficient single-file JAX implementation...", "...we optimized the hyperparameters based on the validation reward with optuna (Akiba et al., 2019)", and "...used the standard Gymnasium (Towers et al., 2024) implementation."
Experiment Setup | Yes | "For SYMPOL, SDT and MLP, we optimized the hyperparameters based on the validation reward with optuna (Akiba et al., 2019) for 60 trials using a predefined grid. For D-SDT we discretized the SDT and for SA-DT, we distilled the MLP with the highest performance. More details on the hyperparameters can be found in Appendix C." Appendix C contains tables listing the detailed hyperparameter grids (e.g., Table 13: HPO Grid SYMPOL) and the best hyperparameters selected for each environment (e.g., Table 16: Best Hyperparameters SYMPOL (Control)).
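The described setup, a fixed budget of 60 trials drawn from a predefined grid, can be illustrated with a pure-Python sketch. The paper uses Optuna for this; the grid values below are hypothetical examples, not the actual HPO grid from its Appendix C:

```python
import itertools
import random

def sample_trials(grid, n_trials=60, seed=0):
    """Draw up to n_trials configurations from a predefined grid.

    Enumerates the full Cartesian product of the grid, then samples
    without replacement when the grid exceeds the trial budget.
    """
    configs = [dict(zip(grid, values))
               for values in itertools.product(*grid.values())]
    if len(configs) <= n_trials:
        return configs  # grid smaller than budget: evaluate everything
    return random.Random(seed).sample(configs, n_trials)

# Hypothetical grid for illustration only (3 x 3 x 3 = 27 configurations).
grid = {"learning_rate": [1e-4, 3e-4, 1e-3],
        "depth": [4, 6, 8],
        "entropy_coef": [0.0, 0.01, 0.1]}
trials = sample_trials(grid, n_trials=60)
```

Since this toy grid has only 27 configurations, all of them fit within the 60-trial budget; a larger grid would be subsampled.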