'Explaining RL Decisions with Trajectories': A Reproducibility Study
Authors: Karim Ahmed Abdel Sadek, Matteo Nulli, Joan Velja, Jort Vincenti
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work investigates the reproducibility of the paper "Explaining RL Decisions with Trajectories" by Deshmukh et al. (2023). The original paper introduces a novel approach in explainable reinforcement learning based on attributing the decisions of an agent to specific clusters of trajectories encountered during training. We verify the main claims from the paper, which state that (i) training on fewer trajectories induces a lower initial state value, (ii) trajectories in a cluster present similar high-level patterns, (iii) distant trajectories influence the decision of an agent, and (iv) humans correctly identify the attributed trajectories to the decision of the agent. We recover the environments used by the authors based on the partial original code they provided for one of the environments (Grid-World), and implemented the remaining environments from scratch (Seaquest, Half Cheetah, Breakout, and Q*Bert). While we confirm that (i), (ii), and (iii) partially hold, we extend the largely qualitative experiments of the authors by introducing a quantitative metric to further support (iii), and new experiments and visual results for (i). Moreover, we investigate the use of different clustering algorithms and encoder architectures to further support (ii). We could not support (iv), given the limited extent of the original experiments. |
| Researcher Affiliation | Academia | Karim Abdel Sadek, Matteo Nulli, Joan Velja and Jort Vincenti University of Amsterdam EMAIL |
| Pseudocode | Yes | A rigorous definition of how we calculate the average and the distances can be found in Appendix C.3, together with detailed pseudo-code in Algorithm 1. In our experiments, two different clustering algorithms are employed. For DBSCAN we set ϵ = 2.04. Ten final clusters are obtained. No seed is needed given the deterministic nature of DBSCAN. The other clustering method is XMeans. The seeds are set to 0 and 99 for the initialization of the centers and for the XMeans algorithm, respectively. We perform our experiments on the Grid-World Four Room environment introduced by Sutton et al. (1999). Its size is 11x11. Given the larger grid and the scope of our experiment, we generate a higher number of trajectories. Namely, we produce 250 trajectories that end in a positive terminal state and 50 trajectories that achieve a negative reward. |
| Open Source Code | Yes | We provide our complete code implementation. Additional training and implementation details can be found in Appendix D. The complete code implementation is available in our GitHub repository. |
| Open Datasets | Yes | Regarding Grid-World, agents are trained specifically to generate data trajectories. For Seaquest, data is instead downloaded from the d4rl-Atari repository; for Breakout and Q*Bert, from the Expert-offline RL repository; and in the case of Half Cheetah, from the d4rl repository of Fu et al. (2020). |
| Dataset Splits | No | The paper describes how "complementary datasets" are created by removing trajectories belonging to a specific cluster for training new explanation policies and actions. However, it does not provide specific training/test/validation dataset splits (e.g., percentages, sample counts, or references to standard splits) for the overall collected data, which is typically expected for model reproduction. Appendix D mentions training for "10 epochs, each taking 100 steps" but does not detail data partitioning for training, validation, and testing. |
| Hardware Specification | Yes | The experiments were performed using a MacBook with an Apple M2 Pro chip with 10 CPU cores (1), a MacBook with an Apple M1 chip with 8 CPU cores (2), and a Microsoft Windows 11 Pro machine with an Intel(R) Core(TM) i7-10710U with 6 CPU cores (3). |
| Software Dependencies | No | The paper mentions several software components and libraries, such as d4rl-Atari Repository, Expert-offline RL Repository, d4rl Repository, XMeans clustering algorithm (Novikov 2019), DBSCAN algorithm (Ester et al. 1996), Discrete SAC (Christodoulou 2019), SAC (Haarnoja et al. 2018), d4rl implementation (Seno & Imai 2022), Trajectory Transformer (Janner et al. 2021), BERT base model (Devlin et al. 2019), d3rlpy framework (Fu et al. 2021), and numpy (Harris et al. 2020). However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | Appendix C.5 'Additional Hyper-Parameters Experiments' includes a table (Table 11) with 'Alpha', 'Gamma', and 'Eval. Epochs' values, for example: '0.1', '0.95', '15' resulting in a 'Loss Value' of '0.0678'. Additionally, Appendix D states: 'Finally, although the authors mention a training schedule until saturation without further explanation, we followed the guidelines provided in the d3rlpy framework as outlined by Fu et al. (2021), training both our models for 10 epochs, each taking 100 steps.' |
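The clustering setup quoted above (DBSCAN with ϵ = 2.04, needing no seed due to its determinism) can be illustrated with a minimal sketch. The synthetic 2-D "embeddings" below are stand-ins: the paper's trajectory encoder, feature space, and exact cluster count are not reproduced here, and the scikit-learn implementation is an assumption rather than the authors' code.

```python
# Minimal sketch, assuming scikit-learn's DBSCAN and synthetic stand-in
# embeddings (the paper clusters learned trajectory embeddings instead).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Three well-separated synthetic "trajectory embedding" groups.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
embeddings = np.vstack(
    [c + rng.normal(scale=0.5, size=(100, 2)) for c in centers]
)

# eps matches the value reported in the reproduction (ϵ = 2.04);
# DBSCAN is deterministic, so no random seed is involved.
clustering = DBSCAN(eps=2.04, min_samples=5).fit(embeddings)
labels = clustering.labels_          # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
```

Because DBSCAN derives clusters purely from density reachability at the given ϵ, rerunning this script always yields the same labels, which is why the report notes that no seed is needed for this method (unlike XMeans, which requires seeded center initialization).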