Reproducibility Study of "Explaining RL Decisions with Trajectories"
Authors: Clio Feng, Colin Bot, Bart den Boef, Bart Aaldering
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper reports a reproducibility study of Explaining RL Decisions with Trajectories by Deshmukh et al. (2023). ... It conducts quantitative and qualitative experiments across three environments: a Grid-world, an Atari video game (Seaquest), and a continuous control task from MuJoCo (Half Cheetah). |
| Researcher Affiliation | Academia | Clio Feng (University of Amsterdam), Colin Bot (University of Amsterdam), Bart Aaldering (University of Amsterdam), Bart den Boef (University of Amsterdam) |
| Pseudocode | Yes | Algorithm 1: trainExpPolicies. Data: Offline Data {τᵢ}, Trajectory Embeddings T, Trajectory Clusters C, Offline RL Algorithm offlineRLAlgo. Result: Explanation Policies {πⱼ}, Complementary Data Embeddings {dⱼ}. Algorithm 2: generateClusterAttribution. Data: State s, Original Policy π_orig, Explanation Policies {πⱼ}, Original Data Embedding d_orig, Complementary Data Embeddings {dⱼ}. Result: Final Cluster Attribution c_final |
| Open Source Code | Yes | Our implementations can be found on GitHub. |
| Open Datasets | Yes | Seaquest. We used seaquest-mixed-v4 from d4rl-Atari (Fu et al., 2020), as the original paper does not mention a specific dataset version. ... Half Cheetah. We used half-cheetah-medium-v2 from d3rlpy (Seno & Imai, 2022), as the specific version was not mentioned in the original paper. |
| Dataset Splits | No | Grid-world. The original paper uses 5 Dyna-Q agents placed at random start locations to obtain trajectories of lengths 1 to 15, which resulted in a dataset of 60 trajectories... Seaquest. ...extracted 717 trajectories, divided into sub-trajectories of length 30. ... Half Cheetah. ...consists of 1000 trajectories of length 1000, which are divided into sub-trajectories of size 25... The paper describes how the datasets were processed (e.g., divided into sub-trajectories) and used for evaluation (e.g., "1000 random observations"), but it does not specify explicit training, validation, or test splits for the agents trained in this reproducibility study. |
| Hardware Specification | Yes | Table 1: Computational requirements. Environment-specific requirements are listed, as well as the estimated kg CO2eq emissions. Estimations were calculated using the Machine Learning Impact calculator (Lacoste et al., 2019). Spec (Gridworld / Seaquest / Half Cheetah): Ran on: Jupyter Notebook / Python script / Python script; OS: 64-bit Ubuntu 22.04 / Windows 11 Pro / 64-bit Ubuntu 22.04; CPU: 6-core Ryzen 4500U at 2.3 GHz / Intel Core i5-12400F at 4.4 GHz / 6-core Ryzen 4500U at 2.3 GHz; GPU: Radeon Graphics / NVIDIA GeForce GTX 960 / Radeon Graphics; RAM: 16GB / 16GB / 16GB |
| Software Dependencies | No | What was difficult. Version numbers for the libraries utilized in the experiments are missing from the original paper. This resulted in dependency issues which took time to solve. ... The authors mentioned libraries used in their implementations. Though the paper mentions libraries like d3rlpy and d4rl, it does not provide specific version numbers for these or other software dependencies used in their own experimental setup. It explicitly states that version numbers were missing even in the original paper. |
| Experiment Setup | Yes | Grid-world. We trained the RL policies until convergence, defined by maximum changes between iterations with a threshold of 10⁻⁴. ... The Dyna-Q agents were trained for 2 episodes with 5 evaluation episodes per epoch, with learning rate 0.1 and gamma 0.95. The modified trajectory transformer with LSTM hidden layer size 32 was trained for 25 epochs with a learning rate of 1, clipping gradients to a maximum norm of 10. The X-means algorithm was run with a cluster range between 2 and 10 clusters, and T_soft was set to 10. The offline agents had a minimum action value and transition probability of 10⁻⁹. Seaquest and Half Cheetah: Seaquest's transformer ran with a vocab size of 18, a block size of 90, and 2719 timesteps, utilizing the reward-conditioned model type. Half Cheetah's transformer used default parameters with a sliding window of size 10. The X-means algorithm operated with a cluster range of 2 to 8 for Seaquest and 2 to 10 for Half Cheetah. Discrete SAC and regular SAC agents from d3rlpy (Seno & Imai, 2022) were employed for Seaquest and Half Cheetah respectively, with hyperparameters consistent with the original paper: actor, critic, and temperature learning rates of 3×10⁻⁴, and a batch size of 256 for Seaquest and 512 for Half Cheetah. T_soft was set to 10³ and 10⁴ for Seaquest and Half Cheetah respectively. |
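The cluster-attribution step described in the Pseudocode row (Algorithm 2) can be illustrated with a minimal Python sketch. This is an assumption-laden simplification, not the authors' code: the function name, the equality test on actions, and the softmax-with-temperature rule over embedding distances (using T_soft) are all illustrative choices inferred from the row's inputs and outputs.

```python
import numpy as np

def generate_cluster_attribution(state, pi_orig, exp_policies, d_orig, d_comp, t_soft):
    """Hypothetical sketch of Algorithm 2: attribute a decision at `state`
    to the trajectory cluster whose removal changes the original action,
    softly weighted by how close the complementary-data embedding stays
    to the original data embedding (temperature t_soft)."""
    a_orig = pi_orig(state)
    # Candidate clusters: those whose explanation policy picks a different action.
    candidates = [j for j, pi_j in enumerate(exp_policies) if pi_j(state) != a_orig]
    if not candidates:
        return None, {}  # no cluster's removal changes the action
    # Distance of each complementary-data embedding d_j to the original embedding.
    dists = np.array([np.linalg.norm(d_comp[j] - d_orig) for j in candidates])
    # Soft attribution: smaller embedding deviation -> higher weight.
    weights = np.exp(-dists / t_soft)
    weights /= weights.sum()
    c_final = candidates[int(np.argmax(weights))]
    return c_final, dict(zip(candidates, weights))
```

With toy policies and embeddings, the function returns the action-changing cluster whose complementary embedding deviates least from the original, plus the soft weights over all candidates.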