Reproducibility Study of "Explaining RL Decisions with Trajectories"
Authors: Clio Feng, Colin Bot, Bart den Boef, Bart Aaldering
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper reports a reproducibility study of Explaining RL Decisions with Trajectories by Deshmukh et al. (2023). ... It conducts quantitative and qualitative experiments across three environments: a Grid-world, an Atari video game (Seaquest), and a continuous control task from MuJoCo (Half Cheetah). |
| Researcher Affiliation | Academia | Clio Feng (University of Amsterdam), Colin Bot (University of Amsterdam), Bart Aaldering (University of Amsterdam), Bart den Boef (University of Amsterdam) |
| Pseudocode | Yes | Algorithm 1: trainExpPolicies. Data: Offline Data {τᵢ}, Trajectory Embeddings T, Trajectory Clusters C, Offline RL Algorithm offlineRLAlgo. Result: Explanation Policies {πⱼ}, Complementary Data Embeddings {dⱼ}. Algorithm 2: generateClusterAttribution. Data: State s, Original Policy π_orig, Explanation Policies {πⱼ}, Original Data Embedding d_orig, Complementary Data Embeddings {dⱼ}. Result: Final Cluster Attribution c_final |
| Open Source Code | Yes | Our implementations can be found on GitHub. |
| Open Datasets | Yes | Seaquest. We used seaquest-mixed-v4 from d4rl-Atari (Fu et al., 2020), as the original paper does not mention a specific dataset version. ... Half Cheetah. We used half-cheetah-medium-v2 from d3rlpy (Seno & Imai, 2022), as the specific version was not mentioned in the original paper. |
| Dataset Splits | No | Grid-world. The original paper uses 5 Dyna-Q agents placed at random start locations to obtain trajectories of lengths 1 to 15, which resulted in a dataset of 60 trajectories... Seaquest. ...extracted 717 trajectories, divided into sub-trajectories of length 30. ... Half Cheetah. ...consists of 1000 trajectories of length 1000, which are divided into sub-trajectories of size 25... The paper describes how the datasets were processed (e.g., divided into sub-trajectories) and used for evaluation (e.g., "1000 random observations"), but it does not specify explicit training, validation, or test splits for the agents trained in this reproducibility study. |
| Hardware Specification | Yes | Table 1: Computational requirements. Environment-specific requirements are listed, as well as the estimated kg CO2eq emissions. Estimations were calculated using the Machine Learning Impact calculator (Lacoste et al., 2019). Spec (Gridworld / Seaquest / Half Cheetah): Ran on: Jupyter Notebook / Python script / Python script; OS: 64-bit Ubuntu 22.04 / Windows 11 Pro / 64-bit Ubuntu 22.04; CPU: 6-core Ryzen 4500U at 2.3 GHz / Intel Core i5-12400F at 4.4 GHz / 6-core Ryzen 4500U at 2.3 GHz; GPU: Radeon Graphics / NVIDIA GeForce GTX 960 / Radeon Graphics; RAM: 16GB / 16GB / 16GB |
| Software Dependencies | No | What was difficult. Version numbers for the libraries utilized in the experiments are missing from the original paper. This resulted in dependency issues which took time to solve. ... The authors mentioned libraries used in their implementations. Though the paper mentions libraries like d3rlpy and d4rl, it does not provide specific version numbers for these or other software dependencies used in their own experimental setup. It explicitly states that version numbers were missing even in the original paper. |
| Experiment Setup | Yes | Grid-world. We trained the RL policies until convergence, defined by maximum changes between iterations with a threshold of 10⁻⁴. ... The Dyna-Q agents were trained for 2 episodes with 5 evaluation episodes per epoch, with learning rate 0.1 and gamma 0.95. The modified trajectory transformer with LSTM hidden layer size 32 was trained for 25 epochs with a learning rate of 1, clipping gradients to a maximum norm of 10. The X-means algorithm was run with a cluster range between 2 and 10 clusters, and T_soft was set to 10. The offline agents had a minimum action value and transition probability of 10⁻⁹. Seaquest and Half Cheetah: Seaquest's transformer ran with a vocab size of 18, a block size of 90, and 2719 timesteps, utilizing the reward-conditioned model type. Half Cheetah's transformer used default parameters with a sliding window of size 10. The X-means algorithm operated with a cluster range of 2 to 8 for Seaquest and 2 to 10 for Half Cheetah. Discrete SAC and regular SAC agents from d3rlpy (Seno & Imai, 2022) were employed for Seaquest and Half Cheetah respectively, with hyperparameters consistent with the original paper: actor, critic, and temperature learning rates of 3×10⁻⁴, and a batch size of 256 for Seaquest and 512 for Half Cheetah. T_soft was set to 10³ and 10⁴ for Seaquest and Half Cheetah respectively. |
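The cluster-attribution step described in the Pseudocode row (Algorithm 2) can be illustrated with a minimal Python sketch. This is an assumption-laden simplification, not the authors' code: the function name, the equality test on actions, and the softmax-with-temperature rule over embedding distances (using T_soft) are all illustrative choices inferred from the row's inputs and outputs.

```python
import numpy as np

def generate_cluster_attribution(state, pi_orig, exp_policies, d_orig, d_comp, t_soft):
    """Hypothetical sketch of Algorithm 2: attribute a decision at `state`
    to the trajectory cluster whose removal changes the original action,
    softly weighted by how close the complementary-data embedding stays
    to the original data embedding (temperature t_soft)."""
    a_orig = pi_orig(state)
    # Candidate clusters: those whose explanation policy picks a different action.
    candidates = [j for j, pi_j in enumerate(exp_policies) if pi_j(state) != a_orig]
    if not candidates:
        return None, {}  # no cluster's removal changes the action
    # Distance of each complementary-data embedding d_j to the original embedding.
    dists = np.array([np.linalg.norm(d_comp[j] - d_orig) for j in candidates])
    # Soft attribution: smaller embedding deviation -> higher weight.
    weights = np.exp(-dists / t_soft)
    weights /= weights.sum()
    c_final = candidates[int(np.argmax(weights))]
    return c_final, dict(zip(candidates, weights))
```

With toy policies and embeddings, the function returns the action-changing cluster whose complementary embedding deviates least from the original, plus the soft weights over all candidates.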