Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Behaviour Discovery and Attribution for Explainable Reinforcement Learning
Authors: Rishav Rishav, Somjit Nath, Vincent Michalski, Samira Ebrahimi Kahou
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on four diverse offline RL environments show that our approach discovers meaningful behaviors and outperforms trajectory-level baselines in fidelity, human preference, and cluster coherence. |
| Researcher Affiliation | Academia | Rishav Rishav, University of Calgary, Mila; Somjit Nath, McGill University, Mila; Vincent Michalski, Université de Montréal, Mila; Samira Ebrahimi Kahou, University of Calgary, Canada CIFAR AI Chair, Mila |
| Pseudocode | No | The paper describes the methodology using textual explanations, mathematical equations, and diagrams (e.g., Figure 2 for an overview), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | "Our code is publicly available" (footnote 1: https://rish-av.github.io/bexrl) |
| Open Datasets | Yes | We evaluate the effectiveness of our framework for behavior discovery and attribution using three benchmark environments: halfcheetah-medium-v2 and pen-expert-v1 from D4RL (Fu et al., 2020) and seaquest-mixed-v0 from the D4RL-Atari repository (Takuseno, 2025), as well as a custom environment, MiniGrid Two Goals Lava, based on the MiniGrid suite. |
| Dataset Splits | No | The paper does not explicitly provide specific training/test/validation splits for the datasets used to train the main models or VQ-VAE. It mentions that "A policy π is trained on the full dataset." and describes how metrics like Average Fidelity Score are computed over a sample of actions/episodes, but not the overall dataset partitioning. |
| Hardware Specification | Yes | Table 8 (hyperparameter settings) lists the hardware for all four environments (halfcheetah-medium-v2, MiniGrid Two Goals Lava, seaquest-mixed-v0, pen-expert-v1) as an A100 GPU. |
| Software Dependencies | No | Table 8 mentions the 'Optimizer Adam' but does not specify any programming languages, libraries, or other software components with version numbers needed for replication. |
| Experiment Setup | Yes | Table 8 reports hyperparameter settings for all four environments (halfcheetah-medium-v2, MiniGrid Two Goals Lava, seaquest-mixed-v0, pen-expert-v1): learning rate 1e-4 (all); sequence length 50 / variable (max 40) / 30 / 30; batch size 64 / 32 / 64 / 32; number of codes 128 / 16 / 64 / 64; embedding dimension 128 (all); combination parameter λ 0.75 / 0.45 / 0.6 / 0.6; 50 epochs (all); Adam optimizer (all); linear-decay LR scheduler (all); teacher forcing linearly decayed to 0 (all); 4 transformer heads (all); encoder/decoder layers 4 / 2 / 4 / 4; transformer hidden dim 128 (all); frame skip 4. |
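For convenience, the flattened Table 8 values can be collected into a per-environment mapping. This is an illustrative sketch, not code from the paper: the dictionary structure and key names are assumptions, while the values are transcribed from the quoted table. Frame skip is listed only once in the table, so it is noted separately rather than attributed to a specific environment.

```python
# Hyperparameters transcribed from Table 8 of the paper.
# Key names and the dictionary layout are illustrative, not from the paper.
# Note: the table lists "Frame Skip 4" once without a clear per-environment
# assignment, so it is recorded here as a shared note rather than per env.
FRAME_SKIP_NOTE = 4

TABLE8_HPARAMS = {
    "halfcheetah-medium-v2": {
        "learning_rate": 1e-4,
        "seq_len": 50,
        "batch_size": 64,
        "num_codes": 128,
        "embedding_dim": 128,
        "lambda": 0.75,          # combination parameter
        "num_epochs": 50,
        "optimizer": "Adam",
        "lr_scheduler": "linear decay",
        "teacher_forcing": "linear decay to 0",
        "transformer_heads": 4,
        "enc_dec_layers": 4,
        "transformer_hidden_dim": 128,
    },
    "MiniGrid Two Goals Lava": {
        "learning_rate": 1e-4,
        "seq_len": "variable (max 40)",
        "batch_size": 32,
        "num_codes": 16,
        "embedding_dim": 128,
        "lambda": 0.45,
        "num_epochs": 50,
        "optimizer": "Adam",
        "lr_scheduler": "linear decay",
        "teacher_forcing": "linear decay to 0",
        "transformer_heads": 4,
        "enc_dec_layers": 2,
        "transformer_hidden_dim": 128,
    },
    "seaquest-mixed-v0": {
        "learning_rate": 1e-4,
        "seq_len": 30,
        "batch_size": 64,
        "num_codes": 64,
        "embedding_dim": 128,
        "lambda": 0.6,
        "num_epochs": 50,
        "optimizer": "Adam",
        "lr_scheduler": "linear decay",
        "teacher_forcing": "linear decay to 0",
        "transformer_heads": 4,
        "enc_dec_layers": 4,
        "transformer_hidden_dim": 128,
    },
    "pen-expert-v1": {
        "learning_rate": 1e-4,
        "seq_len": 30,
        "batch_size": 32,
        "num_codes": 64,
        "embedding_dim": 128,
        "lambda": 0.6,
        "num_epochs": 50,
        "optimizer": "Adam",
        "lr_scheduler": "linear decay",
        "teacher_forcing": "linear decay to 0",
        "transformer_heads": 4,
        "enc_dec_layers": 4,
        "transformer_hidden_dim": 128,
    },
}
```

A mapping like this makes it easy to diff the settings that vary across environments (sequence length, batch size, codebook size, λ, encoder/decoder depth) against those held fixed.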