Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Behaviour Discovery and Attribution for Explainable Reinforcement Learning
Authors: Rishav Rishav, Somjit Nath, Vincent Michalski, Samira Ebrahimi Kahou
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on four diverse offline RL environments show that our approach discovers meaningful behaviors and outperforms trajectory-level baselines in fidelity, human preference, and cluster coherence. |
| Researcher Affiliation | Academia | Rishav Rishav, University of Calgary, Mila; Somjit Nath, McGill University, Mila; Vincent Michalski, Université de Montréal, Mila; Samira Ebrahimi Kahou, University of Calgary, Canada CIFAR AI Chair, Mila |
| Pseudocode | No | The paper describes the methodology using textual explanations, mathematical equations, and diagrams (e.g., Figure 2 for an overview), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | "Our code is publicly available" (footnote 1: https://rish-av.github.io/bexrl) |
| Open Datasets | Yes | We evaluate the effectiveness of our framework for behavior discovery and attribution using three benchmark environments: halfcheetah-medium-v2 and pen-expert-v1 from D4RL (Fu et al., 2020) and seaquest-mixed-v0 from the D4RL-Atari repository (Takuseno, 2025), as well as a custom environment, MiniGrid Two Goals Lava, based on the MiniGrid suite. |
| Dataset Splits | No | The paper does not explicitly provide specific training/test/validation splits for the datasets used to train the main models or VQ-VAE. It mentions that "A policy π is trained on the full dataset." and describes how metrics like Average Fidelity Score are computed over a sample of actions/episodes, but not the overall dataset partitioning. |
| Hardware Specification | Yes | Table 8 (hyperparameter settings) lists the hardware for all four environments (halfcheetah-medium-v2, MiniGrid Two Goals Lava, seaquest-mixed-v0, pen-expert-v1) as an A100 GPU. |
| Software Dependencies | No | Table 8 mentions the 'Optimizer Adam' but does not specify any programming languages, libraries, or other software components with version numbers needed for replication. |
| Experiment Setup | Yes | Table 8 reports hyperparameter settings for all four environments (halfcheetah-medium-v2, MiniGrid Two Goals Lava, seaquest-mixed-v0, pen-expert-v1): learning rate 1e-4 (all); sequence length 50 / variable (max 40) / 30 / 30; batch size 64 / 32 / 64 / 32; number of codes 128 / 16 / 64 / 64; embedding dimension 128 (all); combination parameter λ 0.75 / 0.45 / 0.6 / 0.6; 50 epochs (all); Adam optimizer (all); linear-decay LR scheduler (all); teacher forcing linearly decayed to 0 (all); 4 transformer heads (all); encoder/decoder layers 4 / 2 / 4 / 4; transformer hidden dim 128 (all); frame skip 4. |
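For convenience, the flattened Table 8 values can be collected into a per-environment mapping. This is an illustrative sketch, not code from the paper: the dictionary structure and key names are assumptions, while the values are transcribed from the quoted table. Frame skip is listed only once in the table, so it is noted separately rather than attributed to a specific environment.

```python
# Hyperparameters transcribed from Table 8 of the paper.
# Key names and the dictionary layout are illustrative, not from the paper.
# Note: the table lists "Frame Skip 4" once without a clear per-environment
# assignment, so it is recorded here as a shared note rather than per env.
FRAME_SKIP_NOTE = 4

TABLE8_HPARAMS = {
    "halfcheetah-medium-v2": {
        "learning_rate": 1e-4,
        "seq_len": 50,
        "batch_size": 64,
        "num_codes": 128,
        "embedding_dim": 128,
        "lambda": 0.75,          # combination parameter
        "num_epochs": 50,
        "optimizer": "Adam",
        "lr_scheduler": "linear decay",
        "teacher_forcing": "linear decay to 0",
        "transformer_heads": 4,
        "enc_dec_layers": 4,
        "transformer_hidden_dim": 128,
    },
    "MiniGrid Two Goals Lava": {
        "learning_rate": 1e-4,
        "seq_len": "variable (max 40)",
        "batch_size": 32,
        "num_codes": 16,
        "embedding_dim": 128,
        "lambda": 0.45,
        "num_epochs": 50,
        "optimizer": "Adam",
        "lr_scheduler": "linear decay",
        "teacher_forcing": "linear decay to 0",
        "transformer_heads": 4,
        "enc_dec_layers": 2,
        "transformer_hidden_dim": 128,
    },
    "seaquest-mixed-v0": {
        "learning_rate": 1e-4,
        "seq_len": 30,
        "batch_size": 64,
        "num_codes": 64,
        "embedding_dim": 128,
        "lambda": 0.6,
        "num_epochs": 50,
        "optimizer": "Adam",
        "lr_scheduler": "linear decay",
        "teacher_forcing": "linear decay to 0",
        "transformer_heads": 4,
        "enc_dec_layers": 4,
        "transformer_hidden_dim": 128,
    },
    "pen-expert-v1": {
        "learning_rate": 1e-4,
        "seq_len": 30,
        "batch_size": 32,
        "num_codes": 64,
        "embedding_dim": 128,
        "lambda": 0.6,
        "num_epochs": 50,
        "optimizer": "Adam",
        "lr_scheduler": "linear decay",
        "teacher_forcing": "linear decay to 0",
        "transformer_heads": 4,
        "enc_dec_layers": 4,
        "transformer_hidden_dim": 128,
    },
}
```

A mapping like this makes it easy to diff the settings that vary across environments (sequence length, batch size, codebook size, λ, encoder/decoder depth) against those held fixed.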