Counterfactual Explanations for Continuous Action Reinforcement Learning
Authors: Shuyang Dong, Shangtong Zhang, Lu Feng
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations in two RL domains, Diabetes Control and Lunar Lander, demonstrate the effectiveness, efficiency, and generalization of our approach. ... Experimental results demonstrate the effectiveness, efficiency, and generalization of our approach, paving the way for more interpretable and trustworthy RL applications in high-stakes settings. |
| Researcher Affiliation | Academia | Shuyang Dong, Shangtong Zhang, and Lu Feng, University of Virginia |
| Pseudocode | Yes | Algorithm 1: Counterfactual Generation |
| Open Source Code | Yes | Our implementation is based on Stable-Baselines3 [Raffin et al., 2021]. Code is available at: https://github.com/safe-autonomy-lab/Counterfactual RL |
| Open Datasets | Yes | We implemented the proposed approach and evaluated it in two RL domains: (i) diabetes control using the FDA-approved UVA/PADOVA simulator [Man et al., 2014], and (ii) Lunar Lander from OpenAI Gym [Brockman, 2016]. |
| Dataset Splits | Yes | Both settings included 18 unique trajectories in each training and test set. ... Both settings included 12 randomly sampled trajectories in each training and test set. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory amounts) are provided in the paper. |
| Software Dependencies | No | Our implementation is based on Stable-Baselines3 [Raffin et al., 2021]. While Stable-Baselines3 is mentioned, specific version numbers for it or other software dependencies like Python, PyTorch, or CUDA are not provided in the text. |
| Experiment Setup | Yes | In the single-environment setting, a baseline policy was trained on a chosen patient profile for 100,000 steps, with a learning rate of 0.0001 and a gradient step size of 50. ... After a warm-up phase, batches of 256 trajectories were sampled for model updates using a learning rate of 0.0001 and 50 gradient steps. ... In the single-environment setting, a baseline policy was trained for 3,000 steps with a learning rate of 0.0001 and a gradient step size of 20. ... The proposed approach was used to generate counterfactual trajectories with a learning rate of 0.00001 and a gradient step size of 20 |
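The hyperparameters quoted above can be collected into a configuration sketch for anyone attempting to reproduce the setup. This is a hedged reconstruction: the values come from the paper's text, but the parameter names, their mapping onto Stable-Baselines3-style keyword arguments, and the grouping by setting are assumptions, since the excerpt does not state them explicitly.

```python
# Hedged sketch of the reported experiment configurations.
# Values are quoted from the paper; the dictionary keys follow
# Stable-Baselines3 naming conventions as an assumption only.

diabetes_single_env = {
    "total_timesteps": 100_000,   # baseline policy training steps
    "learning_rate": 1e-4,
    "gradient_steps": 50,
    "batch_size": 256,            # per the paper: 256 trajectories per model update
}

lunar_lander_single_env = {
    "total_timesteps": 3_000,     # baseline policy training steps
    "learning_rate": 1e-4,
    "gradient_steps": 20,
}

counterfactual_generation = {
    "learning_rate": 1e-5,        # used to generate counterfactual trajectories
    "gradient_steps": 20,
}

def summarize(name: str, cfg: dict) -> str:
    """Return a one-line, sorted summary of a configuration dictionary."""
    return name + ": " + ", ".join(f"{k}={v}" for k, v in sorted(cfg.items()))

if __name__ == "__main__":
    for name, cfg in [("diabetes", diabetes_single_env),
                      ("lunar_lander", lunar_lander_single_env),
                      ("counterfactual", counterfactual_generation)]:
        print(summarize(name, cfg))
```

Note that the report lists Hardware Specification and Software Dependencies as "No", so even with these values a faithful reproduction would still require choosing library versions and hardware independently.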