Counterfactual Explanations for Continuous Action Reinforcement Learning

Authors: Shuyang Dong, Shangtong Zhang, Lu Feng

IJCAI 2025

Reproducibility assessment (variable, result, and supporting evidence from the paper):
Research Type: Experimental
Evidence: "Evaluations in two RL domains, Diabetes Control and Lunar Lander, demonstrate the effectiveness, efficiency, and generalization of our approach, enabling more interpretable and trustworthy RL applications." "Experimental results demonstrate the effectiveness, efficiency, and generalization of our approach, paving the way for more interpretable and trustworthy RL applications in high-stakes settings."
Researcher Affiliation: Academia
Evidence: "Shuyang Dong, Shangtong Zhang and Lu Feng, University of Virginia, EMAIL"
Pseudocode: Yes
Evidence: "Algorithm 1: Counterfactual Generation"
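The paper presents its counterfactual generation only as pseudocode (Algorithm 1). A minimal, self-contained sketch of the general idea, gradient descent on an action sequence that trades off reaching a target outcome against staying close to the original actions, might look like the following. The function name `generate_counterfactual`, the loss form, and the distance weight `lam` are illustrative assumptions, not the authors' implementation.

```python
def generate_counterfactual(a_orig, outcome, target, lr=1e-4, steps=50, lam=0.1):
    """Hypothetical sketch of gradient-based counterfactual generation.

    Nudges an action sequence `a_orig` so that a differentiable outcome
    model `outcome` (list of actions -> scalar) approaches `target`,
    while a proximity penalty keeps the actions close to the original.
    Uses finite differences so the sketch needs no autodiff library.
    """
    def loss(a):
        # Assumed counterfactual loss: outcome gap plus proximity penalty.
        gap = (outcome(a) - target) ** 2
        dist = sum((x - y) ** 2 for x, y in zip(a, a_orig))
        return gap + lam * dist

    a = list(a_orig)
    eps = 1e-5
    for _ in range(steps):
        for i in range(len(a)):
            # Central finite-difference estimate of d(loss)/d(a[i]).
            a_hi = list(a); a_hi[i] += eps
            a_lo = list(a); a_lo[i] -= eps
            grad_i = (loss(a_hi) - loss(a_lo)) / (2 * eps)
            a[i] -= lr * grad_i
    return a
```

With a toy outcome model such as `outcome=sum`, a larger learning rate, and more steps, the returned actions move the outcome close to the target while the proximity term keeps them from drifting arbitrarily far.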
Open Source Code: Yes
Evidence: "Our implementation is based on Stable-Baselines3 [Raffin et al., 2021]." "Code is available at: https://github.com/safe-autonomy-lab/Counterfactual RL"
Open Datasets: Yes
Evidence: "We implemented the proposed approach and evaluated it in two RL domains: (i) diabetes control using the FDA-approved UVA/PADOVA simulator [Man et al., 2014], and (ii) Lunar Lander from OpenAI Gym [Brockman, 2016]."
Dataset Splits: Yes
Evidence: "Both settings included 18 unique trajectories in each training and test set." "Both settings included 12 randomly sampled trajectories in each training and test set."
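The splits quoted above (a fixed number of trajectories per training and test set, randomly sampled) amount to a plain random partition. A sketch of that procedure is below; the helper name `split_trajectories` and its interface are illustrative assumptions, not the authors' code.

```python
import random

def split_trajectories(trajectories, n_train, n_test, seed=0):
    """Randomly partition trajectories into disjoint train/test sets
    of fixed sizes (e.g. 18/18 or 12/12 per setting, as reported)."""
    if n_train + n_test > len(trajectories):
        raise ValueError("not enough trajectories for the requested split")
    rng = random.Random(seed)       # fixed seed for a reproducible split
    shuffled = list(trajectories)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]
```

For example, `split_trajectories(all_trajs, 18, 18)` yields two disjoint 18-trajectory sets matching the sizes reported for the first setting.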
Hardware Specification: No
Evidence: No specific hardware details (such as GPU models, CPU types, or memory amounts) are provided in the paper.
Software Dependencies: No
Evidence: "Our implementation is based on Stable-Baselines3 [Raffin et al., 2021]." While Stable-Baselines3 is mentioned, specific version numbers for it or for other dependencies such as Python, PyTorch, or CUDA are not provided in the text.
Experiment Setup: Yes
Evidence: "In the single-environment setting, a baseline policy was trained on a chosen patient profile for 100,000 steps, with a learning rate of 0.0001 and a gradient step size of 50." "After a warm-up phase, batches of 256 trajectories were sampled for model updates using a learning rate of 0.0001 and 50 gradient steps." "In the single-environment setting, a baseline policy was trained for 3,000 steps with a learning rate of 0.0001 and a gradient step size of 20." "The proposed approach was used to generate counterfactual trajectories with a learning rate of 0.00001 and a gradient step size of 20."
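One plausible way to organize the hyperparameters quoted above is a per-domain configuration mapping. The dictionary keys and grouping below are illustrative assumptions for readability; only the numeric values come from the paper.

```python
# Hypothetical grouping of the hyperparameters reported in the paper.
# Structure and key names are illustrative, not the authors' code.
EXPERIMENT_CONFIG = {
    "diabetes_control": {
        "baseline_policy": {"train_steps": 100_000,
                            "learning_rate": 1e-4,
                            "gradient_steps": 50},
        "model_updates":   {"batch_size": 256,      # trajectories per batch
                            "learning_rate": 1e-4,
                            "gradient_steps": 50},
    },
    "lunar_lander": {
        "baseline_policy": {"train_steps": 3_000,
                            "learning_rate": 1e-4,
                            "gradient_steps": 20},
        "counterfactual":  {"learning_rate": 1e-5,
                            "gradient_steps": 20},
    },
}
```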