Rule-Guided Reinforcement Learning Policy Evaluation and Improvement
Authors: Martin Tappler, Ignacio D. Lopez-Miguel, Sebastian Tschiatschek, Ezio Bartocci
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show the efficacy of our approach by demonstrating that it effectively finds weaknesses, accompanied by explanations of these weaknesses, in eleven RL environments and by showcasing that guiding policy execution with rules improves performance w.r.t. gained reward. |
| Researcher Affiliation | Academia | ¹TU Wien, ²University of Vienna |
| Pseudocode | Yes | Algorithm 1 Policy Evaluation Guided by Rules. Input: Q-function Q, set of gen. rules G, # eval. episodes n. Output: average cumulative reward. 1: Rews ← [] 2: for i ← 1 to n do 3: s ← RESET(), rew ← 0 4: while s not terminal do 5: Gt ← {(action(r), polarity(r)) \| r ∈ G, s ⊨ r} 6: if \|{(a, +) ∈ Gt}\| = 1 then 7: act ← a 8: else 9: q ← Q(s, ·) 10: q(s, a) ← −∞ for (a, −) ∈ Gt 11: act ← arg max_a q(s, a) 12: s, r ← STEP(s, act), rew ← rew + r 13: APPEND(Rews, rew) 14: return mean(Rews), stderr(Rews) |
| Open Source Code | Yes | Code and data from the experiments are available at https://doi.org/10.6084/m9.figshare.28569017. |
| Open Datasets | Yes | All experiments are based on RL policies trained in six PAC-Man levels [De Nero and Klein, 2010] and five highway-env [Leurent, 2018] environments using stable-baselines3 [Raffin et al., 2021]. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits with exact percentages, counts, or predefined splits. It describes how experiences are generated for rule mining but not formal dataset splits for model training. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It only mentions the environments and policies used. |
| Software Dependencies | Yes | All experiments are based on RL policies trained in six PAC-Man levels [De Nero and Klein, 2010] and five highway-env [Leurent, 2018] environments using stable-baselines3 [Raffin et al., 2021]. |
| Experiment Setup | Yes | We trained DQN [Mnih et al., 2015] policies for 2.5 × 10^6 steps in the small and medium PAC-Man environments with 69-dimensional states and for 5 × 10^6 steps in the original-sized environments with 117-dimensional states. In highway-env, we trained DQN policies for 5 × 10^5 steps to navigate in driving scenarios... For both types of environments, we sample nrule = 600 episodes to generate experiences for rule mining and impose a minimal accuracy of 0.9 and a minimal coverage of 0.01 on rules, discarding all other rules. |
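The Algorithm 1 cell above can be sketched in executable form. This is a minimal illustration under assumed interfaces, not the paper's implementation: `rules` is a list of `(predicate, action, polarity)` triples standing in for the mined rules G, `q_fn(state)` returns a dict mapping actions to Q-values (the `Q(s, ·)` lookup), and `env` follows a Gym-style `reset()`/`step()` protocol. The toy `LineEnv` exists only to make the sketch runnable.

```python
import math
from statistics import mean

def evaluate_with_rules(env, q_fn, rules, n_episodes):
    """Rule-guided policy evaluation, per the paper's Algorithm 1 sketch."""
    rewards = []
    for _ in range(n_episodes):
        state, done, ep_rew = env.reset(), False, 0.0
        while not done:
            # Gt: rules whose precondition holds in the current state (s |= r)
            triggered = [(a, pol) for pred, a, pol in rules if pred(state)]
            positive = [a for a, pol in triggered if pol == '+']
            if len(positive) == 1:
                act = positive[0]            # exactly one recommendation: follow it
            else:
                q = dict(q_fn(state))        # q <- Q(s, .)
                for a, pol in triggered:     # mask actions that rules advise against
                    if pol == '-':
                        q[a] = -math.inf
                act = max(q, key=q.get)      # act <- argmax_a q(s, a)
            state, r, done = env.step(act)
            ep_rew += r
        rewards.append(ep_rew)
    return mean(rewards)

class LineEnv:
    """Toy line world: action 1 moves right, action 0 left; reward 1 at +3."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, a):
        self.pos += 1 if a == 1 else -1
        done = self.pos in (3, -3)
        return self.pos, (1.0 if self.pos == 3 else 0.0), done
```

With a misleading Q-function that prefers moving left, a single positive rule recommending action 1 (or a negative rule masking action 0) steers the episode to the reward, illustrating how rule guidance can improve gained reward.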
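The setup row mentions discarding mined rules below a minimal accuracy of 0.9 and a minimal coverage of 0.01. A plausible reading of that filter can be sketched as follows; the semantics here (coverage = fraction of experiences the precondition matches, accuracy = fraction of covered experiences where the rule agrees with the policy's action) and the function name `filter_rules` are assumptions for illustration, not taken from the paper.

```python
def filter_rules(rules, experiences, min_acc=0.9, min_cov=0.01):
    """Keep only rules meeting the accuracy/coverage thresholds.

    experiences: list of (state, action) pairs observed from the policy.
    rules: list of (predicate, action, polarity) triples.
    """
    kept = []
    n = len(experiences)
    for pred, act, pol in rules:
        covered = [(s, a) for s, a in experiences if pred(s)]
        if not covered or len(covered) / n < min_cov:
            continue  # rule fires too rarely to be informative
        if pol == '+':
            correct = sum(1 for s, a in covered if a == act)
        else:  # negative rule: correct when the policy avoided the action
            correct = sum(1 for s, a in covered if a != act)
        if correct / len(covered) >= min_acc:
            kept.append((pred, act, pol))
    return kept
```

Under this reading, a rule that matches many states but often disagrees with the policy fails the accuracy bound, while a rule that is always right but almost never fires fails the coverage bound.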