Rule-Guided Reinforcement Learning Policy Evaluation and Improvement

Authors: Martin Tappler, Ignacio D. Lopez-Miguel, Sebastian Tschiatschek, Ezio Bartocci

IJCAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We show the efficacy of our approach by demonstrating that it effectively finds weaknesses, accompanied by explanations of these weaknesses, in eleven RL environments and by showcasing that guiding policy execution with rules improves performance w.r.t. gained reward.
Researcher Affiliation Academia 1 TU Wien, 2 University of Vienna
Pseudocode Yes Algorithm 1: Policy evaluation guided by rules
    Input: Q-function Q, set of gen. rules G, # eval. episodes n
    Output: Average Cumulative Reward
    Rews ← []
    for i ← 1 to n do
        s ← RESET(), rew ← 0
        while s not terminal do
            Gt ← {(action(r), polarity(r)) | r ∈ G, s ⊨ r}
            if |{(a, +) ∈ Gt}| = 1 then
                act ← a
            else
                q ← Q(s, ·)
                q(s, a) ← -∞ for (a, -) ∈ Gt
                act ← arg max_a q(s, a)
            s, r ← STEP(s, act), rew ← rew + r
        APPEND(Rews, rew)
    return mean(Rews), stderr(Rews)
Open Source Code Yes Code and data from the experiments are available at https://doi.org/10.6084/m9.figshare.28569017.
Open Datasets Yes All experiments are based on RL policies trained in six PAC-Man levels [De Nero and Klein, 2010] and five highway-env [Leurent, 2018] environments using stable-baselines3 [Raffin et al., 2021].
Dataset Splits No The paper does not explicitly provide training/test/validation dataset splits with exact percentages, counts, or predefined splits. It describes how experiences are generated for rule mining but not formal dataset splits for model training.
Hardware Specification No The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It only mentions the environments and policies used.
Software Dependencies Yes All experiments are based on RL policies trained in six PAC-Man levels [De Nero and Klein, 2010] and five highway-env [Leurent, 2018] environments using stable-baselines3 [Raffin et al., 2021].
Experiment Setup Yes We trained DQN [Mnih et al., 2015] policies for 2.5 × 10^6 steps in the small and medium PAC-Man environments with 69-dimensional states and for 5 × 10^6 steps in the original-sized environments with 117-dimensional states. In highway-env, we trained DQN policies for 5 × 10^5 steps to navigate in driving scenarios... For both types of environments, we sample n_rule = 600 episodes to generate experiences for rule mining and impose a minimal accuracy of 0.9 and a minimal coverage of 0.01 on rules, discarding all other rules.
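
The rule filter from the Experiment Setup row and the evaluation loop from Algorithm 1 can be sketched together in Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The Rule record and its field names, the reset/step callables, and the toy environment are all hypothetical. The sketch also assumes that actions are integers indexing the Q-values, that accuracy and coverage have already been computed for each rule, and that the standard error is the sample standard deviation divided by the square root of n.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule record; the field names are illustrative assumptions."""
    applies: Callable[[int], bool]  # True when the state satisfies the rule
    action: int                     # action the rule talks about
    polarity: str                   # "+" recommends the action, "-" forbids it
    accuracy: float                 # assumed to be precomputed during mining
    coverage: float                 # assumed to be precomputed during mining


def filter_rules(rules, min_accuracy=0.9, min_coverage=0.01):
    """Keep rules that meet both thresholds; all other rules are discarded."""
    return [r for r in rules
            if r.accuracy >= min_accuracy and r.coverage >= min_coverage]


def evaluate_with_rules(q_function, rules, n_episodes, reset, step):
    """Rule-guided evaluation in the spirit of Algorithm 1.

    Assumed interfaces: reset() -> state, step(state, action) ->
    (next_state, reward, terminal), q_function(state) -> Q-values per action.
    """
    returns = []
    for _ in range(n_episodes):
        state, total, terminal = reset(), 0.0, False
        while not terminal:
            # Collect (action, polarity) pairs of all rules the state satisfies.
            active = {(r.action, r.polarity) for r in rules if r.applies(state)}
            recommended = {a for a, p in active if p == "+"}
            if len(recommended) == 1:
                # Exactly one recommended action: follow it.
                action = next(iter(recommended))
            else:
                # Otherwise mask forbidden actions and act greedily on Q.
                q = list(q_function(state))
                for a, p in active:
                    if p == "-":
                        q[a] = -math.inf
                action = max(range(len(q)), key=q.__getitem__)
            state, reward, terminal = step(state, action)
            total += reward
        returns.append(total)
    mean = statistics.fmean(returns)
    # Sample standard error; reported as 0.0 for a single episode (a choice).
    stderr = statistics.stdev(returns) / math.sqrt(len(returns)) if len(returns) > 1 else 0.0
    return mean, stderr


# Toy chain environment (hypothetical): action 1 moves right for reward 1,
# action 0 ends the episode with reward 0; state 3 is terminal.
def reset():
    return 0


def step(state, action):
    if action == 1:
        nxt = state + 1
        return nxt, 1.0, nxt == 3
    return state, 0.0, True


def q_function(state):
    return [1.0, 0.0]  # a deliberately poor policy that always prefers action 0


mined = [
    Rule(lambda s: True, action=0, polarity="-", accuracy=0.95, coverage=0.20),
    Rule(lambda s: True, action=1, polarity="-", accuracy=0.50, coverage=0.20),
]
kept = filter_rules(mined)  # the second rule fails the accuracy threshold
baseline = evaluate_with_rules(q_function, [], 5, reset, step)
guided = evaluate_with_rules(q_function, kept, 5, reset, step)
```

In this toy setup the unguided policy stops immediately, while the surviving negative rule masks action 0 and pushes the agent to the end of the chain. If every action is forbidden in a state, the greedy step in this sketch falls back to the first action; the source does not say how that case is handled.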