reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Rule-Guided Reinforcement Learning Policy Evaluation and Improvement

Authors: Martin Tappler, Ignacio D. Lopez-Miguel, Sebastian Tschiatschek, Ezio Bartocci

IJCAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show the efficacy of our approach by demonstrating that it effectively finds weaknesses, accompanied by explanations of these weaknesses, in eleven RL environments and by showcasing that guiding policy execution with rules improves performance w.r.t. gained reward.
Researcher Affiliation	Academia	1 TU Wien, 2 University of Vienna; EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 Policy evaluation guided by Rules Input: Q-function Q, set of gen. rules G, # eval. episodes n Output: Average Cumulative Reward 1: Rews ← [] 2: for i ← 1 to n do 3: s ← RESET(), rew ← 0 4: while s not terminal do 5: G_t ← {(action(r), polarity(r)) | r ∈ G, s ⊨ r} 6: if |{(a, +) ∈ G_t}| = 1 then 7: act ← a 8: else 9: q ← Q(s, ·) 10: q(s, a) ← −∞ for (a, −) ∈ G_t 11: act ← arg max_a q(s, a) 12: s, r ← STEP(s, act), rew ← rew + r 13: APPEND(Rews, rew) 14: return mean(Rews), stderr(Rews)
Open Source Code	Yes	Code and data from the experiments are available at https://doi.org/10.6084/m9.figshare.28569017.
Open Datasets	Yes	All experiments are based on RL policies trained in six PAC-Man levels [De Nero and Klein, 2010] and five highway-env [Leurent, 2018] environments using stable-baselines3 [Raffin et al., 2021].
Dataset Splits	No	The paper does not explicitly provide training/test/validation dataset splits with exact percentages, counts, or predefined splits. It describes how experiences are generated for rule mining but not formal dataset splits for model training.
Hardware Specification	No	The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. It only mentions the environments and policies used.
Software Dependencies	Yes	All experiments are based on RL policies trained in six PAC-Man levels [De Nero and Klein, 2010] and five highway-env [Leurent, 2018] environments using stable-baselines3 [Raffin et al., 2021].
Experiment Setup	Yes	We trained DQN [Mnih et al., 2015] policies for 2.5 × 10^6 steps in the small and medium PAC-Man environments with 69-dimensional states and for 5 × 10^6 steps in the original-sized environments with 117-dimensional states. In highway-env, we trained DQN policies for 5 × 10^5 steps to navigate in driving scenarios... For both types of environments, we sample n_rule = 600 episodes to generate experiences for rule mining and impose a minimal accuracy of 0.9 and a minimal coverage of 0.01 on rules, discarding all other rules.
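
The rule-guided evaluation loop quoted in the Pseudocode row (Algorithm 1) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The rule encoding as a (condition, action, polarity) tuple, the function names select_action and evaluate, the reset/step callables, and the toy three-step environment are all assumptions introduced here. Standard error is computed as the sample standard deviation divided by the square root of n, which is one common reading of stderr(Rews).

```python
import math
from statistics import mean, stdev
from typing import Callable, List, Sequence, Tuple

# Assumed rule encoding: (condition on state, recommended action, polarity "+" or "-").
Rule = Tuple[Callable[[object], bool], int, str]


def select_action(q_values: Sequence[float], state, rules: Sequence[Rule]) -> int:
    """Pick an action for `state` following lines 5-11 of Algorithm 1."""
    # G_t: (action, polarity) pairs of all rules whose condition holds in `state`.
    active = {(action, polarity) for cond, action, polarity in rules if cond(state)}
    positive = [action for action, polarity in active if polarity == "+"]
    if len(positive) == 1:
        # Exactly one positively recommended action: follow it.
        return positive[0]
    # Otherwise mask actions that a negative rule advises against, then act greedily.
    q = list(q_values)
    for action, polarity in active:
        if polarity == "-":
            q[action] = -math.inf
    return max(range(len(q)), key=q.__getitem__)


def evaluate(q_function, rules, reset, step, n_episodes: int):
    """Return (mean, standard error) of cumulative reward over n_episodes."""
    returns: List[float] = []
    for _ in range(n_episodes):
        state, total, done = reset(), 0.0, False
        while not done:
            action = select_action(q_function(state), state, rules)
            state, reward, done = step(state, action)
            total += reward
        returns.append(total)
    std_err = stdev(returns) / math.sqrt(len(returns)) if len(returns) > 1 else 0.0
    return mean(returns), std_err


# Hypothetical toy environment: three steps per episode, action 1 pays reward 1.
def toy_reset():
    return 0


def toy_step(state, action):
    next_state = state + 1
    return next_state, float(action), next_state >= 3


def toy_q(state):
    # A deliberately poor Q-function that always prefers action 0.
    return [1.0, 0.0]


if __name__ == "__main__":
    forbid_action_0 = [(lambda s: True, 0, "-")]
    print(evaluate(toy_q, [], toy_reset, toy_step, 4))
    print(evaluate(toy_q, forbid_action_0, toy_reset, toy_step, 4))
```

In this toy setting the unguided greedy policy collects no reward, while a single negative rule that rules out action 0 in every state raises the return to 3 per episode. Because G_t is built as a set, several rules that recommend the same action count only once in the uniqueness check on line 6 of the algorithm.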