BlendRL: A Framework for Merging Symbolic and Neural Policy Learning
Authors: Hikaru Shindo, Quentin Delfosse, Devendra Singh Dhami, Kristian Kersting
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS We outline the benefits of BlendRL over purely neural or symbolic approaches, supported by additional investigations into BlendRL's robustness to environmental changes. Furthermore, we examine the interactions between neural and symbolic components and demonstrate that BlendRL can generate faithful explanations. We specifically aim to answer the following research questions: (Q1) Can BlendRL agents overcome both symbolic and neural agents' shortcomings? (Q2) Can BlendRL produce both neural and symbolic explanations for its action selection? (Q3) Are BlendRL agents robust to environmental changes? (Q4) How do the neural and symbolic modules interact to maximize BlendRL agents' overall performance? ... Table 1: BlendRL surpasses deep and symbolic agents in our evaluation. Human-normalized scores of BlendRL against different deep (DQN and PPO) and symbolic (NLRL, NUDGE, SCoBots, INTERPRETER, and INSIGHT) baselines. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Technical University of Darmstadt, Germany 2Dept. of Mathematics and Computer Science, Eindhoven University of Technology, Netherlands 3Hessian Center for Artificial Intelligence (hessian.AI), Germany 4German Research Center for Artificial Intelligence (DFKI), Germany 5Centre for Cognitive Science, Technical University of Darmstadt, Germany |
| Pseudocode | Yes | Algorithm 1: BlendRL Policy Reasoning. Input: π^neural_θ, π^logic_φ, V^CNN_μ, V^OC_ω, blending function B, state (x, z). 1: β = B(x, z) # compute the blending weight β; 2: action ∼ β·π^neural_θ(x) + (1−β)·π^logic_φ(z) # sample the action from the mixed policy; 3: value = β·V^CNN_μ(x) + (1−β)·V^OC_ω(z) # compute the state value using β; 4: return action, value |
| Open Source Code | Yes | Our code and resources are openly available.1 https://github.com/ml-research/blendrl |
| Open Datasets | Yes | Environments. We evaluate Blend RL on Atari Learning Environments (Bellemare et al., 2013), the most popular benchmark for RL (particularly for relational reasoning tasks). |
| Dataset Splits | No | We train each agent type until all of them converge to a stable episodic return (i.e., for 15K episodes for Kangaroo and Donkey Kong and 25K for Seaquest). This section describes training duration but does not provide specific train/test/validation dataset splits or a methodology for partitioning the data itself. |
| Hardware Specification | Yes | A.7 EXPERIMENTAL DETAILS Hardware. All experiments were performed on one NVIDIA A100-SXM4-40GB GPU with a Xeon(R) 8174 CPU @ 3.10GHz and 100 GB of RAM. |
| Software Dependencies | No | We adopted an implementation of the PPO algorithm from the CleanRL project (Huang et al., 2022). ... GPT-4o was the LLM that we consistently used in our experiments. While GPT-4o is mentioned, the paper does not specify versions for other core software, such as Python, PyTorch, or the specific CleanRL release, which would be necessary for full reproducibility. |
| Experiment Setup | Yes | A.8 TRAINING DETAILS We hereby provide further details about the training; details regarding environments are provided in the next section, A.9. We used the Adam optimizer (Kingma & Ba, 2015) for all baselines. BlendRL. We adopted an implementation of the PPO algorithm from the CleanRL project (Huang et al., 2022). Hyperparameters are shown in Table 4, the object-centric critic is described in Table 5, and pseudocode for BlendRL policy reasoning is given in Algorithm 1. Hyperparameters (Table 4): γ = 0.99 (discount factor for future rewards); learning rate = 0.00025 (neural modules); logic learning rate = 0.00025 (logic modules); blender learning rate = 0.00025 (blending module); blend ent coef = 0.01 (entropy coefficient for blending regularization, Eq. 3); clip coef = 0.1 (PPO surrogate clipping coefficient); ent coef = 0.01 (entropy coefficient for policy optimization); max grad norm = 0.5 (maximum norm for gradient clipping); num envs = 512 (parallel environments); num steps = 128 (steps per policy rollout); total timesteps = 20,000,000 (total training timesteps). |
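The blended policy reasoning quoted in the Pseudocode row (Algorithm 1) can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `pi_neural`, `pi_logic`, `v_cnn`, `v_oc`, and `blender` are hypothetical callables standing in for π^neural_θ, π^logic_φ, V^CNN_μ, V^OC_ω, and B.

```python
import random

def blend_policy_step(pi_neural, pi_logic, v_cnn, v_oc, blender, x, z):
    """One step of blended policy reasoning (sketch of Algorithm 1).

    pi_neural(x) and pi_logic(z) return action-probability lists over the
    same action space; v_cnn(x) and v_oc(z) return scalar state values;
    blender(x, z) returns the blending weight beta in [0, 1].
    x is the raw pixel state, z the object-centric state.
    """
    beta = blender(x, z)  # step 1: compute the blending weight
    # Step 2: sample an action from the beta-weighted mixture policy.
    probs = [beta * pn + (1.0 - beta) * pl
             for pn, pl in zip(pi_neural(x), pi_logic(z))]
    action = random.choices(range(len(probs)), weights=probs, k=1)[0]
    # Step 3: blend the two critics' state-value estimates with the same beta.
    value = beta * v_cnn(x) + (1.0 - beta) * v_oc(z)
    return action, value
```

With β = 1 the agent acts purely from the neural policy and CNN critic; with β = 0 it acts purely from the logic policy and object-centric critic, matching the convex combination in the algorithm.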
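The Table 4 hyperparameters quoted in the Experiment Setup row can be collected into a CleanRL-style config for reference. The key names below are illustrative assumptions, not necessarily the exact field names in the BlendRL code base; the values are the ones reported in the row above.

```python
# Hedged sketch: Table 4 hyperparameters as a plain dict
# (key names are assumptions; values are from the paper's Table 4).
blendrl_ppo_config = {
    "gamma": 0.99,                    # discount factor for future rewards
    "learning_rate": 2.5e-4,          # neural modules
    "logic_learning_rate": 2.5e-4,    # logic modules
    "blender_learning_rate": 2.5e-4,  # blending module
    "blend_ent_coef": 0.01,           # entropy coef for blending regularization
    "clip_coef": 0.1,                 # PPO surrogate clipping coefficient
    "ent_coef": 0.01,                 # entropy coef for policy optimization
    "max_grad_norm": 0.5,             # maximum norm for gradient clipping
    "num_envs": 512,                  # parallel environments
    "num_steps": 128,                 # steps per policy rollout
    "total_timesteps": 20_000_000,    # total training timesteps
}

# One rollout collects num_envs * num_steps transitions.
rollout_batch_size = blendrl_ppo_config["num_envs"] * blendrl_ppo_config["num_steps"]
```

With 512 environments and 128 steps per rollout, each policy update sees a 65,536-transition batch, which is consistent with the large-scale parallel training setup described above.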