BlendRL: A Framework for Merging Symbolic and Neural Policy Learning
Authors: Hikaru Shindo, Quentin Delfosse, Devendra Singh Dhami, Kristian Kersting
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS We outline the benefits of BlendRL over purely neural or symbolic approaches, supported by additional investigations into BlendRL's robustness to environmental changes. Furthermore, we examine the interactions between neural and symbolic components and demonstrate that BlendRL can generate faithful explanations. We specifically aim to answer the following research questions: (Q1) Can BlendRL agents overcome both symbolic and neural agents' shortcomings? (Q2) Can BlendRL produce both neural and symbolic explanations for its action selection? (Q3) Are BlendRL agents robust to environmental changes? (Q4) How do the neural and symbolic modules interact to maximize BlendRL agents' overall performance? ... Table 1: BlendRL surpasses deep and symbolic agents in our evaluation. Human-normalized scores of BlendRL against different deep (DQN and PPO) and symbolic (NLRL, NUDGE, SCoBots, INTERPRETER, and INSIGHT) baselines. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Technical University of Darmstadt, Germany 2Dept. of Mathematics and Computer Science, Eindhoven University of Technology, Netherlands 3Hessian Center for Artificial Intelligence (hessian.AI), Germany 4German Research Center for Artificial Intelligence (DFKI), Germany 5Centre for Cognitive Science, Technical University of Darmstadt, Germany |
| Pseudocode | Yes | Algorithm 1: BlendRL Policy Reasoning. Input: π^neural_θ, π^logic_φ, V^CNN_μ, V^OC_ω, blending function B, state (x, z). 1: β = B(x, z) # compute the blending weight β; 2: action ∼ β·π^neural_θ(x) + (1−β)·π^logic_φ(z) # sample the action from the mixed policy; 3: value = β·V^CNN_μ(x) + (1−β)·V^OC_ω(z) # compute the state value using β; 4: return action, value |
| Open Source Code | Yes | Our code and resources are openly available.1 https://github.com/ml-research/blendrl |
| Open Datasets | Yes | Environments. We evaluate Blend RL on Atari Learning Environments (Bellemare et al., 2013), the most popular benchmark for RL (particularly for relational reasoning tasks). |
| Dataset Splits | No | We train each agent type until all of them converge to a stable episodic return (i.e., for 15K episodes for Kangaroo and Donkey Kong and 25K for Seaquest). This section describes training duration but does not provide specific train/test/validation dataset splits or a methodology for partitioning the data itself. |
| Hardware Specification | Yes | A.7 EXPERIMENTAL DETAILS Hardware. All experiments were performed on one NVIDIA A100-SXM4-40GB GPU with a Xeon(R) 8174 CPU @ 3.10GHz and 100 GB of RAM. |
| Software Dependencies | No | We adopted an implementation of the PPO algorithm from the CleanRL project (Huang et al., 2022). ... GPT-4o was the LLM that we consistently used in our experiments. While GPT-4o is mentioned, the paper does not specify versions for other core software, such as Python, PyTorch, or the specific CleanRL release, which would be necessary for full reproducibility. |
| Experiment Setup | Yes | A.8 TRAINING DETAILS We hereby provide further details about the training; details regarding environments are provided in the next section, A.9. We used the Adam optimizer (Kingma & Ba, 2015) for all baselines. BlendRL. We adopted an implementation of the PPO algorithm from the CleanRL project (Huang et al., 2022). Hyperparameters are shown in Table 4, the object-centric critic is described in Table 5, and pseudocode for BlendRL policy reasoning is given in Algorithm 1. Hyperparameters (Table 4): γ = 0.99 (discount factor for future rewards); learning rate = 0.00025 (neural modules); logic learning rate = 0.00025 (logic modules); blender learning rate = 0.00025 (blending module); blend ent coef = 0.01 (entropy coefficient for blending regularization, Eq. 3); clip coef = 0.1 (PPO surrogate clipping coefficient); ent coef = 0.01 (entropy coefficient for policy optimization); max grad norm = 0.5 (maximum norm for gradient clipping); num envs = 512 (parallel environments); num steps = 128 (steps per policy rollout); total timesteps = 20,000,000 (total training timesteps). |
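The blended policy reasoning quoted in the Pseudocode row (Algorithm 1) can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `pi_neural`, `pi_logic`, `v_cnn`, `v_oc`, and `blender` are hypothetical callables standing in for π^neural_θ, π^logic_φ, V^CNN_μ, V^OC_ω, and B.

```python
import random

def blend_policy_step(pi_neural, pi_logic, v_cnn, v_oc, blender, x, z):
    """One step of blended policy reasoning (sketch of Algorithm 1).

    pi_neural(x) and pi_logic(z) return action-probability lists over the
    same action space; v_cnn(x) and v_oc(z) return scalar state values;
    blender(x, z) returns the blending weight beta in [0, 1].
    x is the raw pixel state, z the object-centric state.
    """
    beta = blender(x, z)  # step 1: compute the blending weight
    # Step 2: sample an action from the beta-weighted mixture policy.
    probs = [beta * pn + (1.0 - beta) * pl
             for pn, pl in zip(pi_neural(x), pi_logic(z))]
    action = random.choices(range(len(probs)), weights=probs, k=1)[0]
    # Step 3: blend the two critics' state-value estimates with the same beta.
    value = beta * v_cnn(x) + (1.0 - beta) * v_oc(z)
    return action, value
```

With β = 1 the agent acts purely from the neural policy and CNN critic; with β = 0 it acts purely from the logic policy and object-centric critic, matching the convex combination in the algorithm.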
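The Table 4 hyperparameters quoted in the Experiment Setup row can be collected into a CleanRL-style config for reference. The key names below are illustrative assumptions, not necessarily the exact field names in the BlendRL code base; the values are the ones reported in the row above.

```python
# Hedged sketch: Table 4 hyperparameters as a plain dict
# (key names are assumptions; values are from the paper's Table 4).
blendrl_ppo_config = {
    "gamma": 0.99,                    # discount factor for future rewards
    "learning_rate": 2.5e-4,          # neural modules
    "logic_learning_rate": 2.5e-4,    # logic modules
    "blender_learning_rate": 2.5e-4,  # blending module
    "blend_ent_coef": 0.01,           # entropy coef for blending regularization
    "clip_coef": 0.1,                 # PPO surrogate clipping coefficient
    "ent_coef": 0.01,                 # entropy coef for policy optimization
    "max_grad_norm": 0.5,             # maximum norm for gradient clipping
    "num_envs": 512,                  # parallel environments
    "num_steps": 128,                 # steps per policy rollout
    "total_timesteps": 20_000_000,    # total training timesteps
}

# One rollout collects num_envs * num_steps transitions.
rollout_batch_size = blendrl_ppo_config["num_envs"] * blendrl_ppo_config["num_steps"]
```

With 512 environments and 128 steps per rollout, each policy update sees a 65,536-transition batch, which is consistent with the large-scale parallel training setup described above.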