ModelDiff: Symbolic Dynamic Programming for Model-Aware Policy Transfer in Deep Q-Learning

Authors: Xiaotian Liu, Jihwan Jeong, Ayal Taitler, Michael Gimelfarb, Scott Sanner

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that MD-DQN matches or outperforms existing TL methods and baselines in both positive and negative transfer settings. ... We evaluated the ModelDiff-informed MD-DQN (from here out, shortened to just MD-DQN) derived lower bound on three different domains in a transfer learning setting with source and target tasks in each.
Researcher Affiliation | Academia | ¹University of Toronto, Toronto, ON, Canada; ²Vector Institute for AI, Toronto, ON, Canada; ³Ben-Gurion University of the Negev, Be'er Sheva, Israel
Pseudocode | Yes | Algorithm 1: ModelDiff DQN (MD-DQN)
Open Source Code | No | The paper does not provide a specific link to source code or an explicit statement about its release in the main text or supplementary materials.
Open Datasets | No | We evaluate MD-DQN on three benchmark domains: POWERGEN, PICK-AND-PLACE, and RESERVOIR. ... All the domains and tasks have been implemented and trained in pyRDDLGym (Taitler et al. 2022).
Dataset Splits | No | The paper describes reinforcement learning experiments within custom environments (POWERGEN, PICK-AND-PLACE, RESERVOIR) and does not mention explicit training/test/validation dataset splits, which are typically used for supervised learning tasks.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., CPU, GPU models, memory).
Software Dependencies | No | All the domains and tasks have been implemented and trained in pyRDDLGym (Taitler et al. 2022). (No version number is specified for pyRDDLGym or other software dependencies.)
Experiment Setup | Yes | Each task is set with a horizon of 20, and the discount factor γ is fixed at 0.9. ... The modified DQN loss for a transition (s, a, s′), derived from the lower bound Q_t^{π_s}, is (y_target − Q(s, a; θ))², with y_target = max( r(s, a) + γ max_{a′} Q(s′, a′; θ⁻), Q_t^{π_s}(s, a) ). ... Algorithm 1: ModelDiff DQN (MD-DQN) includes the epsilon-greedy action selection probability ϵ.
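The modified target quoted in the Experiment Setup row can be sketched in a few lines: the standard DQN bootstrap target is clamped from below by the ModelDiff-derived lower bound. The function and argument names below (`md_dqn_target`, `q_lb`, etc.) are illustrative assumptions, since the paper's code is not released; this is a minimal sketch of the target computation, not the authors' implementation.

```python
import random

def md_dqn_target(r, gamma, q_next, q_lb):
    """MD-DQN regression target for one transition (s, a, s').

    r      -- immediate reward r(s, a)
    gamma  -- discount factor (the paper fixes gamma = 0.9)
    q_next -- list of target-network values Q(s', a'; theta^-) over actions a'
    q_lb   -- ModelDiff-derived lower bound Q_t^{pi_s}(s, a)

    The usual DQN target r + gamma * max_a' Q(s', a') is clamped from
    below by the symbolic lower bound, so the learned Q-value estimate
    never falls below what the source policy already guarantees.
    """
    standard = r + gamma * max(q_next)
    return max(standard, q_lb)

def md_dqn_loss(q_sa, target):
    """Squared TD error (y_target - Q(s, a; theta))^2 against the clamped target."""
    return (target - q_sa) ** 2

def epsilon_greedy(q_values, epsilon):
    """Epsilon-greedy action selection as referenced in Algorithm 1:
    a uniformly random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

For example, with r = 1.0, γ = 0.9, and next-state values [0.5, 2.0], the standard target is 2.8; a lower bound of 3.5 lifts the target to 3.5, while a bound of 0.0 leaves it at 2.8.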