Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Symbolic Task Inference in Deep Reinforcement Learning
Authors: Hosein Hasanbeig, Natasha Yogananda Jeppu, Alessandro Abate, Tom Melham, Daniel Kroening
JAIR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper proposes DeepSynth, a method for effective training of deep reinforcement learning agents when the reward is sparse or non-Markovian, but at the same time progress towards the reward requires achieving an unknown sequence of high-level objectives. Our method employs a novel algorithm for synthesis of compact finite state automata to uncover this sequential structure automatically. We synthesise a human-interpretable automaton from trace data collected by exploring the environment. The state space of the environment is then enriched with the synthesised automaton, so that the generation of a control policy by deep reinforcement learning is guided by the discovered structure encoded in the automaton. The proposed approach is able to cope with both high-dimensional, low-level features and unknown sparse or non-Markovian rewards. We have evaluated DeepSynth's performance in a set of experiments that includes the Atari game Montezuma's Revenge, known to be challenging. (A hedged sketch of this automaton-enriched episode loop appears after the table.) |
| Researcher Affiliation | Collaboration | Hosein Hasanbeig, Microsoft Research ... The work reported in this paper was done while Hosein Hasanbeig and Daniel Kroening were at the University of Oxford. ... Natasha Yogananda Jeppu, Department of Computer Science, University of Oxford ... Alessandro Abate, Department of Computer Science, University of Oxford ... Tom Melham, Department of Computer Science, University of Oxford ... Daniel Kroening, Amazon |
| Pseudocode | Yes | Algorithm 1: An Episode of Temporal DQN in DeepSynth; Algorithm 2: An Episode of Temporal NFQ in DeepSynth |
| Open Source Code | Yes | Full instructions on how to run DeepSynth are provided on a GitHub page that accompanies the distribution (Hasanbeig et al., 2021b): www.github.com/grockious/deepsynth (@018e606) |
| Open Datasets | Yes | The Minecraft environment (minecraft-tX) is taken from Andreas et al. (2017) ... The two mars-rover benchmarks are taken from Hasanbeig (2020) ... Montezuma's Revenge ... The example robot-surve is adopted from Sadigh et al. (2014) ... Models slp-easy and slp-hard are inspired by the noisy MDPs of Chapter 6 in Sutton and Barto (1998) ... The frozen-lake MDPs are stochastic and are adopted from the OpenAI Gym (Brockman et al., 2016). (A minimal frozen-lake usage sketch follows the table.) |
| Dataset Splits | Yes | The size of the replay buffer for each module is limited and in the case of our running example \|E_{q_i}\| = 15000. ... The values and descriptions of all the hyper-parameters are provided in Table 2. ... minibatch size 32 ... replay memory size 150000 ... agent history length 4 ... replay start size 8000 |
| Hardware Specification | Yes | All simulations have been carried out on a machine with an Intel Xeon 3.5 GHz processor, Nvidia Tesla V100 GPU and 16 GB of RAM, running Ubuntu 18. |
| Software Dependencies | No | The paper names 'Ubuntu 18' as the operating system and refers to algorithms and components such as 'DQN', 'NFQ', 'RMSProp', the 'Adam optimiser', a 'SAT solver', and 'DPLL', but it does not give version numbers for any software components or libraries. |
| Experiment Setup | Yes | Table 2: Hyper-parameters of the DQN Modules for Montezuma's Revenge. This table provides specific values for: minibatch size 32, replay memory size 150000, agent history length 4, target network update frequency (T_C) 10000, discount factor 0.99, learning rate 0.00025, initial exploration parameter 1, final exploration parameter 0.1, final exploration frame 150000, replay start size 8000, no-op max 30. (These values are gathered into a sketch configuration after the table.) |
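
To make the abstract's core idea concrete, here is a minimal sketch of one episode over an automaton-enriched state space, in the spirit of the Temporal DQN episode named in the Pseudocode row. Everything here is illustrative: `env`, `automaton`, `q_network`, and the labelling function are hypothetical stand-ins under our own assumptions, not the authors' implementation.

```python
# Illustrative sketch only: one episode in which the agent acts on a product
# state (observation, automaton state), as the abstract describes. All names
# below (env, automaton, q_network) are hypothetical stand-ins, not
# DeepSynth's actual code.
import random

def run_episode(env, automaton, q_network, epsilon, max_steps=1000):
    """Run one episode over the automaton-enriched (product) state space."""
    obs = env.reset()
    q_state = automaton.initial_state              # current automaton state
    for _ in range(max_steps):
        product_state = (obs, q_state)             # enriched state fed to the DQN
        if random.random() < epsilon:
            action = env.sample_action()           # epsilon-greedy exploration
        else:
            action = q_network.best_action(product_state)
        obs, reward, done = env.step(action)
        # Advance the automaton on the high-level label of the new observation,
        # so progress towards a sparse/non-Markovian reward is tracked symbolically.
        q_state = automaton.transition(q_state, env.label(obs))
        if done:
            break
```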
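The Open Datasets row cites the frozen-lake MDPs from the OpenAI Gym (Brockman et al., 2016). As a usage note, loading one takes only a few lines; the environment id and the reset/step signatures vary across Gym releases, so this sketch follows the classic (pre-0.26) API.

```python
# Minimal usage sketch for a frozen-lake MDP via OpenAI Gym. The environment
# id may be "FrozenLake-v1" in later releases, and newer Gym/Gymnasium
# versions change the reset/step return values.
import gym

env = gym.make("FrozenLake-v0")
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()           # random policy for illustration
    obs, reward, done, info = env.step(action)   # stochastic (slippery) dynamics
env.close()
```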
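Finally, the hyper-parameter values quoted in the Experiment Setup row can be collected into a single configuration. Only the numbers come from the paper's Table 2; the dictionary name and keys below are our own, chosen for readability.

```python
# Hyper-parameter values quoted from Table 2 of the paper (DQN modules for
# Montezuma's Revenge). Only the numeric values are from the paper; the key
# names are our own naming, not the paper's code.
DQN_HYPERPARAMS = {
    "minibatch_size": 32,
    "replay_memory_size": 150_000,
    "agent_history_length": 4,
    "target_network_update_frequency": 10_000,  # T_C in the paper
    "discount_factor": 0.99,
    "learning_rate": 0.00025,
    "initial_exploration": 1.0,                 # starting epsilon
    "final_exploration": 0.1,                   # epsilon after annealing
    "final_exploration_frame": 150_000,
    "replay_start_size": 8_000,
    "no_op_max": 30,
}
```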