Learning Two-Step Hybrid Policy for Graph-Based Interpretable Reinforcement Learning

Authors: Tongzhou Mu, Kaixiang Lin, Feiyang Niu, Govind Thattai

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental studies on four levels of complex text-based games have demonstrated the superiority of the proposed method compared to the state-of-the-art. We evaluate our method on TextWorld, which is a framework for designing text-based interactive games. More specifically, we use the TextWorld games generated by GATA (Adhikari et al., 2020). Table 3 shows the normalized scores of different methods on both the training and test environments in TextWorld. Table 4 shows the performance of vanilla RL and our method under noisy input graphs generated in the above-mentioned way. In this section, we study the contributions of different modules in our method.
Researcher Affiliation | Collaboration | Tongzhou Mu (EMAIL), Department of Computer Science and Engineering, University of California San Diego; Kaixiang Lin (EMAIL), Amazon; Feiyang Niu (EMAIL), Amazon; Govind Thattai (EMAIL), Amazon
Pseudocode | No | The paper describes the two-step hybrid decision-making process and the rule-mining process in detail using natural language and mathematical formulations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets | Yes | We evaluate our method on TextWorld, which is a framework for designing text-based interactive games. More specifically, we use the TextWorld games generated by GATA (Adhikari et al., 2020).
Dataset Splits | Yes | The games have four different difficulty levels, and each difficulty level contains 20 training, 20 validation, and 20 test environments, which are sampled from a distribution based on the difficulty level.
Hardware Specification | No | The paper mentions training models and experiments but does not provide any specific details about the hardware used (e.g., GPU models, CPU types, or cloud-computing instance specifications).
Software Dependencies | No | The paper mentions several software components and frameworks, such as fastText (Mikolov et al., 2017), Relational-GCN, DQN (Mnih et al., 2015), GCN, and GTN (Yun et al., 2019). However, it does not specify version numbers for these or other ancillary software components.
Experiment Setup | Yes | To collect the demonstration dataset, we first train a teacher policy with DQN (Mnih et al., 2015) in the training environments, which converges to a near-optimal solution. The trained teacher policy is used to collect 300K samples through interaction with the environment, and we label them with the taken actions, as illustrated in Sec. 4.3.1. When collecting the demonstration dataset, we use an ε-greedy exploration strategy to increase the diversity of states. We want to train a classifier f_p(s; θ) = k, where k ∈ {1, 2, ..., K} is an action type. This is a conventional classification problem that can be solved by minimizing the cross-entropy loss: θ* = arg min_θ Σ_j −log f_θ^{k_j}(s_j), where f_θ^{k}(s) denotes the predicted probability of action type k. Then we can obtain ASE(A_k) by selecting the edges with importance higher than a threshold, i.e., ASE(A_k) = {e | I_a(e) > τ}, where τ is a hyperparameter shared across all action types.
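The two steps quoted above — fitting an action-type classifier by cross-entropy and thresholding edge importances to get ASE(A_k) — can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the linear softmax model, the toy demonstration data, and the edge-importance values are all hypothetical stand-ins for the paper's graph-based state encoder and learned importances.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_action_type_classifier(S, k, num_types, lr=0.5, steps=500):
    """Minimize cross-entropy: theta* = argmin_theta sum_j -log f_theta^{k_j}(s_j).

    S: (n, d) state features; k: (n,) integer action-type labels in {0..K-1}.
    A linear softmax model stands in for the paper's classifier f_p(s; theta).
    """
    n, d = S.shape
    W = np.zeros((d, num_types))
    Y = np.eye(num_types)[k]            # one-hot labels
    for _ in range(steps):
        P = softmax(S @ W)
        W -= lr * S.T @ (P - Y) / n     # gradient of mean cross-entropy
    return W

def select_ase(edge_importance, tau):
    """ASE(A_k) = {e : I_a(e) > tau} -- keep edges above the shared threshold."""
    return {e for e, imp in edge_importance.items() if imp > tau}

# Toy demonstration data (assumed, not from the paper): 2 action types
# determined by the sign of the first state feature.
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 4))
k = (S[:, 0] > 0).astype(int)
W = train_action_type_classifier(S, k, num_types=2)
acc = (softmax(S @ W).argmax(axis=1) == k).mean()

# Hypothetical edge importances for one action type; tau filters them.
edges = {("kitchen", "fridge"): 0.9, ("fridge", "apple"): 0.2}
ase = select_ase(edges, tau=0.5)
```

Because the toy labels are a deterministic linear function of the state, the classifier fits them almost perfectly; in the paper, the same cross-entropy objective is applied to the teacher's 300K labeled demonstrations instead.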