Learning Two-Step Hybrid Policy for Graph-Based Interpretable Reinforcement Learning
Authors: Tongzhou Mu, Kaixiang Lin, Feiyang Niu, Govind Thattai
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental studies on four levels of complex text-based games have demonstrated the superiority of the proposed method compared to the state-of-the-art. We evaluate our method on Text World, which is a framework for designing text-based interactive games. More specifically, we use the Text World games generated by GATA Adhikari et al. (2020). Table 3 shows the normalized scores of different methods on both training environments and test environment in Text World. Table 4 shows the performance of vanilla RL and our method under noisy input graphs generated in the above mentioned way. In this section, we study the contributions of different modules in our method. |
| Researcher Affiliation | Collaboration | Tongzhou Mu EMAIL Department of Computer Science and Engineering University of California San Diego Kaixiang Lin EMAIL Amazon Feiyang Niu EMAIL Amazon Govind Thattai EMAIL Amazon |
| Pseudocode | No | The paper describes the two-step hybrid decision-making process and the rule mining process in detail using natural language and mathematical formulations, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We evaluate our method on Text World, which is a framework for designing text-based interactive games. More specifically, we use the Text World games generated by GATA Adhikari et al. (2020). |
| Dataset Splits | Yes | The games have four different difficulty levels, and each difficulty level contains 20 training, 20 validation, and 20 test environments, which are sampled from a distribution based on the difficulty level. |
| Hardware Specification | No | The paper mentions training models and experiments but does not provide any specific details about the hardware used (e.g., GPU models, CPU types, or cloud computing instance specifications). |
| Software Dependencies | No | The paper mentions several software components and frameworks used, such as "fastText Mikolov et al. (2017)", "Relational-GCN", "DQN Mnih et al. (2015)", "GCN", and "GTN Yun et al. (2019)". However, it does not specify version numbers for these or other ancillary software components. |
| Experiment Setup | Yes | To collect the demonstration dataset, we first train a teacher policy by DQN Mnih et al. (2015) in the training environments, which can converge to a near-optimal solution. The trained teacher policy is used to collect 300K samples through interaction with the environment and label them with the taken actions, as illustrated in Sec 4.3.1. When collecting the demonstration dataset, we use an ϵ-greedy exploration strategy to increase the diversity of states. We want to train a classifier f(s; θ) = k, where k ∈ {1, 2, ..., K} is an action type. This is a conventional classification problem which can be solved by minimizing the cross-entropy loss: θ* = arg min_θ −Σ_i Σ_{j=1}^{K} k_i^j log(f_θ^j(s_i)). Then we can get the ASE(A_k) by selecting the edges with importance higher than a threshold, i.e., ASE(A_k) = {e \| I_a(e) > τ}, where τ is a hyperparameter shared across all action types. |
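The experiment-setup quote above combines two steps: training an action-type classifier by minimizing cross-entropy over demonstration labels, and selecting action-specific edges (ASE) whose importance exceeds a shared threshold τ. The sketch below illustrates both steps on toy data; it is not the paper's implementation, and all names (`train_classifier`, `select_ase`, the linear softmax model standing in for f_θ, and the example edge-importance dictionary) are illustrative assumptions.

```python
import numpy as np

# Hedged toy sketch, NOT the paper's code: (1) a classifier trained with
# cross-entropy, (2) ASE selection by importance threshold tau.
rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_classifier(states, labels, num_classes, lr=0.5, steps=200):
    """Minimize -sum_i sum_j k_i^j log f_theta^j(s_i) for a linear
    softmax model (an assumed stand-in for the paper's f_theta)."""
    n, d = states.shape
    W = np.zeros((d, num_classes))
    onehot = np.eye(num_classes)[labels]        # one-hot labels k_i^j
    for _ in range(steps):
        probs = softmax(states @ W)             # f_theta^j(s_i)
        grad = states.T @ (probs - onehot) / n  # gradient of cross-entropy
        W -= lr * grad
    return W

def select_ase(edge_importance, tau):
    """ASE(A_k) = {e | I_a(e) > tau}, with tau shared across action types."""
    return {e for e, imp in edge_importance.items() if imp > tau}

# Toy demonstration data: two well-separated clusters, one per action type.
states = np.vstack([rng.normal(-1, 0.3, (50, 2)), rng.normal(1, 0.3, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
W = train_classifier(states, labels, num_classes=2)
accuracy = (softmax(states @ W).argmax(axis=1) == labels).mean()

# Hypothetical edge-importance scores; only edges above tau are kept.
importance = {"is(player, kitchen)": 0.9, "north_of(garden, kitchen)": 0.1}
ase = select_ase(importance, tau=0.5)
```

On separable toy data the classifier reaches high accuracy, and `select_ase` keeps only the high-importance edge; in the paper both the classifier and the importance scores come from the learned model rather than hand-set values.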