Investigating Action Encodings in Recurrent Neural Networks in Reinforcement Learning
Authors: Matthew Kyle Schlegel, Volodymyr Tkachuk, Adam M White, Martha White
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we focus on several architectures for incorporating action into the state-update function of an RNN in partially observable RL settings. Many of these architectures have been proposed previously for recurrent architectures (i.e. Zhu et al. (2017); Schlegel et al. (2021)), and others are either related to or obvious extensions of those architectures. We perform an in-depth empirical evaluation on several illustrative domains, and outline the relationship between the domain and architectures. Finally, we discuss future work in developing recurrent architectures designed for the RL problem and discuss challenges specific to the RL setting needing investigation in the future. |
| Researcher Affiliation | Academia | Matthew Schlegel EMAIL University of Alberta Volodymyr Tkachuk EMAIL University of Alberta Adam White EMAIL University of Alberta Martha White EMAIL University of Alberta |
| Pseudocode | No | The paper describes algorithms and methods using mathematical equations and textual descriptions, but it does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present any structured, code-like procedural steps. |
| Open Source Code | Yes | All code for the following experiments can be found at https://github.com/mkschleg/ActionRNNs.jl and is written in Julia (Bezanson et al., 2017), and we use Flux and Zygote as our deep learning and auto-diff backend (Innes, 2018b;a). |
| Open Datasets | Yes | The agent observes an even (or odd) number sampled from the MNIST (Le Cun et al., 2010) dataset when facing the direction of (or opposite of) the goal. |
| Dataset Splits | No | The paper describes training agents within various reinforcement learning environments (Ring World, TMaze, Lunar Lander) for a specified number of steps (e.g., 300,000 steps, 4,000,000 steps) and reports performance metrics averaged over multiple independent runs (e.g., 'averaged over 50 independent runs'). It discusses evaluation over the 'final 10% of episodes' or 'average reward obtained over all episodes'. However, it does not specify traditional training, validation, or test dataset splits for static datasets, as is common in supervised learning. The experimental setup involves continuous interaction with dynamic environments rather than pre-split datasets. |
| Hardware Specification | No | All experiments were run using an off-site cluster. In total, for all sweeps and final experiments we used 20 CPU years, which was approximated based off the logging information used by the off-site cluster. |
| Software Dependencies | No | All code for the following experiments can be found at https://github.com/mkschleg/ActionRNNs.jl and is written in Julia (Bezanson et al., 2017), and we use Flux and Zygote as our deep learning and auto-diff backend (Innes, 2018b;a). |
| Experiment Setup | Yes | Unless otherwise stated, we performed a hyperparameter search for all models using a grid search over various parameters (listed appropriately in Appendix F). To the best of our ability we kept the number of hyperparameter settings equivalent across all models... All final network sizes can be found in Appendix F. Appendix F also includes tables like 'Figure 19: Ring World Hyperparameters', 'Figure 21: TMaze Experience Replay experiments: (top left) The hyperparameters used across all cells', and 'Figure 24: Lunar Lander experimental details: (top left) The hyperparameters used across all cells in Lunar Lander', which list specific values for 'Steps', 'Optimizer', 'η', 'ρ', 'Discount γ', 'Truncation τ', 'Buffer Size', 'Batch Size', 'Update freq', and 'Target Network Freq'. |
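The architectures the paper compares differ in how the action is injected into the RNN's state-update function. As a rough illustration of two common families of action encodings (additive vs. multiplicative), here is a minimal NumPy sketch; the parameter names (`Wx`, `Wa`, `Wh`) and both step functions are hypothetical and are not taken from the paper's Julia code in ActionRNNs.jl:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(obs_dim, act_dim, hid_dim):
    # Hypothetical parameterization; the paper's Flux-based cells differ.
    s = 1.0 / np.sqrt(hid_dim)
    return {
        "Wx": rng.uniform(-s, s, (hid_dim, obs_dim)),   # observation weights
        "Wa": rng.uniform(-s, s, (hid_dim, act_dim)),   # action weights
        "Wh": rng.uniform(-s, s, (hid_dim, hid_dim)),   # recurrent weights
        "b":  np.zeros(hid_dim),
    }

def additive_step(p, h, x, a_onehot):
    # Additive encoding: the action enters as one more input term
    # in the pre-activation sum.
    return np.tanh(p["Wx"] @ x + p["Wa"] @ a_onehot + p["Wh"] @ h + p["b"])

def multiplicative_step(p, h, x, a_onehot):
    # Multiplicative encoding: the action modulates the candidate
    # state elementwise via an action-dependent gain.
    gain = 1.0 + p["Wa"] @ a_onehot
    return np.tanh(gain * (p["Wx"] @ x + p["Wh"] @ h + p["b"]))
```

The design point both variants share is that the next hidden state depends jointly on observation, previous state, and the agent's own action, which is what makes them candidates for partially observable RL settings.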
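The hyperparameter search described in the row above is a plain grid search: every combination of the swept values is run. A minimal sketch of that enumeration, with an entirely hypothetical grid (the paper's actual values live in its Appendix F tables):

```python
from itertools import product

# Hypothetical sweep values; see the paper's Appendix F for the real grids.
grid = {
    "eta":        [1e-3, 1e-4],   # optimizer step size η
    "truncation": [8, 16],        # BPTT truncation τ
    "hidden":     [64, 128],      # network size
}

def grid_configs(grid):
    """Yield every hyperparameter combination as a dict (full grid search)."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid_configs(grid))  # 2 * 2 * 2 = 8 configurations here
```

Keeping the number of settings per parameter equal across models, as the paper states it did, keeps the grid sizes (and thus the compute budget) comparable between architectures.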