Investigating Action Encodings in Recurrent Neural Networks in Reinforcement Learning
Authors: Matthew Kyle Schlegel, Volodymyr Tkachuk, Adam M White, Martha White
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we focus on several architectures for incorporating action into the state-update function of an RNN in partially observable RL settings. Many of these architectures have been proposed previously for recurrent architectures (i.e. Zhu et al. (2017); Schlegel et al. (2021)), and others are either related to or obvious extensions of those architectures. We perform an in-depth empirical evaluation on several illustrative domains, and outline the relationship between the domain and architectures. Finally, we discuss future work in developing recurrent architectures designed for the RL problem and discuss challenges specific to the RL setting needing investigation in the future. |
| Researcher Affiliation | Academia | Matthew Schlegel EMAIL University of Alberta Volodymyr Tkachuk EMAIL University of Alberta Adam White EMAIL University of Alberta Martha White EMAIL University of Alberta |
| Pseudocode | No | The paper describes algorithms and methods using mathematical equations and textual descriptions, but it does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present any structured, code-like procedural steps. |
| Open Source Code | Yes | All code for the following experiments can be found at https://github.com/mkschleg/ActionRNNs.jl and is written in Julia (Bezanson et al., 2017), and we use Flux and Zygote as our deep learning and auto-diff backend (Innes, 2018b;a). |
| Open Datasets | Yes | The agent observes an even (or odd) number sampled from the MNIST (Le Cun et al., 2010) dataset when facing the direction of (or opposite of) the goal. |
| Dataset Splits | No | The paper describes training agents within various reinforcement learning environments (Ring World, TMaze, Lunar Lander) for a specified number of steps (e.g., 300,000 steps, 4,000,000 steps) and reports performance metrics averaged over multiple independent runs (e.g., 'averaged over 50 independent runs'). It discusses evaluation over the 'final 10% of episodes' or 'average reward obtained over all episodes'. However, it does not specify traditional training, validation, or test dataset splits for static datasets, as is common in supervised learning. The experimental setup involves continuous interaction with dynamic environments rather than pre-split datasets. |
| Hardware Specification | No | All experiments were run using an off-site cluster. In total, for all sweeps and final experiments we used 20 CPU years, which was approximated based off the logging information used by the off-site cluster. |
| Software Dependencies | No | All code for the following experiments can be found at https://github.com/mkschleg/ActionRNNs.jl and is written in Julia (Bezanson et al., 2017), and we use Flux and Zygote as our deep learning and auto-diff backend (Innes, 2018b;a). |
| Experiment Setup | Yes | Unless otherwise stated, we performed a hyperparameter search for all models using a grid search over various parameters (listed appropriately in Appendix F). To the best of our ability we kept the number of hyperparameter settings equivalent across all models... All final network sizes can be found in Appendix F. Appendix F also includes tables like 'Figure 19: Ring World Hyperparameters', 'Figure 21: TMaze Experience Replay experiments: (top left) The hyperparameters used across all cells', and 'Figure 24: Lunar Lander experimental details: (top left) The hyperparameters used across all cells in Lunar Lander', which list specific values for 'Steps', 'Optimizer', 'η', 'ρ', 'Discount γ', 'Truncation τ', 'Buffer Size', 'Batch Size', 'Update freq', and 'Target Network Freq'. |
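The architectures the paper compares differ in how the action is injected into the RNN's state-update function. As a rough illustration of two common families of action encodings (additive vs. multiplicative), here is a minimal NumPy sketch; the parameter names (`Wx`, `Wa`, `Wh`) and both step functions are hypothetical and are not taken from the paper's Julia code in ActionRNNs.jl:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(obs_dim, act_dim, hid_dim):
    # Hypothetical parameterization; the paper's Flux-based cells differ.
    s = 1.0 / np.sqrt(hid_dim)
    return {
        "Wx": rng.uniform(-s, s, (hid_dim, obs_dim)),   # observation weights
        "Wa": rng.uniform(-s, s, (hid_dim, act_dim)),   # action weights
        "Wh": rng.uniform(-s, s, (hid_dim, hid_dim)),   # recurrent weights
        "b":  np.zeros(hid_dim),
    }

def additive_step(p, h, x, a_onehot):
    # Additive encoding: the action enters as one more input term
    # in the pre-activation sum.
    return np.tanh(p["Wx"] @ x + p["Wa"] @ a_onehot + p["Wh"] @ h + p["b"])

def multiplicative_step(p, h, x, a_onehot):
    # Multiplicative encoding: the action modulates the candidate
    # state elementwise via an action-dependent gain.
    gain = 1.0 + p["Wa"] @ a_onehot
    return np.tanh(gain * (p["Wx"] @ x + p["Wh"] @ h + p["b"]))
```

The design point both variants share is that the next hidden state depends jointly on observation, previous state, and the agent's own action, which is what makes them candidates for partially observable RL settings.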
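The hyperparameter search described in the row above is a plain grid search: every combination of the swept values is run. A minimal sketch of that enumeration, with an entirely hypothetical grid (the paper's actual values live in its Appendix F tables):

```python
from itertools import product

# Hypothetical sweep values; see the paper's Appendix F for the real grids.
grid = {
    "eta":        [1e-3, 1e-4],   # optimizer step size η
    "truncation": [8, 16],        # BPTT truncation τ
    "hidden":     [64, 128],      # network size
}

def grid_configs(grid):
    """Yield every hyperparameter combination as a dict (full grid search)."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid_configs(grid))  # 2 * 2 * 2 = 8 configurations here
```

Keeping the number of settings per parameter equal across models, as the paper states it did, keeps the grid sizes (and thus the compute budget) comparable between architectures.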