Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline and validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Agent-State Construction with Auxiliary Inputs
Authors: Ruo Yu Tao, Adam White, Marlos C. Machado
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a series of examples illustrating the different ways of using auxiliary inputs for reinforcement learning. We show that these auxiliary inputs can be used to discriminate between observations that would otherwise be aliased, leading to more expressive features that smoothly interpolate between different states. Finally, we show that this approach is complementary to state-of-the-art methods such as recurrent neural networks and truncated back-propagation through time, and acts as a heuristic that facilitates longer temporal credit assignment, leading to better performance. [...] We summarize the performance of the agents that leverage auxiliary inputs in Figure 2a. The policy learned with any of the three auxiliary inputs converges to a higher return than the agent using only observations. [...] We compare our particle-filter-based auxiliary inputs to other agent-state functions for Modified Compass World and Rock Sample in Figures 3a and 3b respectively. [...] We show our results for both environments in Figures 4b and 4d. |
| Researcher Affiliation | Academia | Ruo Yu Tao1, Adam White 1, 2, Marlos C. Machado1, 2 1 Department of Computing Science, University of Alberta 2 Canada CIFAR AI Chair, Alberta Machine Intelligence Institute (Amii) EMAIL |
| Pseudocode | No | The paper provides mathematical formulations and descriptions of algorithms (e.g., Sarsa(0), particle filtering steps with equations), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, figures, or sections formatted like code. |
| Open Source Code | Yes | Code and implementation for this work is publicly available: https://github.com/taodav/aux-inputs |
| Open Datasets | Yes | We evaluate this approach on classic partially observable environments: a modified version of the Compass World (Rafols et al., 2005a) environment and the Rock Sample (Smith & Simmons, 2004) environment. [...] Frame Stacking. As an auxiliary input, frame stacking only considers the past four observations the agent has seen. This means that the auxiliary input is defined by the function M_t = M_h, where M_h concatenates the past 3 observations with the current observation: M_h({o_0, a_0, ..., o_t, a_t}) := o_{t-3} ⊕ o_{t-2} ⊕ o_{t-1} ⊕ o_t, where ⊕ denotes the concatenation operation. This auxiliary input function produces a fixed-length vector M_t ∈ R^{4n}, where n is the size of the observations. This is exactly the frame stacking technique ubiquitous throughout Atari 2600 experiments. |
| Dataset Splits | No | The paper does not provide specific train/test/validation dataset splits with percentages, sample counts, or explicit splitting methodologies for static datasets. It mentions evaluation over a number of runs and steps for reinforcement learning environments: "All hyperparameter sweeps were done over 30 seeds, with the best hyperparameters decided for each algorithm based on mean undiscounted returns over these 200 time steps over these 250K steps and 30 seeds. The results reported are over 30 different additional seeds, with each additional seed run on the selected hyperparameters." For Fishing environments: "Offline evaluations are conducted every evaluation frequency steps (as listed above). We run 5 test episodes per offline evaluation" |
| Hardware Specification | No | The paper acknowledges "support with computational resources" but does not provide specific hardware details such as CPU/GPU models, memory, or detailed cloud instance specifications used for running experiments. |
| Software Dependencies | No | The paper mentions software components and algorithms such as "Sarsa(0) algorithm", "Linear function approximation", "Adam Optimizer", "Proximal Policy Optimization (PPO)", and "LSTM" but does not specify version numbers for any of these, or for any programming languages or libraries used. |
| Experiment Setup | Yes | Learning algorithm: Sarsa(0). Function approximator: Linear. Optimizer: Adam. Discount rate: γ = 0.9. Environment train steps: 250K. Max episode steps: 200. Step sizes: [10^-2, 10^-3, 10^-4, 10^-5]. [...] For the trace decay agent, a step size of α = 10^-3 was selected from a hyperparameter sweep, with ϵ = 0.1 for the epsilon-greedy policy. [...] In our Rock Sample(7, 8) experiments, we leverage a replay buffer (Lin, 1992) for all of our experience. [...] Buffer size: [10K, 100K]. Number of particles: 100. [...] For the Fishing experiments, we use a convolutional neural network to parse our agent map tensor. [...] Batch size: 64. [...] Step sizes: [10^-4, 10^-5, 10^-6, 10^-7]. [...] For our 11×11 grid world, this amounts to a square observation of side length 11 + 11 - 1 = 21; the dimensions beyond the grid world's length of 11 account for the agent-centric view when the agent is at the edges of the grid world. |
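The frame-stacking auxiliary input quoted in the Open Datasets row can be sketched in a few lines: the agent state is simply the concatenation of the last four observations, giving a fixed-length vector in R^{4n}. This is a minimal illustration, not the authors' code; the names `FrameStack`, `obs_dim`, and `update` are hypothetical.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Auxiliary input that concatenates the last k observations (k = 4 in the paper)."""

    def __init__(self, obs_dim, k=4):
        self.k = k
        # Pad with zero observations so the stack is well-defined from t = 0.
        self.frames = deque([np.zeros(obs_dim)] * k, maxlen=k)

    def update(self, obs):
        # Append the newest observation; deque drops the oldest automatically.
        self.frames.append(np.asarray(obs, dtype=float))
        # M_t = o_{t-3} (+) o_{t-2} (+) o_{t-1} (+) o_t, a vector in R^{4n}.
        return np.concatenate(list(self.frames))

stack = FrameStack(obs_dim=3)
m_t = stack.update([1.0, 0.0, 0.0])  # shape (12,): 4n with n = 3
```

The last n entries of `m_t` always hold the current observation, so temporally adjacent but aliased observations yield distinct stacked states.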
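The Experiment Setup row lists linear Sarsa(0) with Adam, γ = 0.9, and an ϵ-greedy policy. The on-policy TD update at the heart of that setup can be sketched as follows; this is an assumption-laden toy (plain SGD rather than Adam, arbitrary dimensions, hypothetical names), not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2
w = np.zeros((n_actions, n_features))  # one linear weight vector per action
alpha, epsilon, gamma = 1e-3, 0.1, 0.9  # values listed in the setup above

def q(x, a):
    # Linear action-value estimate: Q(x, a) = w_a . x
    return w[a] @ x

def epsilon_greedy(x):
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax([q(x, a) for a in range(n_actions)]))

def sarsa0_update(x, a, r, x_next, a_next, done):
    # Sarsa(0) bootstraps on the action actually selected next (on-policy).
    target = r if done else r + gamma * q(x_next, a_next)
    w[a] += alpha * (target - q(x, a)) * x

# One illustrative transition with unit features and reward 1.
x = np.ones(n_features)
a = epsilon_greedy(x)
sarsa0_update(x, a, 1.0, x, a, done=True)
```

The agent state `x` here is wherever the auxiliary input construction lands, e.g. the stacked-frame vector M_t.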