General Value Function Networks

Authors: Matthew Schlegel, Andrew Jacobsen, Zaheer Abbas, Andrew Patterson, Adam White, Martha White

JAIR 2021

Reproducibility variables, with results and supporting excerpts (LLM responses) from the paper:
Research Type: Experimental
"In this section, we compare GVFNs and RNNs on two time series prediction datasets, particularly to ask: 1) can GVFNs obtain comparable performance, and 2) do GVFNs allow for faster learning, due to the regularizing effect of constraining the state to be predictions? We investigate whether they allow for faster learning both by examining learning speed as well as robustness to truncation length in BPTT. ... In this section, we investigate the utility of constraining states to be predictions, for an environment with long temporal dependencies. We use Compass World, introduced in Section 4 (see Figure 2), which can have long temporal dependencies, because the random behavior can stay in the center of the world for many steps, observing only the color white."
Researcher Affiliation: Academia
"Matthew Schlegel EMAIL; Andrew Jacobsen EMAIL; Zaheer Abbas EMAIL; Andrew Patterson EMAIL; Adam White EMAIL; Martha White EMAIL. Department of Computing Science and the Alberta Machine Intelligence Institute (Amii), University of Alberta, Canada."
Pseudocode: No
The Recurrent TD update for GVFNs:
    s_t ← f_{θ_t}(s_{t−1}, x_t),        where x_t := [a_{t−1}, o_t]
    s_{t+1} ← f_{θ_t}(s_t, x_{t+1}),    where x_{t+1} := [a_t, o_{t+1}]
    φ_{t,j} ← ∇_θ s_{t,j}               (sensitivities computed using truncated BPTT)
    δ_{t,j} ← C^{(j)}_{t+1} + γ^{(j)}_{t+1} s_{t+1,j} − s_{t,j}
    ρ_{t,j} ← π^{(j)}(a_t | o_t) / µ(a_t | o_t)   (policies can be functions of histories, not just of o_t)
    θ_{t+1} ← θ_t + α_t Σ_j ρ_{t,j} δ_{t,j} φ_{t,j}
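The update above can be sketched concretely. This is a minimal NumPy sketch, not the paper's implementation: it assumes a linear GVFN layer (rows of a weight matrix W are the per-GVF weights) and truncation length p = 1, so the sensitivity of s_{t,j} with respect to row j of W reduces to the current input vector. The function name `gvfn_td_step` is illustrative.

```python
import numpy as np

def gvfn_td_step(W, s_prev, x_t, x_next, C, gamma, rho, alpha):
    """One Recurrent TD step for a hypothetical linear GVFN layer (p = 1).

    W:      (n_gvfs, n_gvfs + n_inputs) weight matrix (theta)
    C:      per-GVF cumulants C^{(j)}_{t+1}
    gamma:  per-GVF continuations gamma^{(j)}_{t+1}
    rho:    per-GVF importance-sampling ratios pi^{(j)}/mu
    """
    z_t = np.concatenate([s_prev, x_t])       # input [s_{t-1}; x_t]
    s_t = W @ z_t                             # s_t = f_theta(s_{t-1}, x_t), linear f
    z_next = np.concatenate([s_t, x_next])
    s_next = W @ z_next                       # s_{t+1} = f_theta(s_t, x_{t+1})
    delta = C + gamma * s_next - s_t          # TD error per GVF j
    # With p = 1 truncation, the sensitivity phi_{t,j} of s_{t,j}
    # w.r.t. row j of W is just z_t, giving a rank-one update.
    W_new = W + alpha * np.outer(rho * delta, z_t)
    return W_new, s_t
```

A longer truncation p would instead backpropagate the sensitivities through the previous p states, as in standard truncated BPTT.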
Open Source Code: Yes
"All code for these experiments can be found at https://github.com/mkschleg/GVFN"
Open Datasets: Yes
"We consider two time series datasets previously studied in a comparative analysis of RNN architectures by Bianchi, Maiorino, Kampffmeyer, Rizzi, and Jenssen (2017): the Mackey-Glass time series (previously introduced), and the Multiple Superimposed Oscillator. The single-variate Mackey-Glass (MG) time series dataset is a synthetic dataset generated from a time-delay differential equation. ... The Multiple Superimposed Oscillator (MSO) synthetic time series (Jaeger & Haas, 2004) is defined by the sum of four sinusoids with unique frequencies. ... We use Compass World, introduced in Section 4 (see Figure 2), which can have long temporal dependencies, because the random behavior can stay in the center of the world for many steps, observing only the color white."
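Both synthetic series are easy to regenerate. The sketch below assumes the standard Mackey-Glass parameters (β = 0.2, γ = 0.1, τ = 17, n = 10) with a simple Euler discretization, and the MSO frequencies from Jaeger and Haas (2004); the paper may use a different discretization or parameterization.

```python
import numpy as np

def mackey_glass(n_steps, beta=0.2, gamma=0.1, tau=17, n=10, dt=1.0, y0=1.2):
    """Euler-discretized Mackey-Glass delay differential equation:
    dy/dt = beta * y(t - tau) / (1 + y(t - tau)^n) - gamma * y(t)."""
    y = np.full(n_steps + tau, y0)  # constant history for the delay term
    for t in range(tau, n_steps + tau - 1):
        y_tau = y[t - tau]
        y[t + 1] = y[t] + dt * (beta * y_tau / (1.0 + y_tau ** n) - gamma * y[t])
    return y[tau:]

def mso(n_steps, freqs=(0.2, 0.311, 0.42, 0.51)):
    """Multiple Superimposed Oscillator: sum of four sinusoids with
    incommensurate frequencies (Jaeger & Haas, 2004)."""
    t = np.arange(n_steps)
    return sum(np.sin(f * t) for f in freqs)
```

The incommensurate MSO frequencies make the summed signal aperiodic over any practical horizon, which is what makes it a hard memory benchmark.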
Dataset Splits: No
"At each step t, after observing o_t = y(t), the RNN (or GVFN) makes a prediction ŷ_t about the target y_t, which is the observation h = 12 steps into the future, y_t = y(t + h). The magnitude of the squared error (ŷ_t − y_t)² depends on the scale of y_t. To provide a more scale-invariant error, we normalize by the error of a mean predictor. Specifically, for each run, we report average error over windows of size 10000, where the mean predictor is computed for each window. This results in m/10000 normalized squared errors, where m is the length of the time series. We repeat this process 30 times, average these errors across the 30 runs, and take the square root, to get a Normalized Root Mean Squared Error (NRMSE)."
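One reading of that protocol in code, assuming non-overlapping windows and a per-window mean-predictor baseline (the function name `windowed_nrmse` is illustrative, and targets are assumed non-constant within each window so the baseline error is nonzero):

```python
import numpy as np

def windowed_nrmse(preds, targets, window=10_000):
    """NRMSE over non-overlapping windows: each window's MSE is normalized
    by the MSE of a predictor that outputs that window's target mean."""
    errs = []
    for start in range(0, len(targets) - window + 1, window):
        p = preds[start:start + window]
        y = targets[start:start + window]
        mse = np.mean((p - y) ** 2)
        mse_mean = np.mean((y - y.mean()) ** 2)  # mean-predictor baseline
        errs.append(mse / mse_mean)
    return np.sqrt(np.mean(errs))
```

A perfect predictor gives 0, and a predictor no better than the window mean gives 1, which is what makes the measure comparable across series of different scales.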
Hardware Specification: No
"We would also like to thank the Alberta Machine Intelligence Institute, IVADO, NSERC and the Canada CIFAR AI Chairs Program for the funding for this research, as well as Compute Canada for the computing resources used for this work."
Software Dependencies: No
"We use standard implementations found in Flux (Innes, 2018). ... The weights for the fully-connected ReLU layer and the weights for the linear output are trained using ADAM, to minimize the mean squared error between the prediction at time t and target y(t + h)."
Experiment Setup: Yes
"We fixed the values for hyperparameters as much as possible, using the previously reported value for the RNN and reasonable defaults for the GVFN. The stepsize is typically difficult to pick ahead of time, and so we sweep that hyperparameter for all the algorithms. We attempted to make the number of hyperparameters swept comparable for all methods, to avoid an unfair advantage. We do not tune the truncation length, as we report results for each truncation length p ∈ {1, 2, 4, 8, 16, 32} for all the algorithms. ... The GVFN consists of a single layer of size 32 and 128 (for MG and MSO respectively), corresponding to horizon GVFs. As described in Section 5, each GVF has a constant continuation γ^(j) ∈ [0.2, 0.95] and cumulant C^(j)_t = (1 − γ^(j)) y(t) / y^max_t, where y^max_t is an incrementally-computed maximum of the observations y(t) up to time t. ... The GVFN is followed by a fully-connected layer with ReLU activations to produce a non-linear representation, which is linearly weighted to predict the target. The GVFN layer uses a linear activation, with clipping between [−10, 10]. ... The GVFN was trained using Recurrent TD with a constant learning rate and a batch size of 32. The weights for the fully-connected ReLU layer and the weights for the linear output are trained using ADAM. ... We swept the stepsize hyperparameters: the learning rate for the GVFN, α_GVFN = N × 10^(−k) for N ∈ {1, 5}, k ∈ {3, …, 6}, and the learning rate for the fully-connected and output layers, α_pred = N × 10^(−k) for N ∈ {1, 5}, k ∈ {2, …, 5}. We compare to RNNs, LSTMs, and GRUs. The network architecture is similar to the GVFN for all recurrent models. The RNN size is set to 32 for MG and 128 for MSO, while the GRU and LSTM have 8 hidden units for MG and 128 for MSO. ... We trained these models using p-BPTT, specifically with the ADAM optimizer with a batch size of 32, to minimize the mean squared error between the prediction at time t and y(t + h). We swept the learning rate α = 2^(−k) with k ∈ {1, …, 20}."
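The sweep grids quoted above can be enumerated directly. The values follow the quoted ranges; the variable names are illustrative:

```python
# Learning-rate grids from the quoted sweep description.
# GVFN layer: alpha = N * 10^-k, N in {1, 5}, k in {3, ..., 6}
gvfn_alphas = [n * 10.0 ** -k for n in (1, 5) for k in range(3, 7)]
# Fully-connected / output layers: N in {1, 5}, k in {2, ..., 5}
pred_alphas = [n * 10.0 ** -k for n in (1, 5) for k in range(2, 6)]
# RNN/LSTM/GRU baselines: alpha = 2^-k, k in {1, ..., 20}
rnn_alphas = [2.0 ** -k for k in range(1, 21)]
# Truncation lengths reported for all algorithms (not tuned)
trunc_lengths = [1, 2, 4, 8, 16, 32]
```

This keeps the swept-hyperparameter counts comparable across methods (8 stepsizes per GVFN layer versus 20 for the baselines), as the quoted setup intends.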