Bridging the Gap Between Target Networks and Functional Regularization

Authors: Alexandre Piché, Valentin Thomas, Joseph Marino, Rafael Pardinas, Gian Maria Marconi, Christopher Pal, Mohammad Emtiyaz Khan

TMLR 2023

Reproducibility Assessment (Variable, Result, LLM Response)
Research Type: Experimental. In our experimental study, we explored a variety of environments, including the two-state MDP (Tsitsiklis & Van Roy, 1996), the Four Rooms environment (Sutton et al., 1999), and the Atari suite (Bellemare et al., 2013), to assess the regularization introduced by TN and FR in terms of performance, accuracy, and divergence. Our findings show that Functional Regularization, even without tuning the regularization weight, can be used as a drop-in replacement for Target Networks with no loss of performance, and can even improve performance. Additionally, tuning both the regularization weight and the network update period in FR can outperform tuning the network update period alone for TN. (Section 4: Experiments)
Researcher Affiliation: Collaboration.
Alexandre Piché (EMAIL): ServiceNow Research; Mila, Université de Montréal
Valentin Thomas (EMAIL): Mila, Université de Montréal
Rafael Pardinas (EMAIL): ServiceNow Research
Joseph Marino (EMAIL): DeepMind, London
Gian Maria Marconi (EMAIL): RIKEN Center for Advanced Intelligence Project
Christopher Pal (EMAIL): Mila, Polytechnique Montréal; Canada CIFAR AI Chair
Mohammad Emtiyaz Khan (EMAIL): RIKEN Center for Advanced Intelligence Project
Pseudocode: Yes. Algorithm 1: Deep Q-Network (DQN) Algorithm with TN or FR.
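The TN and FR loss variants referenced in Algorithm 1 can be illustrated with a minimal scalar sketch. This is an assumption-laden illustration, not the paper's implementation: the function names are invented, the values are single scalars rather than network outputs, and the placement of the FR penalty on the next-state value follows one common reading of functional regularization.

```python
def tn_loss(q_sa, q_next_target, r, gamma):
    """Squared TD error with a Target Network (TN): the bootstrap
    target uses the frozen lagging network's next-state value, so the
    target is a constant with respect to the online parameters."""
    y = r + gamma * q_next_target
    return (y - q_sa) ** 2

def fr_loss(q_sa, q_next_online, q_next_lag, r, gamma, kappa):
    """Squared TD error with Functional Regularization (FR): the
    bootstrap uses the online network, and a penalty of weight kappa
    pulls its next-state value toward the lagging network's output."""
    td = (r + gamma * q_next_online - q_sa) ** 2
    reg = kappa * (q_next_online - q_next_lag) ** 2
    return td + reg
```

When the online and lagging networks agree on the next-state value, the two losses coincide; the weight kappa then controls how far the online function may drift from the lagging copy between its periodic updates, which is the sense in which FR can act as a drop-in replacement for TN.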
Open Source Code: Yes. The code is available at https://github.com/AlexPiche/fr-tmlr/.
Open Datasets: Yes. In our experimental study, we explored a variety of environments, including the two-state MDP (Tsitsiklis & Van Roy, 1996), the Four Rooms environment (Sutton et al., 1999), and the Atari suite (Bellemare et al., 2013).
Dataset Splits: No. The paper describes generating data through interaction with environments (e.g., 'collect 10000 environment transitions', 'run each algorithm for 10M steps') but does not specify explicit training/validation/test splits for a static dataset. RL typically generates data on the fly rather than using pre-split datasets.
Hardware Specification: No. The paper mentions 'approximately 60,000 GPU hours' and 'a total of 30,000 GPU hours' but does not specify any particular GPU models, CPU models, or other hardware used for the experiments.
Software Dependencies: No. The paper mentions using the CleanRL library (Huang et al., 2022), the rliable library, and the Adam optimizer (Kingma & Ba, 2014), but does not provide version numbers for these software components or for the programming language used.
Experiment Setup: Yes.
Table 1: Four Rooms Hyper-parameters
learning rate: 1e-4
optimizer: Adam (Kingma & Ba, 2014)
discount factor γ: 0.99
DNN layers: [128, 128, 4]
grid dimension: 11 × 11
Section 4.4.1 Experimental Set-Up: 'For each environment, we decay the probability of a random action from 1 to ϵ and the discount factor γ used to train the Q-value. We report the results for different γ and ϵ since they can both increase instability and result in divergence. Unless specified otherwise, we use the default hyper-parameters from the CleanRL library (Huang et al., 2022).'
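The Four Rooms hyper-parameters listed in Table 1 can be collected into a small configuration sketch. The dictionary keys below are illustrative names chosen for readability; they do not correspond to CleanRL's actual flag or config names.

```python
# Four Rooms hyper-parameters from Table 1 of the paper.
# Key names are illustrative, not CleanRL's CLI flags.
four_rooms_config = {
    "learning_rate": 1e-4,
    "optimizer": "adam",          # Kingma & Ba, 2014
    "gamma": 0.99,                # discount factor
    "dnn_layers": [128, 128, 4],  # two hidden layers, 4 action values out
    "grid_dimension": (11, 11),
}
```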