Bridging the Gap Between Target Networks and Functional Regularization
Authors: Alexandre Piché, Valentin Thomas, Joseph Marino, Rafael Pardinas, Gian Maria Marconi, Christopher Pal, Mohammad Emtiyaz Khan
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experimental study, we explored a variety of environments, including the two-state MDP (Tsitsiklis & Van Roy, 1996), the Four Rooms environment (Sutton et al., 1999), and the Atari suite (Bellemare et al., 2013), to assess the efficacy of regularization introduced by TN and FR in relation to performance, accuracy, and divergence. Our findings emphasize that Functional Regularization without regularization weight tuning can be used as a drop-in replacement for Target Networks without loss of performance and can result in performance improvement. Additionally, the combined use of the additional regularization weight and the network update period in FR can lead to enhanced performance compared to merely tuning the network update period for TN. Section 4: Experiments |
| Researcher Affiliation | Collaboration | Alexandre Piché (ServiceNow Research; Mila, Université de Montréal); Valentin Thomas (Mila, Université de Montréal); Rafael Pardinas (ServiceNow Research); Joseph Marino (DeepMind, London); Gian Maria Marconi (RIKEN Center for Advanced Intelligence Project); Christopher Pal (Mila, Polytechnique Montréal; Canada CIFAR AI Chair); Mohammad Emtiyaz Khan (RIKEN Center for Advanced Intelligence Project) |
| Pseudocode | Yes | Algorithm 1 Deep Q-Network (DQN) Algorithm with TN or FR |
| Open Source Code | Yes | The code is available at https://github.com/AlexPiche/fr-tmlr/. |
| Open Datasets | Yes | In our experimental study, we explored a variety of environments, including the two-state MDP (Tsitsiklis & Van Roy, 1996), the Four Rooms environment (Sutton et al., 1999), and the Atari suite (Bellemare et al., 2013) |
| Dataset Splits | No | The paper describes generating data through interaction with environments (e.g., 'collect 10000 environment transitions', 'run each algorithm for 10M steps') but does not specify explicit training/test/validation splits for a static dataset. The nature of RL often involves on-the-fly data generation rather than pre-split datasets. |
| Hardware Specification | No | The paper mentions 'approximately 60, 000 GPU hours' and 'a total of 30,000 GPU hours' but does not specify any particular GPU models, CPU models, or other hardware specifications used for the experiments. |
| Software Dependencies | No | The paper mentions using the 'Clean RL library (Huang et al., 2022)' and the 'rliable library' and 'Adam optimizer (Kingma & Ba, 2014)', but does not provide specific version numbers for these software components or any programming language used. |
| Experiment Setup | Yes | Table 1 (Four Rooms hyper-parameters): learning rate 1e-4; optimizer Adam (Kingma & Ba, 2014); discount factor γ 0.99; DNN layers [128, 128, 4]; grid dimension 11 × 11. Section 4.4.1 Experimental Set-Up: 'For each environment, we decay the probability of a random action from 1 to ϵ and the discount factor γ used to train the Q-value. We report the results for different γ and ϵ since they can both increase instability and result in divergence. Unless specified otherwise, we use the default hyper-parameters from the Clean RL library (Huang et al., 2022).' |
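The comparison the table's rows revolve around — bootstrapping the TD target from a frozen Target Network (TN) versus letting the online network bootstrap from itself while regularizing its outputs toward a lagging "prior" network (FR) — can be sketched as follows. This is a hedged illustration, not the paper's implementation: the function names, the tabular `q` callables, and the exact FR loss form (online-network bootstrap plus a κ-weighted penalty on the distance to the prior network's outputs) are assumptions made for illustration; Algorithm 1 in the paper gives the authors' precise procedure.

```python
import numpy as np

def td_loss_target_network(q_online, q_target, s, a, r, s_next, gamma=0.99):
    """Standard DQN-style loss: the bootstrap target comes from a frozen
    target network that is only refreshed every N steps."""
    y = r + gamma * np.max(q_target(s_next))       # target uses frozen net
    return (q_online(s)[a] - y) ** 2

def td_loss_functional_reg(q_online, q_prior, s, a, r, s_next,
                           gamma=0.99, kappa=0.5):
    """FR sketch: bootstrap from the online network itself, and add a
    kappa-weighted penalty pulling the online network's Q-values toward
    those of a periodically-updated prior network."""
    y = r + gamma * np.max(q_online(s_next))       # target uses online net
    td = (q_online(s)[a] - y) ** 2
    reg = kappa * np.sum((q_online(s) - q_prior(s)) ** 2)
    return td + reg
```

With κ = 0 and identical networks the two losses coincide, which matches the report's framing of FR as a drop-in replacement for TN: the regularization weight κ and the prior-network update period are the two knobs the paper tunes jointly.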