Memory-efficient Reinforcement Learning with Value-based Knowledge Consolidation
Authors: Qingfeng Lan, Yangchen Pan, Jun Luo, A. Rupam Mahmood
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we propose memory-efficient RL algorithms based on the deep Q-network (DQN) algorithm. Specifically, we assign a new role to the target neural network, which was originally introduced to stabilize training (Mnih et al., 2015). In our algorithms, the target neural network plays the role of a knowledge keeper and helps consolidate knowledge in the action-value network through a consolidation loss. We also introduce a tuning parameter to balance learning new knowledge and remembering past knowledge. Through experiments in both feature-based and image-based environments, we demonstrate that our algorithms, while using an experience replay buffer at least 10 times smaller than that of DQN, still achieve comparable or even better performance. |
| Researcher Affiliation | Collaboration | Qingfeng Lan (EMAIL), Department of Computing Science, University of Alberta; Yangchen Pan (EMAIL), University of Oxford; Jun Luo (EMAIL), Huawei Noah's Ark Lab; A. Rupam Mahmood (EMAIL), Department of Computing Science, University of Alberta; CIFAR AI Chair, Amii |
| Pseudocode | Yes | A. The pseudocode of DQN and MeDQN(R): Algorithm 2, Deep Q-learning with experience replay (DQN); Algorithm 3, Memory-efficient DQN with real state sampling |
| Open Source Code | Yes | Code release: https://github.com/qlan3/MeDQN |
| Open Datasets | Yes | We chose four tasks with low-dimensional inputs from Gym (Brockman et al., 2016) and PyGame Learning Environment (Tasfi, 2016): MountainCar-v0 (2), Acrobot-v1 (6), Catcher (4), and Pixelcopter (7), where numbers in parentheses are input state dimensions. To further evaluate our algorithms, we chose four tasks with high-dimensional image inputs from MinAtar (Young & Tian, 2019): Asterix (10×10×4), Seaquest (10×10×10), Breakout (10×10×4), and Space Invaders (10×10×6). To further evaluate our algorithm, we conducted experiments in Atari games (Bellemare et al., 2013). Specifically, we selected five representative games recommended by Aitchison et al. (2022), including Battlezone, Double Dunk, Name This Game, Phoenix, and Qbert. |
| Dataset Splits | Yes | To get non-IID input data, we consider two-stage training. In Stage 1, we generated training samples (x, y), where x ∈ [0, 1] and y = sin(πx). For Stage 2, x ∈ [1, 2] and y = sin(πx). |
| Hardware Specification | No | The paper provides a general statement about |
| Software Dependencies | No | The paper mentions software like |
| Experiment Setup | Yes | The mini-batch size was 32. The discount factor was 0.99. The best learning rate was selected from {1e-2, 3e-3, 1e-3, 3e-4, 1e-4} with grid search; Adam was used to optimize network parameters; all algorithms were trained for 1e5 steps. ... For MeDQN, λ_start = 0.01; λ_end was chosen from {1, 2, 4}; E was selected from {1, 2, 4}. We chose C_current in {1, 2, 4, 8}. Other hyper-parameter choices are presented in Tables 3-6. |
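The consolidation idea summarized in the Research Type row (a TD loss plus a term keeping the action-value network close to the target network, weighted by λ) can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: `medqn_loss`, its signature, and the use of full action-value matching over sampled real states are assumptions based on the MeDQN(R) description.

```python
import numpy as np

def medqn_loss(q, q_target, batch, consol_states, lam, gamma=0.99):
    """Combined MeDQN-style objective (sketch): standard DQN TD loss plus
    lam * consolidation loss, where consolidation pulls the action-value
    network q toward the knowledge-keeping target network q_target on
    sampled real states. q / q_target map a state array to (N, A) values."""
    s, a, r, s2, done = batch
    # Standard DQN TD target computed with the target network
    td_target = r + gamma * (1.0 - done) * q_target(s2).max(axis=1)
    td_error = q(s)[np.arange(len(a)), a] - td_target
    td_loss = np.mean(td_error ** 2)
    # Consolidation: match target-network values over all actions
    consol_loss = np.mean((q(consol_states) - q_target(consol_states)) ** 2)
    return td_loss + lam * consol_loss
```

In practice the gradient would flow only through `q`, with `q_target` held fixed, mirroring how DQN treats its target network.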
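The Experiment Setup row gives only the endpoints for the consolidation weight (λ_start = 0.01, λ_end ∈ {1, 2, 4}). A linear schedule between them is one plausible reading; the schedule shape here is an assumption, and `linear_lambda` is a hypothetical helper name.

```python
def linear_lambda(step, total_steps, lam_start=0.01, lam_end=1.0):
    """Linearly anneal the consolidation weight from lam_start to lam_end
    over total_steps, then hold at lam_end (linear shape is assumed)."""
    frac = min(step / total_steps, 1.0)
    return lam_start + frac * (lam_end - lam_start)
```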
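The two-stage non-IID setup described under Dataset Splits (Stage 1: x in [0, 1]; Stage 2: x in [1, 2]; y = sin(πx)) can be reproduced with a short generator. The function name and uniform sampling within each interval are assumptions; the paper specifies only the intervals and the target function.

```python
import numpy as np

def two_stage_data(n_per_stage, rng=None):
    """Generate the two-stage non-IID regression data: Stage 1 draws x
    from [0, 1], Stage 2 from [1, 2], with targets y = sin(pi * x)."""
    if rng is None:
        rng = np.random.default_rng(0)
    stages = []
    for lo, hi in [(0.0, 1.0), (1.0, 2.0)]:
        x = rng.uniform(lo, hi, size=n_per_stage)
        stages.append((x, np.sin(np.pi * x)))
    return stages
```

Training on Stage 1 to convergence and then on Stage 2 exposes the network to the distribution shift that the consolidation loss is meant to mitigate.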