Memory-efficient Reinforcement Learning with Value-based Knowledge Consolidation
Authors: Qingfeng Lan, Yangchen Pan, Jun Luo, A. Rupam Mahmood
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we propose memory-efficient RL algorithms based on the deep Q-network (DQN) algorithm. Specifically, we assign a new role to the target neural network, which was originally introduced to stabilize training (Mnih et al., 2015). In our algorithms, the target neural network plays the role of a knowledge keeper and helps consolidate knowledge in the action-value network through a consolidation loss. We also introduce a tuning parameter to balance learning new knowledge and remembering past knowledge. Through experiments in both feature-based and image-based environments, we demonstrate that our algorithms, while using an experience replay buffer at least 10 times smaller than that of DQN, still achieve comparable or even better performance. |
| Researcher Affiliation | Collaboration | Qingfeng Lan (EMAIL), Department of Computing Science, University of Alberta; Yangchen Pan (EMAIL), University of Oxford; Jun Luo (EMAIL), Huawei Noah's Ark Lab; A. Rupam Mahmood (EMAIL), Department of Computing Science, University of Alberta; CIFAR AI Chair, Amii |
| Pseudocode | Yes | A. The pseudocode of DQN and MeDQN(R): Algorithm 2, Deep Q-learning with experience replay (DQN); Algorithm 3, Memory-efficient DQN with real state sampling |
| Open Source Code | Yes | Code release: https://github.com/qlan3/MeDQN |
| Open Datasets | Yes | We chose four tasks with low-dimensional inputs from Gym (Brockman et al., 2016) and PyGame Learning Environment (Tasfi, 2016): MountainCar-v0 (2), Acrobot-v1 (6), Catcher (4), and Pixelcopter (7), where numbers in parentheses are input state dimensions. To further evaluate our algorithms, we chose four tasks with high-dimensional image inputs from MinAtar (Young & Tian, 2019): Asterix (10×10×4), Seaquest (10×10×10), Breakout (10×10×4), and Space Invaders (10×10×6). To further evaluate our algorithm, we conducted experiments in Atari games (Bellemare et al., 2013). Specifically, we selected five representative games recommended by Aitchison et al. (2022), including Battlezone, Double Dunk, Name This Game, Phoenix, and Qbert. |
| Dataset Splits | Yes | To get non-IID input data, we consider two-stage training. In Stage 1, we generated training samples (x, y), where x ∈ [0, 1] and y = sin(πx). For Stage 2, x ∈ [1, 2] and y = sin(πx). |
| Hardware Specification | No | The paper provides a general statement about |
| Software Dependencies | No | The paper mentions software like |
| Experiment Setup | Yes | The mini-batch size was 32. The discount factor was 0.99. The best learning rate was selected from {1e-2, 3e-3, 1e-3, 3e-4, 1e-4} with grid search; Adam was used to optimize network parameters; all algorithms were trained for 1e5 steps. ... For MeDQN, λ_start = 0.01; λ_end was chosen from {1, 2, 4}; E was selected from {1, 2, 4}. We chose C_current in {1, 2, 4, 8}. Other hyper-parameter choices are presented in Tables 3-6. |
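The consolidation idea summarized in the Research Type row (a TD loss plus a term keeping the action-value network close to the target network, weighted by λ) can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: `medqn_loss`, its signature, and the use of full action-value matching over sampled real states are assumptions based on the MeDQN(R) description.

```python
import numpy as np

def medqn_loss(q, q_target, batch, consol_states, lam, gamma=0.99):
    """Combined MeDQN-style objective (sketch): standard DQN TD loss plus
    lam * consolidation loss, where consolidation pulls the action-value
    network q toward the knowledge-keeping target network q_target on
    sampled real states. q / q_target map a state array to (N, A) values."""
    s, a, r, s2, done = batch
    # Standard DQN TD target computed with the target network
    td_target = r + gamma * (1.0 - done) * q_target(s2).max(axis=1)
    td_error = q(s)[np.arange(len(a)), a] - td_target
    td_loss = np.mean(td_error ** 2)
    # Consolidation: match target-network values over all actions
    consol_loss = np.mean((q(consol_states) - q_target(consol_states)) ** 2)
    return td_loss + lam * consol_loss
```

In practice the gradient would flow only through `q`, with `q_target` held fixed, mirroring how DQN treats its target network.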
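The Experiment Setup row gives only the endpoints for the consolidation weight (λ_start = 0.01, λ_end ∈ {1, 2, 4}). A linear schedule between them is one plausible reading; the schedule shape here is an assumption, and `linear_lambda` is a hypothetical helper name.

```python
def linear_lambda(step, total_steps, lam_start=0.01, lam_end=1.0):
    """Linearly anneal the consolidation weight from lam_start to lam_end
    over total_steps, then hold at lam_end (linear shape is assumed)."""
    frac = min(step / total_steps, 1.0)
    return lam_start + frac * (lam_end - lam_start)
```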
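The two-stage non-IID setup described under Dataset Splits (Stage 1: x in [0, 1]; Stage 2: x in [1, 2]; y = sin(πx)) can be reproduced with a short generator. The function name and uniform sampling within each interval are assumptions; the paper specifies only the intervals and the target function.

```python
import numpy as np

def two_stage_data(n_per_stage, rng=None):
    """Generate the two-stage non-IID regression data: Stage 1 draws x
    from [0, 1], Stage 2 from [1, 2], with targets y = sin(pi * x)."""
    if rng is None:
        rng = np.random.default_rng(0)
    stages = []
    for lo, hi in [(0.0, 1.0), (1.0, 2.0)]:
        x = rng.uniform(lo, hi, size=n_per_stage)
        stages.append((x, np.sin(np.pi * x)))
    return stages
```

Training on Stage 1 to convergence and then on Stage 2 exposes the network to the distribution shift that the consolidation loss is meant to mitigate.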