Replay-enhanced Continual Reinforcement Learning

Authors: Tiantian Zhang, Kevin Zehua Shen, Zichuan Lin, Bo Yuan, Xueqian Wang, Xiu Li, Deheng Ye

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the Continual World benchmark show that RECALL performs significantly better than purely perfect memory replay, and achieves comparable or better overall performance against state-of-the-art continual learning methods. We conduct comprehensive experiments on a suite of realistic robotic manipulation tasks from the Continual World benchmark (De Lange et al., 2021).
Researcher Affiliation | Collaboration | Tiantian Zhang EMAIL Tsinghua University; Kevin Z. Shen EMAIL The University of British Columbia; Zichuan Lin EMAIL Tencent; Bo Yuan EMAIL Tsinghua University; Xueqian Wang EMAIL Tsinghua University; Xiu Li EMAIL Tsinghua University; Deheng Ye EMAIL Tencent
Pseudocode | Yes | Algorithm 1: Replay-Enhanced ContinuAL RL (RECALL)
Open Source Code | No | The paper does not contain any explicit statement about providing source code, or a link to a code repository for the methodology described.
Open Datasets | Yes | We conduct comprehensive experiments on a suite of realistic robotic manipulation tasks from the Continual World benchmark (De Lange et al., 2021) designed as a testbed for evaluating RL agents with respect to challenges incurred by the continual learning paradigm.
Dataset Splits | No | The paper describes the arrangement of tasks into sequences (e.g., CW3, CW10, CW20) and the training duration for each task (1M steps), followed by evaluation. However, it does not provide explicit training/validation/test splits of the data *within* each task's dataset, which is common in reinforcement learning, where data is collected through interaction rather than pre-split.
Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models, or memory specifications.
Software Dependencies | No | The paper mentions using 'the soft actor-critic (SAC)' algorithm and refers to implementations 'based on (Wołczyk et al., 2021)' and 'from (Wołczyk et al., 2022)'. It also states that 'the maximum entropy coefficient α is tuned automatically according to the adjustment rule provided in (Haarnoja et al., 2018b)'. However, it does not explicitly list specific software dependencies (e.g., Python, PyTorch, CUDA) with their version numbers for the current work.
Experiment Setup | Yes | We use an implementation of the underlying RL algorithm SAC (Haarnoja et al., 2018a;b; Zhou et al., 2022) based on (Wołczyk et al., 2021), in which the maximum entropy coefficient α is tuned automatically according to the adjustment rule provided in (Haarnoja et al., 2018b). We follow exactly the same experimental setup (including network structure and hyperparameters) from (Wołczyk et al., 2022) for all baselines and the common settings for RECALL, ensuring fair comparison. The actor and critic are implemented as two separate MLP networks, each with 4 hidden layers of 256 units... For each task sequence, we search method-specific regularization coefficient λ for policy distillation of RECALL in {0.01, 0.1, 1, 10, 100}, and the final selected value is 10. Replay buffer size is set to be consistent with that in Perfect Memory and batch size is 128.

Table 4: Core hyperparameters used for the underlying SAC algorithm.
- optimizer: Adam
- learning rate: 1e-3
- batch size: 128
- discount factor (γ): 0.99
- target smoothing coefficient (τ): 0.005
- target update interval: 1
- target output std (σt): 0.089
- replay buffer size: 10^6
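Since the paper releases no code, the reported setup can only be sketched. Below is a minimal Python sketch, assuming the Table 4 values and the stated actor/critic architecture (two separate MLPs, 4 hidden layers of 256 units); the names `sac_config` and `mlp_layer_sizes`, and the example observation/action dimensions, are our own illustrative choices, not from the paper.

```python
# Hedged sketch of the SAC setup reported for RECALL. All values below are
# copied from the paper's Table 4 and experiment-setup text; the container
# names and example dimensions are hypothetical.

sac_config = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "batch_size": 128,
    "discount_factor": 0.99,         # gamma
    "target_smoothing_coef": 0.005,  # tau
    "target_update_interval": 1,
    "target_output_std": 0.089,      # sigma_t
    "replay_buffer_size": 10**6,     # matches Perfect Memory's buffer size
    "distillation_coef": 10,         # lambda, searched over {0.01, 0.1, 1, 10, 100}
}

def mlp_layer_sizes(input_dim, output_dim, hidden=256, depth=4):
    """Layer widths for the actor/critic MLPs: 4 hidden layers of 256 units."""
    return [input_dim] + [hidden] * depth + [output_dim]

# Example: a Q-network over a hypothetical 39-dim observation + 4-dim action.
print(mlp_layer_sizes(39 + 4, 1))  # [43, 256, 256, 256, 256, 1]
```

The actor would use the same widths with an output head sized to the action distribution parameters; this sketch only mirrors the stated shapes and hyperparameters, not the training loop.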