Knowledge Retention in Continual Model-Based Reinforcement Learning
Authors: Haotian Fu, Yixiang Sun, Michael Littman, George Konidaris
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations demonstrate that DRAGO is able to preserve knowledge across tasks, achieving superior performance in various continual learning scenarios. We evaluated DRAGO on three continual learning domains. In Figure 4, we also visualize the prediction accuracy of the learned world models across the whole gridworld... As shown in Figure 5, we find that the proposed method DRAGO achieves the best overall performance compared to all the other approaches across three domains. This section evaluates the essentiality of DRAGO's components. Specifically, we evaluate DRAGO's performance without Synthetic Experience Rehearsal and Regaining Memories Through Exploration (reviewer) separately in four transfer tasks of Cheetah and MiniGrid. As we show in Figure 7... |
| Researcher Affiliation | Academia | Brown University. Correspondence to: Haotian Fu <EMAIL>, Yixiang Sun <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 DRAGO (Training process for each task) Algorithm 2 update learner and reviewer Algorithm 3 update vae Algorithm 4 update transition from synthetic data |
| Open Source Code | Yes | Code is available at https://github.com/YixiangSun/drago. |
| Open Datasets | Yes | We evaluated the performance of DRAGO in the Mini Grid (Chevalier-Boisvert et al., 2023) domain using a sequence of four tasks... We also evaluated the performance of DRAGO in the Cheetah and Walker domains from the Deepmind Control Suite (Tassa et al., 2018). |
| Dataset Splits | Yes | Task names in Blue denote the continual training tasks; Task names in Red denote the test tasks. We train and test all the tasks in the order of left to right as in the figure. For example, we train the cheetah agent in the order run, jump, and backward, and after training on jump, we test on jump2run and jump&run. The evaluation was conducted on four new tasks that require the agent to move between different rooms (e.g., start in room 1 and move to the goal position in room 2). The evaluation was conducted on several new tasks that require the agent to quickly change to a different locomotion mode from another mode (jump, run, etc.). |
| Hardware Specification | No | This work was conducted using computational resources and services at the Center for Computation and Visualization, Brown University. |
| Software Dependencies | No | We implement DRAGO on top of TDMPC (Hansen et al., 2022) and the overall algorithm is described in Algorithm 1. |
| Experiment Setup | Yes | Table 2. Here we list the hyperparameters used for MiniGrid, DM-Control cheetah, and DM-Control walker; where three comma-separated values are given, they correspond to those three domains in that order. Unlisted hyperparameters are all identical to the default parameters in TD-MPC. action repeat 1, 4, 2; discount factor 0.99; batch size 512; maximum steps 100, 1000, 1000; planning horizon 10, (25, 15), 15; policy fraction 0.05; temperature 0.5; momentum 0.1; reward loss coef 0.5; value coef 0.1; consistency loss coef 2; VAE recon loss coef 1; VAE KL loss coef 0.02; temporal loss discount (ρ) 0.5; learning rate 1e-3; sampling technique PER(0.6, 0.4); target networks update freq 40, 2, 2; temperature (τ) 0.01; cost coef for reviewer reward (α) 0.5; VAE latent dim 64, 256, 256; VAE encoding dim 128; MLP latent dim 512; Gumbel softmax temp 1.0; steps per synthetic data rehearsal 10, 20 |
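To make the flattened hyperparameter listing above easier to reuse, here is a minimal sketch that organizes it as a per-domain config. It assumes that entries with three comma-separated values map to MiniGrid, cheetah, and walker in that order (matching the order the domains are named), and that single values are shared across all three domains. The key names and `per_domain` helper are illustrative, not the authors' actual config schema, and only a representative subset of the listed hyperparameters is included.

```python
# Illustrative per-domain hyperparameter config for the DRAGO experiments,
# reconstructed from the reported Table 2 values. Not the authors' code.
DOMAINS = ("minigrid", "cheetah", "walker")

def per_domain(*values):
    """Map one value per domain, broadcasting a single shared value."""
    if len(values) == 1:
        values = values * len(DOMAINS)
    return dict(zip(DOMAINS, values))

HYPERPARAMS = {
    "action_repeat":      per_domain(1, 4, 2),
    "discount":           per_domain(0.99),
    "batch_size":         per_domain(512),
    "max_steps":          per_domain(100, 1000, 1000),
    "planning_horizon":   per_domain(10, (25, 15), 15),
    "learning_rate":      per_domain(1e-3),
    "vae_latent_dim":     per_domain(64, 256, 256),
    "target_update_freq": per_domain(40, 2, 2),
}

# Look up the settings for one domain, e.g. DM-Control cheetah:
cheetah_config = {name: vals["cheetah"] for name, vals in HYPERPARAMS.items()}
```

This layout makes the report's three-value convention explicit: any value that differs per domain is spelled out once, and shared values (discount, batch size, learning rate) are broadcast rather than repeated.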