Learning Task Belief Similarity with Latent Dynamics for Meta-Reinforcement Learning
Authors: Menglong Zhang, Fuyuan Qian, Quanying Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate SimBelief on sparse-reward tasks in MuJoCo (Finn et al., 2017; Rakelly et al., 2019) and the more challenging panda-gym (Gallouédec et al., 2021) environment. We aim to address the following questions: (1) Can SimBelief achieve fast online adaptation in sparse-reward tasks? (2) Can SimBelief leverage learned latent belief similarity representations to enhance out-of-distribution generalization? (3) What is the impact of latent task representations on rapid exploration? (4) How does the latent space correspond to the real environment? Environments and baselines: We conducted experiments on six complex sparse-reward tasks: Point-Robot-Sparse, Cheetah-Vel-Sparse, Walker-Rand-Params, Panda-Reach, Panda-Push, and Panda-Pick-And-Place (see Appendix E). Online Adaptation Performance. During the training phase, we performed meta-testing by calculating the meta-episode average return and success rate across different tasks to evaluate the algorithm's online performance. As shown in Figure 3, SimBelief consistently performed well across all tasks and exhibited superior adaptation capabilities compared to other algorithms. |
| Researcher Affiliation | Academia | Menglong Zhang, Fuyuan Qian, Quanying Liu Southern University of Science and Technology EMAIL EMAIL |
| Pseudocode | Yes | The algorithm pseudocode can be found in Appendix C. Algorithm 1: SimBelief algorithm |
| Open Source Code | Yes | All experiments were conducted using an Nvidia RTX 4090 GPU; the source code is available at: https://github.com/mlzhang-pr/SimBelief. |
| Open Datasets | Yes | In this section, we evaluate SimBelief on sparse-reward tasks in MuJoCo (Finn et al., 2017; Rakelly et al., 2019) and the more challenging panda-gym (Gallouédec et al., 2021) environment. |
| Dataset Splits | Yes | Table 1 (adaptation length and goal settings for evaluation environments) lists, per environment: adaptation episodes, max steps per episode, goal type, goal range, goal radius. Cheetah-Vel-Sparse: 2, 200, velocity, [0,3], 0.5. Point-Robot-Sparse: 2, 60, position, semicircle with radius 1, 0.3. Walker-Rand-Params: 2, 200, velocity, 1.5, 0.5. Panda-Reach, Panda-Push, Panda-Pick-And-Place: 3, 50, position, /, 0.05 each. Table 2 (hyperparameter settings for SimBelief) gives the task splits per environment column (Cheetah-Vel-Sparse & Walker-Rand-Params / Point-Robot-Sparse / Panda-Reach / Panda-Push & Panda-Pick-And-Place): Number of Tasks 120 / 100 / 100 / 60; Number of Training Tasks 100 / 80 / 80 / 50; Number of Evaluation Tasks 20 / 20 / 20 / 10. |
| Hardware Specification | Yes | All experiments were conducted using an Nvidia RTX 4090 GPU |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | Table 2 (hyperparameter settings for SimBelief in different environments) has four environment columns, paired per its two-line header: (A) Cheetah-Vel-Sparse & Walker-Rand-Params, (B) Point-Robot-Sparse, (C) Panda-Reach, (D) Panda-Push & Panda-Pick-And-Place. Values given as A / B / C / D: Number of Tasks 120 / 100 / 100 / 60; Number of Training Tasks 100 / 80 / 80 / 50; Number of Evaluation Tasks 20 / 20 / 20 / 10; Number of Episodes 2 / 2 / 3 / 3; Number of Iterations 1000 / 2000 / 1000 / 4000; RL Updates per Iteration 2000 / 1000 / 1000 / 1000; Batch Size 256 (all); Policy Buffer Size 1e6 (all); VAE Buffer Size 1e5 / 5e4 / 5e4 / 5e4; Policy Layers [128, 128, 128] / [128, 128] / [128, 128] / [128, 128, 128]; Actor Learning Rate 0.0003 / 0.00007 / 0.00007 / 0.00007; Critic Learning Rate 0.0003 / 0.00007 / 0.00007 / 0.00007; Discount Factor (γ) 0.99 / 0.9 / 0.9 / 0.9; Entropy Alpha 0.2 / 0.01 / 0.01 / 0.01; VAE Updates per Iteration 20 / 25 / 25 / 25; VAE Learning Rate 0.0003 / 0.001 / 0.001 / 0.001; KL Weight 1.0 / 0.1 / 0.1 / 0.1; Task Embedding Size 10 / 10 / 5 / 5. |
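As a reading aid, the hyperparameters quoted from Table 2 can be transcribed into a plain Python config dict. This is a minimal sketch, not code from the SimBelief repository: the key names, the `DEFAULTS`/`get_config` helpers, and the assumption that the table's two-line header pairs Walker-Rand-Params with Cheetah-Vel-Sparse and Panda-Pick-And-Place with Panda-Push are all ours.

```python
# Table 2 transcribed as per-environment-column configs. Shared values are
# factored into DEFAULTS; everything else follows the table row by row.
DEFAULTS = {"batch_size": 256, "policy_buffer_size": int(1e6)}

HPARAMS = {
    # Column pairing with Walker-Rand-Params is an assumption (see lead-in).
    "cheetah_vel_sparse+walker_rand_params": {
        "num_tasks": 120, "num_train_tasks": 100, "num_eval_tasks": 20,
        "num_episodes": 2, "num_iterations": 1000, "rl_updates_per_iter": 2000,
        "vae_buffer_size": int(1e5), "policy_layers": [128, 128, 128],
        "actor_lr": 3e-4, "critic_lr": 3e-4, "gamma": 0.99,
        "entropy_alpha": 0.2, "vae_updates_per_iter": 20, "vae_lr": 3e-4,
        "kl_weight": 1.0, "task_embedding_size": 10,
    },
    "point_robot_sparse": {
        "num_tasks": 100, "num_train_tasks": 80, "num_eval_tasks": 20,
        "num_episodes": 2, "num_iterations": 2000, "rl_updates_per_iter": 1000,
        "vae_buffer_size": int(5e4), "policy_layers": [128, 128],
        "actor_lr": 7e-5, "critic_lr": 7e-5, "gamma": 0.9,
        "entropy_alpha": 0.01, "vae_updates_per_iter": 25, "vae_lr": 1e-3,
        "kl_weight": 0.1, "task_embedding_size": 10,
    },
    "panda_reach": {
        "num_tasks": 100, "num_train_tasks": 80, "num_eval_tasks": 20,
        "num_episodes": 3, "num_iterations": 1000, "rl_updates_per_iter": 1000,
        "vae_buffer_size": int(5e4), "policy_layers": [128, 128],
        "actor_lr": 7e-5, "critic_lr": 7e-5, "gamma": 0.9,
        "entropy_alpha": 0.01, "vae_updates_per_iter": 25, "vae_lr": 1e-3,
        "kl_weight": 0.1, "task_embedding_size": 5,
    },
    # Column pairing with Panda-Pick-And-Place is likewise an assumption.
    "panda_push+panda_pick_and_place": {
        "num_tasks": 60, "num_train_tasks": 50, "num_eval_tasks": 10,
        "num_episodes": 3, "num_iterations": 4000, "rl_updates_per_iter": 1000,
        "vae_buffer_size": int(5e4), "policy_layers": [128, 128, 128],
        "actor_lr": 7e-5, "critic_lr": 7e-5, "gamma": 0.9,
        "entropy_alpha": 0.01, "vae_updates_per_iter": 25, "vae_lr": 1e-3,
        "kl_weight": 0.1, "task_embedding_size": 5,
    },
}

def get_config(env_key):
    """Merge shared defaults with one environment column from Table 2."""
    return {**DEFAULTS, **HPARAMS[env_key]}
```

One quick consistency check the transcription makes easy: in every column, training tasks plus evaluation tasks equals the total task count.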