Learning Task Belief Similarity with Latent Dynamics for Meta-Reinforcement Learning
Authors: Menglong Zhang, Fuyuan Qian, Quanying Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate SimBelief on sparse-reward tasks in MuJoCo (Finn et al., 2017; Rakelly et al., 2019) and the more challenging panda-gym (Gallouédec et al., 2021) environment. We aim to address the following questions: (1) Can SimBelief achieve fast online adaptation in sparse-reward tasks? (2) Can SimBelief leverage learned latent belief similarity representations to enhance out-of-distribution generalization? (3) What is the impact of latent task representations on rapid exploration? (4) How does the latent space correspond to the real environment? Environments and baselines: We conducted experiments on six complex sparse-reward tasks: Point-Robot-Sparse, Cheetah-Vel-Sparse, Walker-Rand-Params, Panda-Reach, Panda-Push, and Panda-Pick-And-Place (see Appendix E). Online Adaptation Performance. During the training phase, we performed meta-testing by calculating the meta-episode average return and success rate across different tasks to evaluate the algorithm's online performance. As shown in Figure 3, SimBelief consistently performed well across all tasks and exhibited superior adaptation capabilities compared to other algorithms. |
| Researcher Affiliation | Academia | Menglong Zhang, Fuyuan Qian, Quanying Liu Southern University of Science and Technology EMAIL EMAIL |
| Pseudocode | Yes | The algorithm pseudocode can be found in Appendix C. Algorithm 1: SimBelief algorithm |
| Open Source Code | Yes | All experiments were conducted using an Nvidia RTX 4090 GPU; the source code is available at: https://github.com/mlzhang-pr/SimBelief. |
| Open Datasets | Yes | In this section, we evaluate SimBelief on sparse-reward tasks in MuJoCo (Finn et al., 2017; Rakelly et al., 2019) and the more challenging panda-gym (Gallouédec et al., 2021) environment. |
| Dataset Splits | Yes | Table 1 (adaptation length and goal settings for evaluation environments) lists, per environment: adaptation episodes, max steps per episode, goal type, goal range, goal radius. Cheetah-Vel-Sparse: 2, 200, velocity, [0,3], 0.5. Point-Robot-Sparse: 2, 60, position, semicircle with radius 1, 0.3. Walker-Rand-Params: 2, 200, velocity, 1.5, 0.5. Panda-Reach, Panda-Push, Panda-Pick-And-Place: 3, 50, position, /, 0.05 each. Table 2 (hyperparameter settings for SimBelief) gives the task splits per environment column (Cheetah-Vel-Sparse & Walker-Rand-Params / Point-Robot-Sparse / Panda-Reach / Panda-Push & Panda-Pick-And-Place): Number of Tasks 120 / 100 / 100 / 60; Number of Training Tasks 100 / 80 / 80 / 50; Number of Evaluation Tasks 20 / 20 / 20 / 10. |
| Hardware Specification | Yes | All experiments were conducted using an Nvidia RTX 4090 GPU |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | Table 2 (hyperparameter settings for SimBelief in different environments) has four environment columns, paired per its two-line header: (A) Cheetah-Vel-Sparse & Walker-Rand-Params, (B) Point-Robot-Sparse, (C) Panda-Reach, (D) Panda-Push & Panda-Pick-And-Place. Values given as A / B / C / D: Number of Tasks 120 / 100 / 100 / 60; Number of Training Tasks 100 / 80 / 80 / 50; Number of Evaluation Tasks 20 / 20 / 20 / 10; Number of Episodes 2 / 2 / 3 / 3; Number of Iterations 1000 / 2000 / 1000 / 4000; RL Updates per Iteration 2000 / 1000 / 1000 / 1000; Batch Size 256 (all); Policy Buffer Size 1e6 (all); VAE Buffer Size 1e5 / 5e4 / 5e4 / 5e4; Policy Layers [128, 128, 128] / [128, 128] / [128, 128] / [128, 128, 128]; Actor Learning Rate 0.0003 / 0.00007 / 0.00007 / 0.00007; Critic Learning Rate 0.0003 / 0.00007 / 0.00007 / 0.00007; Discount Factor (γ) 0.99 / 0.9 / 0.9 / 0.9; Entropy Alpha 0.2 / 0.01 / 0.01 / 0.01; VAE Updates per Iteration 20 / 25 / 25 / 25; VAE Learning Rate 0.0003 / 0.001 / 0.001 / 0.001; KL Weight 1.0 / 0.1 / 0.1 / 0.1; Task Embedding Size 10 / 10 / 5 / 5. |
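As a reading aid, the hyperparameters quoted from Table 2 can be transcribed into a plain Python config dict. This is a minimal sketch, not code from the SimBelief repository: the key names, the `DEFAULTS`/`get_config` helpers, and the assumption that the table's two-line header pairs Walker-Rand-Params with Cheetah-Vel-Sparse and Panda-Pick-And-Place with Panda-Push are all ours.

```python
# Table 2 transcribed as per-environment-column configs. Shared values are
# factored into DEFAULTS; everything else follows the table row by row.
DEFAULTS = {"batch_size": 256, "policy_buffer_size": int(1e6)}

HPARAMS = {
    # Column pairing with Walker-Rand-Params is an assumption (see lead-in).
    "cheetah_vel_sparse+walker_rand_params": {
        "num_tasks": 120, "num_train_tasks": 100, "num_eval_tasks": 20,
        "num_episodes": 2, "num_iterations": 1000, "rl_updates_per_iter": 2000,
        "vae_buffer_size": int(1e5), "policy_layers": [128, 128, 128],
        "actor_lr": 3e-4, "critic_lr": 3e-4, "gamma": 0.99,
        "entropy_alpha": 0.2, "vae_updates_per_iter": 20, "vae_lr": 3e-4,
        "kl_weight": 1.0, "task_embedding_size": 10,
    },
    "point_robot_sparse": {
        "num_tasks": 100, "num_train_tasks": 80, "num_eval_tasks": 20,
        "num_episodes": 2, "num_iterations": 2000, "rl_updates_per_iter": 1000,
        "vae_buffer_size": int(5e4), "policy_layers": [128, 128],
        "actor_lr": 7e-5, "critic_lr": 7e-5, "gamma": 0.9,
        "entropy_alpha": 0.01, "vae_updates_per_iter": 25, "vae_lr": 1e-3,
        "kl_weight": 0.1, "task_embedding_size": 10,
    },
    "panda_reach": {
        "num_tasks": 100, "num_train_tasks": 80, "num_eval_tasks": 20,
        "num_episodes": 3, "num_iterations": 1000, "rl_updates_per_iter": 1000,
        "vae_buffer_size": int(5e4), "policy_layers": [128, 128],
        "actor_lr": 7e-5, "critic_lr": 7e-5, "gamma": 0.9,
        "entropy_alpha": 0.01, "vae_updates_per_iter": 25, "vae_lr": 1e-3,
        "kl_weight": 0.1, "task_embedding_size": 5,
    },
    # Column pairing with Panda-Pick-And-Place is likewise an assumption.
    "panda_push+panda_pick_and_place": {
        "num_tasks": 60, "num_train_tasks": 50, "num_eval_tasks": 10,
        "num_episodes": 3, "num_iterations": 4000, "rl_updates_per_iter": 1000,
        "vae_buffer_size": int(5e4), "policy_layers": [128, 128, 128],
        "actor_lr": 7e-5, "critic_lr": 7e-5, "gamma": 0.9,
        "entropy_alpha": 0.01, "vae_updates_per_iter": 25, "vae_lr": 1e-3,
        "kl_weight": 0.1, "task_embedding_size": 5,
    },
}

def get_config(env_key):
    """Merge shared defaults with one environment column from Table 2."""
    return {**DEFAULTS, **HPARAMS[env_key]}
```

One quick consistency check the transcription makes easy: in every column, training tasks plus evaluation tasks equals the total task count.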