Differentiable Information Enhanced Model-Based Reinforcement Learning
Authors: Xiaoyuan Zhang, Xinyan Cai, Bo Liu, Weidong Huang, Song-Chun Zhu, Siyuan Qi, Yaodong Yang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the effectiveness of our approach in differentiable environments, we provide theoretical analysis and empirical results. Notably, our approach outperforms previous model-based and model-free methods in multiple challenging tasks involving controllable rigid robots, such as humanoid robot motion control, as well as deformable object manipulation. |
| Researcher Affiliation | Academia | 1 Institute for Artificial Intelligence, Peking University; 2 State Key Laboratory of General Artificial Intelligence, Peking University, Beijing, China; 3 State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China; 4 Institute of Automation, Chinese Academy of Sciences |
| Pseudocode | Yes | We list the pseudocode in Algorithm 1. Algorithm 1: MB-MIX |
| Open Source Code | No | The paper does not contain any explicit statements about releasing code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We then conducted experiments on two benchmarks, DiffRL (Xu et al. 2021) and Brax (Freeman et al. 2021), which contain classic robot control problems. Moreover, in DaXBench (Chen et al. 2022), we demonstrated the effectiveness of our method in differentiable deformable object environments with large state and action spaces. |
| Dataset Splits | No | The paper mentions environments/benchmarks like DiffRL, Brax, and DaXBench, but it does not specify any dataset splits (e.g., train/test/validation percentages or counts) within these environments that would be needed for reproduction. |
| Hardware Specification | No | The paper mentions the 'Bruce' humanoid robot as an experimental subject but does not provide details on the hardware used to run the experiments or train the models (e.g., GPU/CPU models, memory). |
| Software Dependencies | No | The paper mentions various algorithms and methods (e.g., SHAC, PPO, SAC, Dreamer V3) but does not list any specific software libraries, frameworks, or operating system versions with their respective version numbers. |
| Experiment Setup | Yes | In the experiment, our MB-MIX algorithm was trained on all six tasks, with λ = 0.98 and the mix-interval set to 1 or 2 depending on the task. The state and action spaces have dimensions 20 and 5, respectively, yielding a reward matrix R ∈ ℝ^{20×5}. The initial policy is a matrix θ₀ ∈ ℝ^{20×5}, and the final policy π_θ is obtained via softmax activation: π_θ(a\|s) = exp(θ(s, a)) / Σ_b exp(θ(s, b)). Fewer parallel environments (4 and 8) were used to highlight the sample efficiency of model-based methods. |
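The tabular softmax policy described in the experiment-setup excerpt (parameters θ ∈ ℝ^{20×5}, π_θ(a|s) = exp(θ(s, a)) / Σ_b exp(θ(s, b))) can be sketched as follows. This is a minimal illustration assuming NumPy and a discrete 20-state, 5-action setting; the function and variable names are ours, not the authors' implementation.

```python
import numpy as np

N_STATES, N_ACTIONS = 20, 5  # dimensions quoted in the experiment setup

# Illustrative random initialization of theta_0 in R^{20x5}.
rng = np.random.default_rng(0)
theta = rng.normal(size=(N_STATES, N_ACTIONS))

def softmax_policy(theta: np.ndarray, state: int) -> np.ndarray:
    """Return pi_theta(. | state) = exp(theta[s, a]) / sum_b exp(theta[s, b])."""
    logits = theta[state]
    logits = logits - logits.max()  # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

probs = softmax_policy(theta, state=3)  # a valid distribution over 5 actions
```

The max-subtraction step does not change the resulting distribution (it cancels in the ratio) but avoids overflow when logits are large.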