Differentiable Information Enhanced Model-Based Reinforcement Learning

Authors: Xiaoyuan Zhang, Xinyan Cai, Bo Liu, Weidong Huang, Song-Chun Zhu, Siyuan Qi, Yaodong Yang

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To validate the effectiveness of our approach in differentiable environments, we provide theoretical analysis and empirical results. Notably, our approach outperforms previous model-based and model-free methods on multiple challenging tasks, including motion control of controllable rigid robots such as humanoids and deformable object manipulation.
Researcher Affiliation Academia 1 Institute for Artificial Intelligence, Peking University 2 State Key Laboratory of General Artificial Intelligence, Peking University, Beijing, China 3 State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China 4 Institute of Automation, Chinese Academy of Sciences
Pseudocode Yes We list the pseudocode in Algorithm 1 (Algorithm 1: MB-MIX).
Open Source Code No The paper does not contain any explicit statements about releasing code, nor does it provide a link to a code repository.
Open Datasets Yes We then conducted experiments on two benchmarks, DiffRL (Xu et al. 2021) and Brax (Freeman et al. 2021), which contain classic robot control problems. Moreover, in DaXBench (Chen et al. 2022), we demonstrated the effectiveness of our method in differentiable deformable-object environments with large state and action spaces.
Dataset Splits No The paper mentions environments/benchmarks such as DiffRL, Brax, and DaXBench, but it does not specify any dataset splits (e.g., train/validation/test percentages or counts) within these environments that would be needed for reproduction.
Hardware Specification No The paper mentions the 'Bruce' humanoid robot as an experimental subject but does not provide details on the hardware used to run the experiments or train the models (e.g., GPU/CPU models, memory).
Software Dependencies No The paper mentions various algorithms and methods (e.g., SHAC, PPO, SAC, Dreamer V3) but does not list any specific software libraries, frameworks, or operating system versions with their respective version numbers.
Experiment Setup Yes In the experiments, our MB-MIX algorithm was trained on all six tasks, with λ = 0.98 and the mix-interval set to 1 or 2 depending on the task. The state and action spaces have dimensions 20 and 5, respectively, yielding a reward matrix R ∈ ℝ^{20×5}. The initial policy is a matrix θ_0 ∈ ℝ^{20×5}, and the final policy π_θ is obtained via a softmax activation: π_θ(a|s) = exp(θ(s, a)) / Σ_b exp(θ(s, b)). Fewer parallel environments (4 and 8) were used to highlight the sample efficiency of model-based methods.
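As a minimal sketch of the softmax policy parameterization described in the setup above (assuming NumPy; the dimensions 20 states × 5 actions follow the paper's stated example, while the variable names and random initialization are illustrative assumptions, not the authors' code):

```python
import numpy as np

# Dimensions from the paper's tabular example: 20 states, 5 actions.
N_STATES, N_ACTIONS = 20, 5

# Initial policy parameters theta_0 in R^{20x5} (random init is an assumption).
rng = np.random.default_rng(0)
theta0 = rng.normal(size=(N_STATES, N_ACTIONS))

def softmax_policy(theta: np.ndarray) -> np.ndarray:
    """pi_theta(a|s) = exp(theta[s, a]) / sum_b exp(theta[s, b]), row-wise."""
    logits = theta - theta.max(axis=1, keepdims=True)  # for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum(axis=1, keepdims=True)

pi = softmax_policy(theta0)  # shape (20, 5); each row sums to 1
```

Each row of `pi` is a valid probability distribution over the 5 actions for one state, matching the softmax formula in the quoted setup.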