Dynamics Adapted Imitation Learning
Authors: Zixuan Liu, Liu Liu, Bingzhe Wu, Lanqing Li, Xueqian Wang, Bo Yuan, Peilin Zhao
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiment evaluation validates that our method achieves superior results on high dimensional continuous control tasks, compared to existing imitation learning methods. We validate the effectiveness of DYNAIL on a variety of high-dimensional continuous control benchmarks with dynamics variations. Section 5 and Appendix show that our algorithm achieves superior results compared to state-of-the-art imitation learning methods. |
| Researcher Affiliation | Collaboration | 1Tsinghua University, 2Tencent AI Lab, 3Research Institute of Tsinghua University in Shenzhen, 4Zhejiang Lab |
| Pseudocode | Yes | Algorithm 1 Dynamics Adapted Imitation Learning (DYNAIL) |
| Open Source Code | Yes | To further demonstrate the efficacy of our methods, we provide experiment videos in https://github.com/Panda-Shawn/DYNAIL |
| Open Datasets | Yes | Custom ant is basically the same as ant from OpenAI Gym (Brockman et al., 2016) except for joint gear ratios. With lower joint gear ratios, the robot flips less often and the agent learns fast. We refer to this environment as Custom Ant-v0. Low Friction Quadruped. This environment is based on the source domain "quadruped" with "realwalk" task from realworldrl-suite (Dulac-Arnold et al., 2020). |
| Dataset Splits | No | We use 40 trajectories collected by the expert as demonstrations. For all the experiments, we use the same pre-collected 40 expert trajectories on the source domain (Custom Ant-v0) as expert demonstrations. |
| Hardware Specification | No | The paper does not provide specific hardware details for running the experiments. |
| Software Dependencies | No | We use PPO (Schulman et al., 2017) for the generator in AIL framework except for humanoid task where we use SAC (Haarnoja et al., 2018) for the generator, to optimize the policy and use 10 parallel environments to collect transitions on target domains. For all the experiments, the expert demonstrations are collected by using RL algorithms in Stable Baselines3 (Raffin et al., 2019). |
| Experiment Setup | Yes | The discriminator Dθ and the classifiers qsa and qsas have the same structure of hidden layers, 2 layers of 256 units each, and a normalized input layer. We use ReLU as the activation after each hidden layer. In all experiments, the discounting factor is 0.99. A key hyperparameter for our method is η, which serves as a tuning regularization, and we defer the full ablation study on η to Appendix B.1. The hyperparameters are shown in Table 1. |
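For concreteness, the network shape quoted in the Experiment Setup row (a normalized input layer, two hidden layers of 256 units each with ReLU, and a scalar logit head shared by the discriminator and classifiers) can be sketched as below. This is a minimal NumPy sketch, not the authors' implementation: the function names (`init_mlp`, `forward`) and the choice of layer normalization for the "normalized input layer" are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Assumed reading of "normalized input layer": standardize each input
    # vector to zero mean and unit variance across its features.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def relu(x):
    # ReLU activation, applied after each hidden layer per the setup.
    return np.maximum(x, 0.0)

def init_mlp(in_dim, hidden=256, out_dim=1, seed=0):
    # Two hidden layers of 256 units each, plus a scalar output head.
    rng = np.random.default_rng(seed)
    dims = [in_dim, hidden, hidden, out_dim]
    return [
        (rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in),
         np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]

def forward(params, x):
    # Normalized input -> ReLU hidden layers -> raw logit.
    h = layer_norm(x)
    for W, b in params[:-1]:
        h = relu(h @ W + b)
    W, b = params[-1]
    return h @ W + b  # logit; a sigmoid would map it to a probability
```

For example, with a 10-dimensional state-action input, `forward(init_mlp(10), np.zeros((4, 10)))` returns one logit per row of the batch, shape `(4, 1)`.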