PN-GAIL: Leveraging Non-optimal Information from Imperfect Demonstrations
Authors: Qiang Liu, Huiqiao Fu, Kaiqiang Tang, Chunlin Chen, Daoyi Dong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that PN-GAIL surpasses conventional baseline methods in dealing with imperfect demonstrations, thereby significantly augmenting the practical utility of imitation learning in real-world contexts. Our codes are available at https://github.com/QiangLiuT/PN-GAIL. Experiments on six control tasks are conducted to show the efficiency of our method in dealing with imperfect demonstrations compared to baseline methods. |
| Researcher Affiliation | Academia | Qiang Liu, Huiqiao Fu, Kaiqiang Tang & Chunlin Chen, School of Management and Engineering, Nanjing University, Nanjing, China (EMAIL, EMAIL); Daoyi Dong, The Australian Artificial Intelligence Institute, University of Technology Sydney, Sydney, Australia (EMAIL) |
| Pseudocode | Yes | The pseudocode for the overall algorithm can be found in Appendix A. Algorithm 1 PN-GAIL |
| Open Source Code | Yes | Our codes are available at https://github.com/QiangLiuT/PN-GAIL. |
| Open Datasets | Yes | Task setup: We conduct experiments across six environments (Pendulum-v1, Ant-v2, Walker2d-v2, Hopper-v2, Swimmer-v2, and HalfCheetah-v2). ... For the Ant-v2, Walker2d-v2, Hopper-v2, Swimmer-v2, and HalfCheetah-v2 environments, to maintain fairness, we directly utilize the demonstrations and confidence scores provided by the code of 2IWIL. |
| Dataset Splits | Yes | During the practical experiments across all six environments, 20% of the given demonstrations are randomly selected to be assigned confidence scores, which means that the label ratio is 0.2. ... In our experiments, we use different numbers of Dc + Du for different tasks, and the specific values are shown in Appendix C.1. Table 3 shows the number of confidence data and unlabeled data used for each task... |
| Hardware Specification | Yes | All of our experiments are run on a single machine with 4 NVIDIA GeForce RTX 3080 GPUs. |
| Software Dependencies | No | The paper mentions TRPO, PPO, SAC as RL methods and Adam as an optimizer, but does not provide specific software library versions (e.g., Python, PyTorch versions) for reproducibility. |
| Experiment Setup | Yes | Table 2, hyper-parameter settings: γ = 0.995; τ (Generalized Advantage Estimation) = 0.97; batch size = 5,000; learning rate 3e-4 (value network), 1e-3 (discriminator), 3e-4 (classifier); optimizer: Adam. |
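The Table 2 settings quoted above can be collected into a single configuration sketch. This is a minimal illustration assuming plain dictionary-based configuration; the key names are hypothetical and are not taken from the released PN-GAIL code.

```python
# Hypothetical configuration mirroring the paper's Table 2.
# Key names are illustrative assumptions, not the repository's actual names.
PN_GAIL_CONFIG = {
    "gamma": 0.995,            # discount factor γ
    "gae_tau": 0.97,           # τ for Generalized Advantage Estimation
    "batch_size": 5000,        # samples per policy update
    "lr_value": 3e-4,          # value-network learning rate (Adam)
    "lr_discriminator": 1e-3,  # discriminator learning rate (Adam)
    "lr_classifier": 3e-4,     # classifier learning rate (Adam)
    "optimizer": "adam",
}

def describe(config: dict) -> str:
    """Render the configuration as a one-line summary string."""
    return "; ".join(f"{k}={v}" for k, v in config.items())

print(describe(PN_GAIL_CONFIG))
```

Keeping all six hyper-parameters in one mapping makes it easy to log the exact settings alongside each run, which is the kind of detail the "Software Dependencies" row above flags as missing from the paper.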