PN-GAIL: Leveraging Non-optimal Information from Imperfect Demonstrations

Authors: Qiang Liu, Huiqiao Fu, Kaiqiang Tang, Chunlin Chen, Daoyi Dong

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that PN-GAIL surpasses conventional baseline methods in dealing with imperfect demonstrations, thereby significantly augmenting the practical utility of imitation learning in real-world contexts. Our codes are available at https://github.com/QiangLiuT/PN-GAIL. Experiments on six control tasks are conducted to show the efficiency of our method in dealing with imperfect demonstrations compared to baseline methods.
Researcher Affiliation | Academia | Qiang Liu, Huiqiao Fu, Kaiqiang Tang & Chunlin Chen, School of Management and Engineering, Nanjing University, Nanjing, China (EMAIL, EMAIL); Daoyi Dong, The Australian Artificial Intelligence Institute, University of Technology Sydney, Sydney, Australia (EMAIL)
Pseudocode | Yes | The pseudocode for the overall algorithm can be found in Appendix A (Algorithm 1, PN-GAIL).
Open Source Code | Yes | Our codes are available at https://github.com/QiangLiuT/PN-GAIL.
Open Datasets | Yes | Task setup: We conduct experiments across six environments (Pendulum-v1, Ant-v2, Walker2d-v2, Hopper-v2, Swimmer-v2, and HalfCheetah-v2). ... For the Ant-v2, Walker2d-v2, Hopper-v2, Swimmer-v2, and HalfCheetah-v2 environments, to maintain fairness, we directly utilize the demonstrations and confidence scores provided by the code of 2IWIL.
Dataset Splits | Yes | During the practical experiments across all six environments, 20% of the given demonstrations are randomly selected to be assigned confidence scores, which means that the label ratio is 0.2. ... In our experiments, we use different numbers of Dc + Du for different tasks, and the specific values are shown in Appendix C.1. Table 3 shows the number of confidence data and unlabeled data used for each task...
Hardware Specification | Yes | All of our experiments are run on a single machine with 4 NVIDIA GeForce RTX 3080 GPUs.
Software Dependencies | No | The paper mentions TRPO, PPO, and SAC as RL methods and Adam as an optimizer, but does not provide specific software library versions (e.g., Python or PyTorch versions) for reproducibility.
Experiment Setup | Yes | Table 2: Hyper-parameter settings — γ: 0.995; τ (Generalized Advantage Estimation): 0.97; batch size: 5,000; learning rate (value network): 3e-4; learning rate (discriminator): 1e-3; learning rate (classifier): 3e-4; optimizer: Adam.
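The reported split protocol (label ratio 0.2, yielding confidence-labeled data Dc and unlabeled data Du) and the Table 2 hyper-parameters can be collected into a minimal sketch. This is an illustrative reconstruction, not the authors' released code: the CONFIG keys and the split_demonstrations helper are hypothetical names; only the numeric values come from the excerpts above.

```python
import random

# Hyper-parameters from the paper's Table 2, plus the reported label ratio.
# The dict layout itself is illustrative, not the authors' config format.
CONFIG = {
    "gamma": 0.995,          # discount factor γ
    "gae_tau": 0.97,         # τ for Generalized Advantage Estimation
    "batch_size": 5000,
    "lr_value": 3e-4,        # value-network learning rate
    "lr_discriminator": 1e-3,
    "lr_classifier": 3e-4,
    "optimizer": "Adam",
    "label_ratio": 0.2,      # 20% of demonstrations receive confidence scores
}

def split_demonstrations(demos, label_ratio=0.2, seed=0):
    """Randomly assign `label_ratio` of the demonstrations to the
    confidence-labeled set Dc; the remainder form the unlabeled set Du."""
    rng = random.Random(seed)
    shuffled = list(demos)
    rng.shuffle(shuffled)
    n_labeled = int(len(shuffled) * label_ratio)
    return shuffled[:n_labeled], shuffled[n_labeled:]

dc, du = split_demonstrations(range(1000), CONFIG["label_ratio"])
print(len(dc), len(du))  # 200 800
```

With a label ratio of 0.2, a pool of 1,000 demonstration samples splits into 200 confidence-labeled and 800 unlabeled samples, matching the protocol described in the Dataset Splits row.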