Imitation Learning via Focused Satisficing
Authors: Rushit N. Shah, Nikolaos Agadakos, Synthia Sasulski, Ali Farajzadeh, Sanjiban Choudhury, Brian Ziebart
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments using a mix of simple, classic control environments (cartpole, lunarlander) and complex robotics environments (MuJoCo hopper, halfcheetah, walker) from OpenAI Gym [Brockman et al., 2016]. For each environment, we obtain 100 demonstrations from a suboptimal policy learned using PPO. This ensures that the majority of the resulting demonstrations are suboptimal and noisy. Human demonstrations for the lunarlander used in Section 3.7 are collected from non-expert human players using the joysticks on an Xbox 360 video game controller. Demonstration return statistics for environment-specific demonstration sets of varying quality are provided in Table 1. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, University of Illinois Chicago; ²Department of Computer Science, Cornell University |
| Pseudocode | Yes | Algorithm 1 Online subdominance policy gradient... Algorithm 2 Snippet-based subdominance policy gradient... Algorithm 3 Offline, joint stochastic optimization |
| Open Source Code | No | The paper does not provide an explicit statement about the release of their own source code or a direct link to a code repository for the methodology described. It mentions using Stable Baselines3, which is a third-party tool. |
| Open Datasets | Yes | We conduct experiments using a mix of simple, classic control environments (cartpole, lunarlander) and complex robotics environments (MuJoCo hopper, halfcheetah, walker) from OpenAI Gym [Brockman et al., 2016]. |
| Dataset Splits | Yes | We sort all demonstrations by their total (true) return and then choose a subset by retaining the best or worst 90%, 80%, 70%, or 60% of the original set. We use this demonstration subset to train T-REX and Online Min Sub FI. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running the experiments. |
| Software Dependencies | No | We implement the policy optimization of Min Sub FI using Stable Baselines3 [Raffin et al., 2021]. Across all experiments, all baseline methods use the same base policy model paired with Stable Baselines3's implementation of the PPO algorithm [Schulman et al., 2017]. The paper mentions software packages like Stable Baselines3 and OpenAI Gym, but does not provide specific version numbers for them within the text. |
| Experiment Setup | Yes | The experiments are not based on extensive hyperparameter tuning; rather, all policy networks use nearly the same hyperparameters (Table 2). Table 2: Values of PPO hyperparameters for each environment. For cartpole: learning rate 1e-4, entropy coefficient 0, clip range 0.2, batch size 512, horizon 2048, epochs 10, 2e6 total steps. |
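The Dataset Splits row describes sorting all demonstrations by their total (true) return and retaining the best or worst 90%, 80%, 70%, or 60% of the set. A minimal sketch of that filtering step, assuming each demonstration carries its true return (the `Demo` and `filter_demos` names are illustrative, not from the paper's code):

```python
# Sketch of the demonstration-filtering step: sort demonstrations by their
# total (true) return, then keep the best or worst fraction of the set.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Demo:
    # Illustrative container: a trajectory plus its true environment return.
    states: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    total_return: float = 0.0

def filter_demos(demos: List[Demo], frac: float, keep: str = "best") -> List[Demo]:
    """Retain the best or worst `frac` (e.g. 0.9, 0.8, 0.7, 0.6) of demos by return."""
    ordered = sorted(demos, key=lambda d: d.total_return, reverse=(keep == "best"))
    k = int(round(frac * len(demos)))
    return ordered[:k]
```

For example, `filter_demos(demos, 0.7, keep="worst")` would retain the 70% lowest-return demonstrations, matching the "best or worst" subsets used to train T-REX and Online Min Sub FI.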
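Since the paper reports using Stable Baselines3's PPO with the Table 2 hyperparameters, the cartpole row could plausibly map onto PPO keyword arguments as below. The label-to-argument mapping ("entropy coeff" to `ent_coef`, the clipping value to `clip_range`, "horizon" to `n_steps`, and 2e6 as total training timesteps) is our reading of the table, not the authors' released configuration:

```python
# Hedged sketch: the cartpole hyperparameters from Table 2 expressed as
# Stable Baselines3 PPO keyword arguments. The name mapping is an assumption.
cartpole_ppo_kwargs = dict(
    learning_rate=1e-4,  # "learning rate 1e-4"
    ent_coef=0.0,        # "entropy coeff 0"
    clip_range=0.2,      # PPO clipping parameter, read from the 0.2 entry
    batch_size=512,      # minibatch size, read from the 512 entry
    n_steps=2048,        # rollout horizon per update, "horizon 2048"
    n_epochs=10,         # "epochs 10"
)
total_timesteps = int(2e6)  # read as the total training steps

# With stable-baselines3 and a Gym environment installed, this would be used as:
# from stable_baselines3 import PPO
# model = PPO("MlpPolicy", "CartPole-v1", **cartpole_ppo_kwargs)
# model.learn(total_timesteps=total_timesteps)
```

The paper states that a suboptimal intermediate PPO policy trained this way is then rolled out to produce the 100 noisy demonstrations per environment.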