Imitation Learning via Focused Satisficing

Authors: Rushit N. Shah, Nikolaos Agadakos, Synthia Sasulski, Ali Farajzadeh, Sanjiban Choudhury, Brian Ziebart

IJCAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "We conduct experiments using a mix of simple, classic control environments (cartpole, lunarlander) and complex robotics environments (MuJoCo hopper, halfcheetah, walker) from OpenAI Gym [Brockman et al., 2016]. For each environment, we obtain 100 demonstrations from a suboptimal policy learned using PPO. This ensures that the majority of the resulting demonstrations are suboptimal and noisy. Human demonstrations for the lunarlander used in Section 3.7 are collected from non-expert human players using the joysticks on an Xbox 360 video game controller. Demonstration return statistics for environment-specific demonstration sets of varying quality are provided in Table 1."

Researcher Affiliation | Academia | "1 Department of Computer Science, University of Illinois Chicago; 2 Department of Computer Science, Cornell University. EMAIL, EMAIL, EMAIL"

Pseudocode | Yes | "Algorithm 1: Online subdominance policy gradient; Algorithm 2: Snippet-based subdominance policy gradient; Algorithm 3: Offline, joint stochastic optimization"

Open Source Code | No | The paper does not explicitly state that its own source code is released, nor does it link to a code repository for the described methodology. It mentions using Stable Baselines3, which is a third-party tool.

Open Datasets | Yes | "We conduct experiments using a mix of simple, classic control environments (cartpole, lunarlander) and complex robotics environments (MuJoCo hopper, halfcheetah, walker) from OpenAI Gym [Brockman et al., 2016]."

Dataset Splits | Yes | "We sort all demonstrations by their total (true) return and then choose a subset by retaining the best or worst 90%, 80%, 70%, or 60% of the original set. We use this demonstration subset to train T-REX and Online Min Sub FI."

Hardware Specification | No | The paper does not provide specific hardware details, such as GPU models, CPU types, or memory amounts, used for running the experiments.

Software Dependencies | No | "We implement the policy optimization of Min Sub FI using Stable Baselines3 [Raffin et al., 2021]. Across all experiments, all baseline methods use the same base policy model paired with Stable Baselines3's implementation of the PPO algorithm [Schulman et al., 2017]." The paper names software packages such as Stable Baselines3 and OpenAI Gym, but does not provide version numbers for them.
Experiment Setup | Yes | "The experiments are not based on extensive hyperparameter tuning; rather, all policy networks use nearly the same hyperparameters (Table 2)." Table 2 ("Values of PPO hyperparameters for each environment") lists, for cartpole: learning rate 1e-4, entropy coefficient 0, clip range 0.2, minibatch size 512, horizon 2048, 10 epochs, and 2e6 total steps.
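The Dataset Splits row describes a concrete selection procedure: sort all demonstrations by total (true) return, then retain the best or worst 90%, 80%, 70%, or 60% of the set. A minimal sketch of that procedure is below; the function name and data layout are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def select_subset(demonstrations, returns, fraction, keep="best"):
    """Retain the best or worst `fraction` of demonstrations by total return.

    demonstrations: list of trajectories (any objects)
    returns: per-demonstration total (true) returns, same length
    fraction: e.g. 0.9, 0.8, 0.7, or 0.6 as in the paper's splits
    keep: "best" keeps the highest-return subset, "worst" the lowest
    """
    order = np.argsort(returns)  # indices sorted by return, ascending
    k = int(round(fraction * len(demonstrations)))
    idx = order[-k:] if keep == "best" else order[:k]
    return [demonstrations[i] for i in sorted(idx)]
```

Such a subset would then serve as the training set for T-REX and Online Min Sub FI, per the quoted description.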