Illustrated Landmark Graphs for Long-horizon Policy Learning

Authors: Christopher Watson, Arjun Krishna, Rajeev Alur, Dinesh Jayaraman

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct experiments on long-horizon block stacking and point maze navigation tasks, and find that our approach achieves considerably higher success rates (50% improvement) compared to hierarchical reinforcement learning and imitation learning baselines."
Researcher Affiliation | Academia | Christopher Watson (EMAIL), Arjun Krishna (EMAIL), Rajeev Alur (EMAIL), and Dinesh Jayaraman (EMAIL), all with the Department of Computer and Information Science, University of Pennsylvania.
Pseudocode | Yes | Algorithm 1: ILG-Learn. Input: ILG structure (U, E, u0); access to MDP M via reset and step; access to a human teacher via requestIllustration and querySuccess. Output: path ρ and associated path policy π.
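The interfaces that Algorithm 1 takes as input can be sketched in Python as follows. Only the names quoted from the paper (U, E, u0; reset/step on the MDP; the requestIllustration and querySuccess teacher queries; path ρ and policy π) come from the source; the class structure, method signatures, and snake_case spellings below are hypothetical scaffolding, and the learning logic itself is omitted.

```python
# Minimal sketch of the inputs Algorithm 1 (ILG-Learn) assumes.
# Everything beyond the names quoted in the paper is illustrative.
from dataclasses import dataclass

@dataclass
class ILG:
    U: set       # landmark nodes
    E: set       # directed edges between landmarks
    u0: object   # initial landmark

class Teacher:
    """Stand-in for the human teacher the algorithm queries."""
    def request_illustration(self, edge):
        # Would return a human demonstration for traversing one edge.
        raise NotImplementedError

    def query_success(self, trajectory, landmark):
        # Would answer whether a rollout reached the given landmark.
        raise NotImplementedError

def ilg_learn(graph, env, teacher):
    """Skeleton of the outer loop: choose a path of landmarks through
    the ILG and learn one policy per traversed edge (learning omitted)."""
    path, edge_policies = [graph.u0], []
    # ... edge selection, illustration requests, and policy
    # learning would go here ...
    return path, edge_policies
```

The sketch only fixes the call surface; the actual edge-selection and policy-learning steps live in the paper's implementation.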
Open Source Code | Yes | "Our implementation is available at https://github.com/cwatson1998/ilg-learn. We implement ILG-Learn in Python."
Open Datasets | Yes | Stack: "Our Stack family of environments is a customized robosuite (Zhu et al., 2020) environment that simulates a 7-DoF Franka Panda arm..." Point Maze: "We use custom layouts of the Point Maze environment from Gymnasium Robotics (de Lazcano et al., 2023; Fu et al., 2020)."
Dataset Splits | Yes | "For each task, we had a held-out set of 10 validation demonstrations. We stopped training when the validation loss stopped decreasing." (D.2, BC (MLP) baseline) "For each environment, we include 10 demonstrations and an additional 5 held-out demonstrations for evaluating loss during training." (E, Environment Details, Point Maze)
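The stopping rule quoted here, training until loss on the held-out validation demonstrations stops decreasing, can be sketched as below. The function names, the patience value, and the epoch cap are illustrative assumptions; the paper does not specify how "stopped decreasing" is operationalized.

```python
# Hypothetical sketch of early stopping on held-out validation loss.
# `patience` and `max_epochs` are illustrative; the paper gives no values.
def train_with_early_stopping(train_step, val_loss, max_epochs=1000, patience=5):
    """train_step() runs one training epoch; val_loss() returns the
    current loss on the held-out validation demonstrations. Training
    halts after `patience` consecutive epochs without improvement."""
    best, stale = float("inf"), 0
    for _ in range(max_epochs):
        train_step()
        loss = val_loss()
        if loss < best:
            best, stale = loss, 0   # improvement: reset the counter
        else:
            stale += 1              # no improvement this epoch
            if stale >= patience:
                break
    return best
```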
Hardware Specification | No | The paper mentions simulation environments like robosuite and Gymnasium Robotics and the MuJoCo physics engine, but does not specify the hardware (e.g., GPU/CPU models) used to run the experiments. Phrases like "simulates a 7-DoF Franka Panda arm" describe the simulated environment, not physical hardware.
Software Dependencies | No | The paper mentions several software components (Python, jaxrl, JAX, Flax, PyTorch, robosuite, Gymnasium Robotics, and MuJoCo) but does not provide version numbers for any of them.
Experiment Setup | Yes | Table 3: ILG-Learn-specific parameter selection for the experiments shown in Section 5 (specific values for illustrationCount, episodeLength, intervalLength, intervalsLimit, successThreshold, estimationQueries, exploitationBonus, and edgeExtensionPenalty for each task). "We increased the width of each hidden layer in the actor and critic MLPs from 128 to 256. We use 10-step returns and max as the Q combinator for all tasks." For BC (MLP): "Our MLP had 2 hidden layers (each of width 512); we used the Huber loss and a learning rate of 0.0001." Table 5: Diffusion Policy common training hyperparameters (prediction horizon, learning rate, weight decay, input embed dim, step embed dim, U-Net downsample dims, kernel size, num diffusion steps, EMA power, batch size).
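The BC (MLP) baseline details quoted above (2 hidden layers of width 512, Huber loss, learning rate 0.0001) pin down most of a PyTorch setup; a rough sketch follows. The observation/action dimensions, the ReLU activations, and the choice of Adam are assumptions not stated in the excerpt, so the paper's exact configuration may differ.

```python
# Sketch of the BC (MLP) baseline configuration: two hidden layers of
# width 512, Huber loss, learning rate 1e-4. Dimensions, activations,
# and the Adam optimizer are illustrative assumptions.
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 7  # placeholder dimensions

policy = nn.Sequential(
    nn.Linear(obs_dim, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, act_dim),
)
criterion = nn.HuberLoss()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def bc_update(obs, expert_act):
    """One behavior-cloning step: regress predicted actions onto the
    expert's actions under the Huber loss."""
    optimizer.zero_grad()
    loss = criterion(policy(obs), expert_act)
    loss.backward()
    optimizer.step()
    return loss.item()
```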