Illustrated Landmark Graphs for Long-horizon Policy Learning

Authors: Christopher Watson, Arjun Krishna, Rajeev Alur, Dinesh Jayaraman

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct experiments on long-horizon block stacking and point maze navigation tasks, and find that our approach achieves considerably higher success rates (50% improvement) compared to hierarchical reinforcement learning and imitation learning baselines."
Researcher Affiliation | Academia | Christopher Watson (EMAIL), Arjun Krishna (EMAIL), Rajeev Alur (EMAIL), and Dinesh Jayaraman (EMAIL), all with the Department of Computer and Information Science, University of Pennsylvania.
Pseudocode | Yes | Algorithm 1: ILG-Learn. Input: ILG structure (U, E, u0); access to MDP M via reset and step; access to a human teacher via requestIllustration and querySuccess. Output: path ρ and associated path policy π.
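The interfaces that Algorithm 1 takes as input can be sketched in Python as follows. Only the names quoted from the paper (U, E, u0; reset/step on the MDP; the requestIllustration and querySuccess teacher queries; path ρ and policy π) come from the source; the class structure, method signatures, and snake_case spellings below are hypothetical scaffolding, and the learning logic itself is omitted.

```python
# Minimal sketch of the inputs Algorithm 1 (ILG-Learn) assumes.
# Everything beyond the names quoted in the paper is illustrative.
from dataclasses import dataclass

@dataclass
class ILG:
    U: set       # landmark nodes
    E: set       # directed edges between landmarks
    u0: object   # initial landmark

class Teacher:
    """Stand-in for the human teacher the algorithm queries."""
    def request_illustration(self, edge):
        # Would return a human demonstration for traversing one edge.
        raise NotImplementedError

    def query_success(self, trajectory, landmark):
        # Would answer whether a rollout reached the given landmark.
        raise NotImplementedError

def ilg_learn(graph, env, teacher):
    """Skeleton of the outer loop: choose a path of landmarks through
    the ILG and learn one policy per traversed edge (learning omitted)."""
    path, edge_policies = [graph.u0], []
    # ... edge selection, illustration requests, and policy
    # learning would go here ...
    return path, edge_policies
```

The sketch only fixes the call surface; the actual edge-selection and policy-learning steps live in the paper's implementation.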
Open Source Code | Yes | "Our implementation is available at https://github.com/cwatson1998/ilg-learn. We implement ILG-Learn in Python."
Open Datasets | Yes | Stack: "Our Stack family of environments is a customized robosuite (Zhu et al., 2020) environment that simulates a 7-DoF Franka Panda arm..." Point Maze: "We use custom layouts of the Point Maze environment from Gymnasium Robotics (de Lazcano et al., 2023; Fu et al., 2020)."
Dataset Splits | Yes | "For each task, we had a held-out set of 10 validation demonstrations. We stopped training when the validation loss stopped decreasing." (D.2, BC (MLP) baseline) "For each environment, we include 10 demonstrations and an additional 5 held-out demonstrations for evaluating loss during training." (E, Environment Details, Point Maze)
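The stopping rule quoted here, training until loss on the held-out validation demonstrations stops decreasing, can be sketched as below. The function names, the patience value, and the epoch cap are illustrative assumptions; the paper does not specify how "stopped decreasing" is operationalized.

```python
# Hypothetical sketch of early stopping on held-out validation loss.
# `patience` and `max_epochs` are illustrative; the paper gives no values.
def train_with_early_stopping(train_step, val_loss, max_epochs=1000, patience=5):
    """train_step() runs one training epoch; val_loss() returns the
    current loss on the held-out validation demonstrations. Training
    halts after `patience` consecutive epochs without improvement."""
    best, stale = float("inf"), 0
    for _ in range(max_epochs):
        train_step()
        loss = val_loss()
        if loss < best:
            best, stale = loss, 0   # improvement: reset the counter
        else:
            stale += 1              # no improvement this epoch
            if stale >= patience:
                break
    return best
```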
Hardware Specification | No | The paper mentions simulation environments like robosuite and Gymnasium Robotics and the MuJoCo physics engine, but does not specify the hardware (e.g., GPU/CPU models) used to run the experiments. Phrases like "simulates a 7-DoF Franka Panda arm" describe the simulated environment, not physical hardware.
Software Dependencies | No | The paper mentions several software components (Python, jaxrl, JAX, Flax, PyTorch, robosuite, Gymnasium Robotics, and MuJoCo) but does not provide version numbers for any of them.
Experiment Setup | Yes | Table 3: ILG-Learn-specific parameter selection for the experiments shown in Section 5 (specific values for illustrationCount, episodeLength, intervalLength, intervalsLimit, successThreshold, estimationQueries, exploitationBonus, and edgeExtensionPenalty for each task). "We increased the width of each hidden layer in the actor and critic MLPs from 128 to 256. We use 10-step returns and max as the Q combinator for all tasks." For BC (MLP): "Our MLP had 2 hidden layers (each of width 512); we used the Huber loss and a learning rate of 0.0001." Table 5: Diffusion Policy common training hyperparameters (prediction horizon, learning rate, weight decay, input embed dim, step embed dim, U-Net downsample dims, kernel size, num diffusion steps, EMA power, batch size).
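The BC (MLP) baseline details quoted above (2 hidden layers of width 512, Huber loss, learning rate 0.0001) pin down most of a PyTorch setup; a rough sketch follows. The observation/action dimensions, the ReLU activations, and the choice of Adam are assumptions not stated in the excerpt, so the paper's exact configuration may differ.

```python
# Sketch of the BC (MLP) baseline configuration: two hidden layers of
# width 512, Huber loss, learning rate 1e-4. Dimensions, activations,
# and the Adam optimizer are illustrative assumptions.
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 7  # placeholder dimensions

policy = nn.Sequential(
    nn.Linear(obs_dim, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, act_dim),
)
criterion = nn.HuberLoss()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def bc_update(obs, expert_act):
    """One behavior-cloning step: regress predicted actions onto the
    expert's actions under the Huber loss."""
    optimizer.zero_grad()
    loss = criterion(policy(obs), expert_act)
    loss.backward()
    optimizer.step()
    return loss.item()
```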