What's the Move? Hybrid Imitation Learning via Salient Points
Authors: Priya Sundaresan, Hengyuan Hu, Quan Vuong, Jeannette Bohg, Dorsa Sadigh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method achieves 86.7% success across 4 real-world and 2 simulated tasks, outperforming the next best state-of-the-art IL baseline by 41.1% on average across 440 real-world trials. SPHINX additionally generalizes to novel viewpoints, visual distractors, spatial arrangements, and execution speeds with a 1.7x speedup over the most competitive baseline. |
| Researcher Affiliation | Collaboration | Priya Sundaresan1, Hengyuan Hu1, Quan Vuong2, Jeannette Bohg1, Dorsa Sadigh1 (Equal contribution). 1Stanford University, 2Physical Intelligence |
| Pseudocode | No | The paper describes the method and architecture using textual descriptions and diagrams (e.g., Figure 3 for SPHINX-Waypoint Architecture & Training Objectives), but does not contain a dedicated pseudocode or algorithm block with structured steps. |
| Open Source Code | Yes | Our website contains code for data collection and training code along with supplementary videos: http://sphinx-manip.github.io. |
| Open Datasets | Yes | We consider three tasks, one real task of opening an articulated drawer (Drawer, 20 demonstrations) and two simulated environments in Robomimic (Mandlekar et al., 2021) (Can, 20 demonstrations, and Square, 50 demonstrations). |
| Dataset Splits | No | The paper mentions the number of demonstrations for each task (e.g., "Drawer, 20 demonstrations", "Can, 20 demonstrations", "Square, 50 demonstrations") and details about data augmentation for training, but it does not specify explicit training, validation, and test splits for these demonstrations in a reproducible manner. For example, it does not state how many demonstrations are used for training versus evaluation. |
| Hardware Specification | No | In all experiments, we assume access to two external camera viewpoints and a wrist-mounted camera on a Franka Panda robot. The paper specifies the robot used (Franka Panda) and camera setup, but does not provide details on the computing hardware (e.g., specific CPU or GPU models) used for training or inference. |
| Software Dependencies | No | We optimize the waypoint policy with Adam (Kingma & Ba, 2015) optimizer with base learning rate 1e-4 and cosine learning decay over the entire training process... The dense policy in SPHINX is a diffusion policy. We closely follow the original implementation of Chi et al. (2023). Specifically, we use ResNet-18 (He et al., 2016) encoder to process the wrist image... The paper mentions specific optimizers (Adam) and model architectures (ResNet-18, UNet) but does not provide specific version numbers for software libraries or programming languages (e.g., PyTorch version, Python version). |
| Experiment Setup | Yes | We optimize the waypoint policy with Adam (Kingma & Ba, 2015) optimizer with base learning rate 1e-4 and cosine learning decay over the entire training process, i.e. decaying to 0 at the end of training. We clip the gradient with maximum norm 1. We set batch size to 64. We also maintain an exponential moving average (EMA) of the policy with the decay rate annealing from 0 to 0.9999. We use the final EMA policy in all evaluations without any further model selection. All waypoint policies are trained for 2000 epochs. The Transformer has 6 layers and each layer has 512 embedding dimensions over 8 attention heads. We remove positional embeddings from Transformer as the point cloud input has no ordering. We set dropout to 0.1 for all Transformer blocks to avoid overfitting. |
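The reported hyperparameters (base learning rate 1e-4 with cosine decay to 0, batch size 64, gradient clipping at norm 1, and an EMA whose decay anneals from 0 to 0.9999) can be collected into a small sketch. This is illustrative only: the function names are ours, and the exact EMA-annealing schedule is not stated in the paper, so the warmup formula below is an assumption borrowed from common diffusion-policy codebases.

```python
import math

# Hyperparameters reported for the SPHINX waypoint policy.
BASE_LR = 1e-4
BATCH_SIZE = 64
EPOCHS = 2000
MAX_GRAD_NORM = 1.0
EMA_MAX_DECAY = 0.9999

def cosine_lr(step: int, total_steps: int) -> float:
    """Cosine learning-rate decay from BASE_LR down to 0 at the end of training."""
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

def ema_decay(step: int) -> float:
    """EMA decay annealed from ~0 toward EMA_MAX_DECAY.

    Assumption: the paper does not give the annealing formula; the
    (1 + step) / (10 + step) warmup capped at EMA_MAX_DECAY is a common
    choice in diffusion-policy implementations.
    """
    return min(EMA_MAX_DECAY, (1 + step) / (10 + step))

def ema_update(ema_params, params, decay):
    """One EMA step over flat lists of parameter values."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

Gradient clipping (max norm 1) and the Adam optimizer themselves would come from the training framework (e.g. `torch.nn.utils.clip_grad_norm_` and `torch.optim.Adam` in a PyTorch setup); the sketch only captures the schedules the row describes.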