What's the Move? Hybrid Imitation Learning via Salient Points
Authors: Priya Sundaresan, Hengyuan Hu, Quan Vuong, Jeannette Bohg, Dorsa Sadigh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method achieves 86.7% success across 4 real-world and 2 simulated tasks, outperforming the next best state-of-the-art IL baseline by 41.1% on average across 440 real-world trials. SPHINX additionally generalizes to novel viewpoints, visual distractors, spatial arrangements, and execution speeds with a 1.7x speedup over the most competitive baseline. |
| Researcher Affiliation | Collaboration | Priya Sundaresan1, Hengyuan Hu1, Quan Vuong2, Jeannette Bohg1, Dorsa Sadigh1 (Equal contribution). 1Stanford University, 2Physical Intelligence |
| Pseudocode | No | The paper describes the method and architecture using textual descriptions and diagrams (e.g., Figure 3 for SPHINX-Waypoint Architecture & Training Objectives), but does not contain a dedicated pseudocode or algorithm block with structured steps. |
| Open Source Code | Yes | Our website contains code for data collection and training code along with supplementary videos: http://sphinx-manip.github.io. |
| Open Datasets | Yes | We consider three tasks, one real task of opening an articulated drawer (Drawer, 20 demonstrations) and two simulated environments in Robomimic (Mandlekar et al., 2021) (Can, 20 demonstrations, and Square, 50 demonstrations). |
| Dataset Splits | No | The paper mentions the number of demonstrations for each task (e.g., "Drawer, 20 demonstrations", "Can, 20 demonstrations", "Square, 50 demonstrations") and details about data augmentation for training, but it does not specify explicit training, validation, and test splits for these demonstrations in a reproducible manner. For example, it does not state how many demonstrations are used for training versus evaluation. |
| Hardware Specification | No | In all experiments, we assume access to two external camera viewpoints and a wrist-mounted camera on a Franka Panda robot. The paper specifies the robot used (Franka Panda) and camera setup, but does not provide details on the computing hardware (e.g., specific CPU or GPU models) used for training or inference. |
| Software Dependencies | No | We optimize the waypoint policy with Adam (Kingma & Ba, 2015) optimizer with base learning rate 1e-4 and cosine learning decay over the entire training process... The dense policy in SPHINX is a diffusion policy. We closely follow the original implementation of Chi et al. (2023). Specifically, we use ResNet-18 (He et al., 2016) encoder to process the wrist image... The paper mentions specific optimizers (Adam) and model architectures (ResNet-18, UNet) but does not provide specific version numbers for software libraries or programming languages (e.g., PyTorch version, Python version). |
| Experiment Setup | Yes | We optimize the waypoint policy with Adam (Kingma & Ba, 2015) optimizer with base learning rate 1e-4 and cosine learning decay over the entire training process, i.e. decaying to 0 at the end of training. We clip the gradient with maximum norm 1. We set batch size to 64. We also maintain an exponential moving average (EMA) of the policy with the decay rate annealing from 0 to 0.9999. We use the final EMA policy in all evaluations without any further model selection. All waypoint policies are trained for 2000 epochs. The Transformer has 6 layers and each layer has 512 embedding dimensions over 8 attention heads. We remove positional embeddings from Transformer as the point cloud input has no ordering. We set dropout to 0.1 for all Transformer blocks to avoid overfitting. |
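The reported hyperparameters (base learning rate 1e-4 with cosine decay to 0, batch size 64, gradient clipping at norm 1, and an EMA whose decay anneals from 0 to 0.9999) can be collected into a small sketch. This is illustrative only: the function names are ours, and the exact EMA-annealing schedule is not stated in the paper, so the warmup formula below is an assumption borrowed from common diffusion-policy codebases.

```python
import math

# Hyperparameters reported for the SPHINX waypoint policy.
BASE_LR = 1e-4
BATCH_SIZE = 64
EPOCHS = 2000
MAX_GRAD_NORM = 1.0
EMA_MAX_DECAY = 0.9999

def cosine_lr(step: int, total_steps: int) -> float:
    """Cosine learning-rate decay from BASE_LR down to 0 at the end of training."""
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

def ema_decay(step: int) -> float:
    """EMA decay annealed from ~0 toward EMA_MAX_DECAY.

    Assumption: the paper does not give the annealing formula; the
    (1 + step) / (10 + step) warmup capped at EMA_MAX_DECAY is a common
    choice in diffusion-policy implementations.
    """
    return min(EMA_MAX_DECAY, (1 + step) / (10 + step))

def ema_update(ema_params, params, decay):
    """One EMA step over flat lists of parameter values."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

Gradient clipping (max norm 1) and the Adam optimizer themselves would come from the training framework (e.g. `torch.nn.utils.clip_grad_norm_` and `torch.optim.Adam` in a PyTorch setup); the sketch only captures the schedules the row describes.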