HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation
Authors: Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Caelan Garrett, Fabio Ramos, Dieter Fox, Anqi Li, Abhishek Gupta, Ankit Goyal
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the real-robot experiments, we observe an average of 20% improvement in success rate across seven different axes of generalization over OpenVLA, representing a 50% relative gain. Visual results are provided at: https://hamster-robot.github.io/. Section 5: EXPERIMENTAL EVALUATION |
| Researcher Affiliation | Collaboration | NVIDIA; University of Washington; University of Southern California |
| Pseudocode | No | The paper describes its methodology in text and figures (e.g., Figure 2 and Figure 9) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "Since HAMSTER is built on both open-source VLMs and low-level policies, it can serve as a fully open-sourced enabler for the community building vision-language-action models." This indicates the use of open-source components, but not an explicit release of the authors' own implementation code for HAMSTER. The provided URL "https://hamster-robot.github.io/" is a project page, not a specific code repository. |
| Open Datasets | Yes | Pixel Point Prediction. For pixel point prediction, we use the RoboPoint dataset (Yuan et al., 2024b) with 770k pixel point prediction tasks. Simulated Robot Data. We additionally generate a dataset of simulated robotics tasks from RLBench (James et al., 2020). Real Robot Data. We source 10k trajectories from the Bridge dataset (Walke et al., 2023; Collaboration et al., 2023) and around 45k trajectories from DROID (Khazatsky et al., 2024). We also include a 660k-sample VQA dataset (Liu et al., 2024c) for co-training to preserve world knowledge. |
| Dataset Splits | Yes | The low-level 3D policies are trained with 320 episodes collected via teleoperation. We evaluate our approach in both simulation and real-world experiments. We generate 1000 episodes for each of 81 robot manipulation tasks in RLBench. Colosseum contains 100 training episodes for each task, without any visual variations, and evaluates on 25 evaluation episodes for each variation. For our real-world experiments, we collected all data using a Franka Panda arm through human teleoperation... Pick and place. We collected 220 episodes using 10 toy objects. Knock down objects. We collected 50 episodes with various objects... Press button. We collected 50 episodes with 4 colored buttons. |
| Hardware Specification | Yes | We train our VLM, VILA-1.5-13B (Lin et al., 2024), on a node equipped with eight NVIDIA A100 GPUs, each utilizing approximately 65 GB of memory. Training was conducted with 1 or 2 A6000 GPUs (which determined the global batch size of 16 or 32). |
| Software Dependencies | No | The paper mentions specific models like VILA-1.5-13B, RVT-2, and 3D-DA, but does not provide specific version numbers for underlying software dependencies such as Python, PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | We use an effective batch size of 256 and a learning rate of 1×10⁻⁵. We keep the overall architecture and training hyperparameters the same as the paper settings. In real-world experiments, we simplify the language instruction in the same way as for RVT-2 when conditioning on HAMSTER 2D paths... In addition, we reduced the embedding dimension of the transformer to 60 from 120, removed proprioception information from past timesteps, and reduced the number of transformer heads to 6 from 12 in order to prevent overfitting. |
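The hyperparameter values quoted in the Experiment Setup row can be collected into a minimal configuration sketch. Only the numeric values come from the paper; the field names and the `TrainConfig` structure are illustrative assumptions, not the authors' actual code:

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:
    """Hypothetical container for the hyperparameters reported in the paper."""
    effective_batch_size: int = 256       # "effective batch size of 256"
    learning_rate: float = 1e-5           # "learning rate of 1×10⁻⁵"
    embedding_dim: int = 60               # reduced from 120 to prevent overfitting
    num_heads: int = 6                    # reduced from 12
    use_past_proprioception: bool = False # proprioception removed from past timesteps


cfg = TrainConfig()
# Both reductions halve the original values (120 -> 60, 12 -> 6),
# so the per-head dimension (embedding_dim / num_heads) is unchanged.
per_head_dim = cfg.embedding_dim // cfg.num_heads
print(per_head_dim)
```

Note that halving both the embedding dimension and the head count leaves the per-head dimension at 10, so the reported change shrinks model capacity without altering the attention-head geometry.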