Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
Authors: Lucy Xiaoyang Shi, Brian Ichter, Michael Robert Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, Chelsea Finn
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our system across three robotic platforms, including single-arm, dual-arm, and dual-arm mobile robots, demonstrating its ability to handle tasks such as cleaning messy tables, making sandwiches, and grocery shopping. |
| Researcher Affiliation | Collaboration | 1Physical Intelligence 2Stanford University 3University of California, Berkeley. Correspondence to: Physical Intelligence <EMAIL>. |
| Pseudocode | No | The paper describes the hierarchical reasoning system and its components but does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper refers to the project website 'https://www.pi.website/research/hirobot' but does not explicitly state that the source code for the methodology described in this paper is openly available or provide a direct link to a code repository. |
| Open Datasets | No | We collect robot demonstration data Ddemo via teleoperation. This yields trajectories with coarse language annotations of the overall goal (e.g., make a sandwich). Next, we use a large vision-language model (VLM) pgen to produce synthetic user prompts and interjections ℓt, and corresponding robot utterance ut. The paper describes using its own collected and synthetically generated data but does not provide access information for a publicly available dataset used for the experiments. |
| Dataset Splits | No | The paper describes the creation of robot demonstration data Ddemo and synthetic data Dsyn for training, but it does not provide specific details regarding dataset splits (e.g., percentages, sample counts for train/validation/test sets). |
| Hardware Specification | Yes | To support real-time inference, we utilize one to two NVIDIA GeForce RTX 4090 consumer-grade GPUs. High-level policy (single decoding step): RTX 4090, 47 ms (prefill) + 13.2 ms (decode); H100, 17.3 ms (prefill) + 5.7 ms (decode). Training the high-level policy is highly efficient, requiring approximately 2 hours on 8 H100 GPUs. |
| Software Dependencies | Yes | Speech-to-text transcription is handled locally using Whisper large-v2 (Radford et al., 2023). |
| Experiment Setup | Yes | We use the AdamW optimizer (Loshchilov & Hutter, 2017) with β1 = 0.9, β2 = 0.95, and no weight decay. Gradient norms are clipped to a maximum magnitude of 1. We maintain an exponential moving average (EMA) of the network weights with a decay factor of 0.999. The learning rate is warmed up over the first 1,000 steps and then held constant at 1 × 10⁻⁵. We use a batch size of 512. |
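The experiment-setup row above lists concrete optimizer hyperparameters. As a minimal sketch of how those numbers fit together, the snippet below encodes the reported settings (AdamW β1 = 0.9, β2 = 0.95, no weight decay, gradient-norm clipping at 1, EMA decay 0.999, a 1,000-step warmup to a constant learning rate of 1e-5, batch size 512) in plain Python. The function names and the assumption of *linear* warmup are illustrative, not taken from the paper.

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
PEAK_LR = 1e-5
WARMUP_STEPS = 1_000
GRAD_CLIP = 1.0
EMA_DECAY = 0.999
BATCH_SIZE = 512
ADAM_BETAS = (0.9, 0.95)  # (beta1, beta2) for AdamW
WEIGHT_DECAY = 0.0

def learning_rate(step: int) -> float:
    """Warmup over the first 1,000 steps (assumed linear), then constant 1e-5."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR

def clip_grad_norm(grads: list[float], max_norm: float = GRAD_CLIP) -> list[float]:
    """Scale gradients so their global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads

def ema_update(ema_params: list[float], params: list[float],
               decay: float = EMA_DECAY) -> list[float]:
    """Exponential moving average of network weights with decay 0.999."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

In a real training loop these pieces would be wired into a framework optimizer (e.g. a PyTorch `AdamW` with `betas=(0.9, 0.95)` and `weight_decay=0.0`); the sketch only makes the reported schedule and clipping rule explicit.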