Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
Authors: Lucy Xiaoyang Shi, Brian Ichter, Michael Robert Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, Chelsea Finn
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our system across three robotic platforms, including single-arm, dual-arm, and dual-arm mobile robots, demonstrating its ability to handle tasks such as cleaning messy tables, making sandwiches, and grocery shopping. |
| Researcher Affiliation | Collaboration | 1Physical Intelligence 2Stanford University 3University of California, Berkeley. Correspondence to: Physical Intelligence <EMAIL>. |
| Pseudocode | No | The paper describes the hierarchical reasoning system and its components but does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper refers to the project website 'https://www.pi.website/research/hirobot' but does not explicitly state that the source code for the methodology described in this paper is openly available or provide a direct link to a code repository. |
| Open Datasets | No | We collect robot demonstration data Ddemo via teleoperation. This yields trajectories with coarse language annotations of the overall goal (e.g., make a sandwich). Next, we use a large vision-language model (VLM) pgen to produce synthetic user prompts and interjections ℓt, and corresponding robot utterance ut. The paper describes using its own collected and synthetically generated data but does not provide access information for a publicly available dataset used for the experiments. |
| Dataset Splits | No | The paper describes the creation of robot demonstration data Ddemo and synthetic data Dsyn for training, but it does not provide specific details regarding dataset splits (e.g., percentages, sample counts for train/validation/test sets). |
| Hardware Specification | Yes | To support real-time inference, we utilize one to two NVIDIA GeForce RTX 4090 consumer-grade GPUs. High-level policy (single decoding step): RTX 4090, 47 ms (prefill) + 13.2 ms (decode); H100, 17.3 ms (prefill) + 5.7 ms (decode). Training the high-level policy is highly efficient, requiring approximately 2 hours on 8 H100 GPUs. |
| Software Dependencies | Yes | Speech-to-text transcription is handled locally using Whisper large-v2 (Radford et al., 2023). |
| Experiment Setup | Yes | We use the AdamW optimizer (Loshchilov & Hutter, 2017) with β1 = 0.9, β2 = 0.95, and no weight decay. Gradient norms are clipped to a maximum magnitude of 1. We maintain an exponential moving average (EMA) of the network weights with a decay factor of 0.999. The learning rate is warmed up over the first 1,000 steps and then held constant at 1 × 10⁻⁵. We use a batch size of 512. |
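The experiment-setup row above lists concrete optimizer hyperparameters. As a minimal sketch of how those numbers fit together, the snippet below encodes the reported settings (AdamW β1 = 0.9, β2 = 0.95, no weight decay, gradient-norm clipping at 1, EMA decay 0.999, a 1,000-step warmup to a constant learning rate of 1e-5, batch size 512) in plain Python. The function names and the assumption of *linear* warmup are illustrative, not taken from the paper.

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
PEAK_LR = 1e-5
WARMUP_STEPS = 1_000
GRAD_CLIP = 1.0
EMA_DECAY = 0.999
BATCH_SIZE = 512
ADAM_BETAS = (0.9, 0.95)  # (beta1, beta2) for AdamW
WEIGHT_DECAY = 0.0

def learning_rate(step: int) -> float:
    """Warmup over the first 1,000 steps (assumed linear), then constant 1e-5."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR

def clip_grad_norm(grads: list[float], max_norm: float = GRAD_CLIP) -> list[float]:
    """Scale gradients so their global L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads

def ema_update(ema_params: list[float], params: list[float],
               decay: float = EMA_DECAY) -> list[float]:
    """Exponential moving average of network weights with decay 0.999."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

In a real training loop these pieces would be wired into a framework optimizer (e.g. a PyTorch `AdamW` with `betas=(0.9, 0.95)` and `weight_decay=0.0`); the sketch only makes the reported schedule and clipping rule explicit.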