LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through experiments across multiple simulated and real-world tasks, we demonstrate that LLaRA achieves state-of-the-art performance while preserving the generalization capabilities of large language models."
Researcher Affiliation | Academia | 1Stony Brook University, 2University of Wisconsin-Madison
Pseudocode | No | The paper describes its methods and data generation in text and uses diagrams (e.g., Fig. 1, Fig. 2, Fig. 4) to illustrate concepts and data formats, but it does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks detailing a structured algorithm.
Open Source Code | Yes | "The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA."
Open Datasets | Yes | "We employ VIMA-Bench (Jiang et al., 2023), a simulated table-top robot manipulation environment, to evaluate VLMs trained by our instruction tuning dataset. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA."
Dataset Splits | Yes | "We uniformly subsample the VIMA dataset (Jiang et al., 2023) to form three subsets with different sizes: VIMA-0.8k, VIMA-8k, and VIMA-80k, where the number indicates the number of expert trajectories in the dataset. We train all methods on these three datasets and evaluate them with 3 levels of difficulty following the test protocol (L1 to L3)."
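The uniform subsampling described in that quote can be sketched as below. This is an illustrative assumption about the procedure (evenly spaced selection of expert trajectories), not the paper's actual code; the function name is hypothetical.

```python
def uniform_subsample(trajectories, k):
    """Pick k trajectories at evenly spaced indices from the full dataset.

    A sketch of "uniformly subsample": for VIMA, k would be 800, 8_000,
    or 80_000 to build VIMA-0.8k, VIMA-8k, and VIMA-80k respectively.
    """
    n = len(trajectories)
    if k >= n:
        return list(trajectories)
    step = n / k  # fractional stride keeps the selection evenly spread
    return [trajectories[int(i * step)] for i in range(k)]
```

For example, subsampling 10 items from a list of 100 returns the items at indices 0, 10, 20, ..., 90.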
Hardware Specification | No | In Appendix C.1 (Environment Setting), the paper states: "We utilize an xArm7 robot arm equipped with a gripper and a Logitech C140 RGB webcam positioned above the arm to gather observations." This describes the robot hardware for real-world experiments but does not specify the computational hardware (e.g., GPU models, CPU types, memory) used for training or running the models.
Software Dependencies | No | The paper mentions using a "pretrained LLaVA-1.5-7B (Liu et al., 2024b) model" and models such as GPT-4 (OpenAI, 2023) and OWLv2 (Minderer et al., 2024). However, it does not provide version numbers for underlying software dependencies such as Python, PyTorch, or CUDA that would be needed to replicate the experiment environment.
Experiment Setup | Yes | "The training settings closely align with those of the original LLaVA stage 2. Specifically, we utilize a single-cycle cosine annealing schedule with a 0.03 warm-up ratio and a maximum learning rate of 2×10⁻⁵. However, for VIMA-0.8k and VIMA-8k, we employ a batch size of 32, whereas for VIMA-80k, we restore the batch size to 128."
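The quoted schedule (linear warm-up over the first 3% of steps, then a single cosine decay from a peak of 2×10⁻⁵) can be written as a small step-to-LR function. This is a minimal sketch under those stated hyperparameters; the function name and the zero minimum learning rate are assumptions, not details from the paper.

```python
import math

def cosine_lr(step, total_steps, max_lr=2e-5, warmup_ratio=0.03, min_lr=0.0):
    """Single-cycle cosine annealing with linear warm-up.

    Matches the quoted setup: warmup_ratio=0.03, max_lr=2e-5.
    min_lr=0.0 is an assumed default.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warm-up from ~0 to max_lr over the first 3% of steps.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The learning rate peaks exactly at the end of warm-up (e.g., step 29 of 1000 with a 0.03 ratio) and decays toward zero by the final step.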