LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments across multiple simulated and real-world tasks, we demonstrate that LLaRA achieves state-of-the-art performance while preserving the generalization capabilities of large language models. |
| Researcher Affiliation | Academia | ¹Stony Brook University, ²University of Wisconsin-Madison |
| Pseudocode | No | The paper describes methods and data generation in text and uses diagrams (e.g., Fig. 1, Fig. 2, Fig. 4) to illustrate concepts and data formats. However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures that detail a structured algorithm. |
| Open Source Code | Yes | The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA. |
| Open Datasets | Yes | We employ VIMA-Bench (Jiang et al., 2023), a simulated table-top robot manipulation environment, to evaluate VLMs trained by our instruction tuning dataset. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA. |
| Dataset Splits | Yes | We uniformly subsample the VIMA dataset (Jiang et al., 2023) to form three subsets with different sizes: VIMA-0.8k, VIMA-8k, and VIMA-80k where the number indicates the number of expert trajectories in the dataset. We train all methods on these three datasets and evaluate them with 3 levels of difficulties following the test protocol (L1 to L3). |
| Hardware Specification | No | In Appendix C.1 Environment Setting, the paper states: "We utilize an xArm7 robot arm equipped with a gripper and a Logitech C140 RGB webcam positioned above the arm to gather observations." This describes the robot hardware for real-world experiments but does not provide specific details about the computational hardware (e.g., GPU models, CPU types, memory) used for training or running the models. |
| Software Dependencies | No | The paper mentions using a 'pretrained LLaVA-1.5-7B (Liu et al., 2024b) model' and frameworks/models like 'GPT-4 (OpenAI, 2023)' and 'OWLv2 (Minderer et al., 2024)'. However, it does not provide specific version numbers for underlying software dependencies such as Python, PyTorch, CUDA, or other libraries that would be necessary to replicate the experiment environment. |
| Experiment Setup | Yes | The training settings closely align with those of the original LLaVA stage 2. Specifically, we utilize a single-cycle cosine annealing schedule with a 0.03 warm-up ratio and a maximum learning rate of 2×10⁻⁵. However, for VIMA-0.8k and VIMA-8k, we employ a batch size of 32, whereas for VIMA-80k, we restore the batch size to 128. |
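The learning-rate schedule quoted in the Experiment Setup row (single-cycle cosine annealing, 0.03 warm-up ratio, peak learning rate 2×10⁻⁵) can be sketched as below. This is a minimal illustration of that schedule shape, assuming a linear warm-up phase; the function name `lr_at_step` and its exact warm-up form are our assumptions, not the authors' code.

```python
import math

def lr_at_step(step, total_steps, max_lr=2e-5, warmup_ratio=0.03):
    """Single-cycle cosine annealing with linear warm-up (illustrative sketch).

    Ramps linearly from ~0 to max_lr over the first warmup_ratio fraction
    of training, then decays to 0 along one half-cosine cycle.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warm-up toward the peak learning rate.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```

For example, with 1000 total steps the peak of 2e-5 is reached at step 29 (the end of the 3% warm-up), after which the rate falls smoothly toward zero.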