OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
Authors: Huang Huang, Fangchen Liu, Letian Fu, Tingfan Wu, Mustafa Mukadam, Jitendra Malik, Ken Goldberg, Pieter Abbeel
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Section 4 (Experiments) further details the simulation and real-world experimental setup, baselines, and evaluation of OTTER's performance. |
| Researcher Affiliation | Collaboration | 1University of California, Berkeley 2Meta AI. Correspondence to: Huang Huang <EMAIL>, Fangchen Liu <EMAIL>, Letian Fu <EMAIL>. The authors are affiliated with both the University of California, Berkeley (an academic institution) and Meta AI (an industry research lab), indicating a collaborative effort. |
| Pseudocode | No | The paper describes the methods and model architecture in prose and mathematical equations (e.g., Section 3.1, Equation 1-4) and diagrams (Figure 2, 3), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Video, code, checkpoints, and dataset: https://ottervla.github.io/. |
| Open Datasets | Yes | Video, code, checkpoints, and dataset: https://ottervla.github.io/. We use the LIBERO benchmark (Liu et al., 2024) for simulation evaluation... trained from scratch on 800K trajectories from the Open X-Embodiment dataset (Collaboration et al., 2024). |
| Dataset Splits | Yes | We use the LIBERO benchmark (Liu et al., 2024) for simulation evaluation. Specifically, we use the tasks and datasets in LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-90... Each simulation task has 50 demonstrations. We evaluate OTTER's capabilities on both in-distribution tasks and unseen tasks... We consider 19 in-distribution training tasks and 15 out-of-distribution unseen tasks across the 4 primitives (Table 6). |
| Hardware Specification | Yes | All the models are trained on 4 NVIDIA A100 80GB GPUs. This enables the ViT-L/14 OTTER model to perform inference at 50Hz on a single NVIDIA 3090Ti, allowing real-time control. |
| Software Dependencies | No | The paper mentions using "CLIP" and refers to a "ViT Encoder based on the implementation of https://github.com/google-research/vision_transformer" but does not provide specific version numbers for these or other key software libraries or frameworks (e.g., PyTorch, Python versions) used in the experiments. |
| Experiment Setup | Yes | Table 7: Hyperparameters for OTTER model architecture. Table 8: Hyperparameters used for training (pre-training on OXE). These tables provide specific values for numerous hyperparameters such as learning rate, batch size, context length, network dimensions, and image processing details. |