OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction

Authors: Huang Huang, Fangchen Liu, Letian Fu, Tingfan Wu, Mustafa Mukadam, Jitendra Malik, Ken Goldberg, Pieter Abbeel

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Section 4 (Experiments) details the simulation and real-world experimental setup, baselines, and evaluation of OTTER's performance.
Researcher Affiliation | Collaboration | "1University of California, Berkeley 2Meta AI. Correspondence to: Huang Huang <EMAIL>, Fangchen Liu <EMAIL>, Letian Fu <EMAIL>." The authors are affiliated with both the University of California, Berkeley (an academic institution) and Meta AI (an industry research lab), indicating a collaborative effort.
Pseudocode | No | The paper describes the methods and model architecture in prose, mathematical equations (e.g., Section 3.1, Equations 1-4), and diagrams (Figures 2 and 3), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Video, code, checkpoints, and dataset: https://ottervla.github.io/."
Open Datasets | Yes | "Video, code, checkpoints, and dataset: https://ottervla.github.io/." "We use the LIBERO benchmark (Liu et al., 2024) for simulation evaluation." OTTER is also "trained from scratch on 800K trajectories from the Open X-Embodiment dataset (Collaboration et al., 2024)."
Dataset Splits | Yes | "We use the LIBERO benchmark (Liu et al., 2024) for simulation evaluation. Specifically, we use the tasks and datasets in LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-90... Each simulation task has 50 demonstrations." "We evaluate OTTER's capabilities on both in-distribution tasks and unseen tasks... We consider 19 in-distribution training tasks and 15 out-of-distribution unseen tasks across the 4 primitives (Table 6)."
Hardware Specification | Yes | "All the models are trained on 4 NVIDIA A100 80GB GPUs." "This enables the ViT-L/14 OTTER model to perform inference at 50Hz on a single NVIDIA 3090Ti, allowing real-time control."
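The 50Hz figure is a control-loop rate (policy calls per second). A minimal, generic timing harness for checking such a claim is sketched below with a stand-in policy; the function names and dummy dimensions are illustrative, not from the OTTER codebase.

```python
import time

def measure_control_rate(policy, n_steps=200):
    """Time repeated policy calls and return the achieved rate in Hz.

    `policy` is any callable mapping an observation to an action; here
    we pass a stand-in function rather than a real model.
    """
    obs = [0.0] * 10  # dummy observation vector
    start = time.perf_counter()
    for _ in range(n_steps):
        policy(obs)
    elapsed = time.perf_counter() - start
    return n_steps / elapsed

# Stand-in policy: a trivial function, so the measured rate only
# exercises the harness, not OTTER's reported 50Hz figure.
rate = measure_control_rate(lambda obs: [0.0] * 7)
print(f"{rate:.0f} Hz")
```

In a real check, the policy call would wrap the model's forward pass (including image preprocessing) so the measured rate reflects the full control loop.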
Software Dependencies | No | The paper mentions using CLIP and refers to a "ViT Encoder based on the implementation of https://github.com/google-research/vision_transformer", but does not provide version numbers for these or other key software libraries and frameworks (e.g., PyTorch or Python versions) used in the experiments.
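Claims like this are easier to verify when a release ships a pinned environment snapshot. A stdlib-only sketch of the kind of record that is missing here; the package names passed in are illustrative examples, not a list taken from the OTTER repository.

```python
import sys
from importlib import metadata

def report_env(packages):
    """Return version pins for the interpreter and the given packages."""
    lines = [f"python=={sys.version.split()[0]}"]
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name}: not installed")
    return lines

# e.g. libraries a VLA training run typically depends on
print("\n".join(report_env(["torch", "numpy"])))
```

Committing such a snapshot (or a `pip freeze` lockfile) alongside the code would resolve this reproducibility gap.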
Experiment Setup | Yes | "Table 7: Hyperparameters for OTTER model architecture." "Table 8: Hyperparameters used for training (pre-training on OXE)." These tables provide specific values for numerous hyperparameters such as learning rate, batch size, context length, network dimensions, and image processing details.