Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

Authors: Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct comprehensive experiments across multiple vision-language benchmarks to demonstrate the effectiveness of our method. In this section, we present the main results on general video understanding (Sec. 4.1), long-form video benchmarks (Sec. 4.2), and 2D & 3D spatial understanding (Sec. 4.3). Finally, we provide analysis experiments and critical ablation studies on design elements.
Researcher Affiliation | Collaboration | Zuyan Liu (1,2), Yuhao Dong (2,3), Ziwei Liu (3), Winston Hu (2), Jiwen Lu (1), Yongming Rao (2,1); 1 Tsinghua University, 2 Tencent, 3 S-Lab, NTU
Pseudocode | No | The paper describes its methods in Section 3, "Oryx Architecture: MLLM with Native and Flexible Visual Inputs", and Section 3.1.2, "On-demand Dynamic Compression Supporting Long Visual Context", using descriptive text and a mathematical formulation (Equation 1), but no structured pseudocode or algorithm blocks are present.
Open Source Code | No | The paper discusses the use of open-source datasets and compares against other open-source models, but it provides no statement or link indicating that the source code for the Oryx model itself is publicly available or released.
Open Datasets | Yes | We utilize the MovieNet (Huang et al., 2020) dataset... Specifically, we utilize Track-Anything (Yang et al., 2023b) as our tracking model to generate coarse correspondences for the ScanQA training set... This data is sourced from various open-source academic datasets, including LLaVA-NeXT (Liu et al., 2024c), Cauldron (Laurençon et al., 2024), and Cambrian-1 (Tong et al., 2024)... VideoChatGPT-Plus (Maaz et al., 2024), ShareGPT4Video (Chen et al., 2024a) and LLaVA-Hound (Zhang et al., 2024b)... CinePile (Rawal et al., 2024), NExT-QA (Xiao et al., 2021) and Perception Test (Patraucean et al., 2024)... ScanQA (Azuma et al., 2022) training dataset... For image captioning, we used the CapsFusion (Yu et al., 2024) datasets, and for OCR tasks, we employed synthesized OCR data pairs with OCR models. We start from a well-trained vision tower Oryx-ViT and a Large Language Model. The first stage involves only image data following common practice (Liu et al., 2024d;b).
Dataset Splits | Yes | We evaluate the Oryx model on a wide range of multi-modal benchmarks, demonstrating remarkable performance in both spatial and temporal understanding across image, video, and multi-view 3D data. Notably, the Oryx model excels in general and long-form video comprehension... on several benchmarks, including NExT-QA (Xiao et al., 2021), Perception Test (Patraucean et al., 2024), MMBench-Video (Fang et al., 2024), and MVBench (Li et al., 2024c) for general video understanding, and MLVU (Zhou et al., 2024) and LongVideoBench (Wu et al., 2024) for long-form video benchmarks.
Hardware Specification | Yes | We tested the inference speed with an input image size of 1280 × 1280 on one NVIDIA A100 GPU. We observed that Oryx-ViT is only 7% slower than SigLIP with the dynamic partition approach... Our experiment was conducted on an NVIDIA A100 GPU, using square-shaped input images... We adopt a total batch size of 128 and conduct our experiments on 64 NVIDIA A100-40G GPUs for Oryx-7B and 64 NVIDIA A800-80G GPUs for Oryx-34B, as larger models need more GPU memory.
Software Dependencies | No | We implement variable-length self-attention using the highly optimized FlashAttention (Dao et al., 2022) library. This allows the inference throughput of our arbitrary-resolution visual encoder to remain comparable to the dynamic partition approach used in previous methods. Although FlashAttention is mentioned, no specific version number for it or for any other software library is provided.
Experiment Setup | Yes | We set the batch size to 2048 and used a similar cross-entropy loss as in the main stages... We adopt a total training batch size of 256 and an overall learning rate of 1e-3... We set the learning rate at 2e-5 for Oryx-7B and at 1e-5 for Oryx-34B. We adopt a total batch size of 128 and conduct our experiments on 64 NVIDIA A100-40G GPUs for Oryx-7B and 64 NVIDIA A800-80G GPUs for Oryx-34B... The total model maximum length is set to 8192. For stage 2... a total batch size of 128, a learning rate of 2e-5 for Oryx-7B, and a learning rate of 1e-5 for Oryx-34B and Oryx-1.5-32B... The maximum sequence length is set to 16384.
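The Pseudocode row notes that the paper's on-demand dynamic compression (Section 3.1.2) is described only in prose and an equation. As a rough illustration of the general idea, the sketch below downsamples a grid of visual tokens by average pooling, with a per-input compression ratio. The function name, the pooling operator, and the example ratios are assumptions for illustration only, not a reconstruction of the paper's Equation 1.

```python
import numpy as np

def compress_visual_tokens(tokens: np.ndarray, ratio: int) -> np.ndarray:
    """Downsample an (H, W, C) grid of visual tokens by `ratio` per side
    via average pooling -- a generic stand-in for on-demand compression
    of long visual contexts (hypothetical, not the paper's operator)."""
    h, w, c = tokens.shape
    assert h % ratio == 0 and w % ratio == 0, "token grid must divide evenly"
    # Group ratio x ratio neighborhoods, then average each group.
    return tokens.reshape(h // ratio, ratio, w // ratio, ratio, c).mean(axis=(1, 3))

# Longer visual contexts (e.g. many video frames) would get a higher ratio,
# shorter ones (a single image) a lower one -- chosen on demand per input.
grid = np.arange(8 * 8 * 4, dtype=np.float64).reshape(8, 8, 4)
print(compress_visual_tokens(grid, 2).shape)   # (4, 4, 4)
print(compress_visual_tokens(grid, 4).shape)   # (2, 2, 4)
```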
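The Software Dependencies row quotes the use of FlashAttention for variable-length self-attention over packed arbitrary-resolution patch sequences. A minimal NumPy sketch of the same computation is shown below: sequences of different lengths are concatenated without padding, delimited by cumulative lengths (`cu_seqlens`, mirroring the shape of FlashAttention's varlen interface), and each sequence attends only within itself. This is an assumed reference implementation for clarity, not the paper's optimized kernel.

```python
import numpy as np

def varlen_attention(q, k, v, cu_seqlens):
    """Padding-free self-attention over packed sequences of different lengths.
    q, k, v: (total_tokens, d) arrays; cu_seqlens: cumulative sequence
    boundaries, e.g. [0, 9, 13] for two sequences of 9 and 4 tokens.
    Equivalent to block-diagonal attention: no token attends across sequences."""
    d = q.shape[-1]
    out = np.empty_like(q)
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        scores = q[start:end] @ k[start:end].T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        out[start:end] = probs @ v[start:end]
    return out

# Two images of different resolutions -> 9 and 4 patch tokens, packed together.
rng = np.random.default_rng(0)
q = rng.standard_normal((13, 16))
k = rng.standard_normal((13, 16))
v = rng.standard_normal((13, 16))
out = varlen_attention(q, k, v, cu_seqlens=[0, 9, 13])
print(out.shape)  # (13, 16)
```

Because each sequence is processed independently, the output for the first image is identical whether or not the second image is packed alongside it, which is what makes padding-free batching of native-resolution inputs possible.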
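The hyperparameters quoted under Experiment Setup can be collected into a single configuration sketch. The stage names and dict layout below are assumptions made for readability; the numeric values are the ones reported in the paper.

```python
# Reported Oryx training hyperparameters gathered into one config dict.
# Stage names and field names are illustrative; numbers are as quoted.
ORYX_TRAIN_CONFIG = {
    "vision_tower": {"batch_size": 2048},  # cross-entropy loss, as in main stages
    "pretrain":     {"batch_size": 256, "lr": 1e-3},
    "stage1": {
        "batch_size": 128,
        "lr": {"Oryx-7B": 2e-5, "Oryx-34B": 1e-5},
        "max_model_len": 8192,
        "gpus": {"Oryx-7B": "64x NVIDIA A100-40G", "Oryx-34B": "64x NVIDIA A800-80G"},
    },
    "stage2": {
        "batch_size": 128,
        "lr": {"Oryx-7B": 2e-5, "Oryx-34B": 1e-5, "Oryx-1.5-32B": 1e-5},
        "max_seq_len": 16384,
    },
}

for stage, cfg in ORYX_TRAIN_CONFIG.items():
    print(stage, "->", cfg.get("batch_size"))
```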