OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
Authors: Zehan Wang, Ziang Zhang, Minjie Hong, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Shengpeng Ji, Tao Jin, Hengshuang Zhao, Zhou Zhao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the versatility of OmniBind as an omni representation model, highlighting its great potential for diverse applications, such as any-query and composable multimodal understanding. We conduct quantitative experiments across 14 benchmarks covering 7 downstream tasks, as summarized in Tab. 1. Table 2: Cross-modal retrieval results. Table 3: Zero-shot classification results. Table 4: Ablation study of manual weights and the two weight routing objectives: L_align and L_dec. |
| Researcher Affiliation | Collaboration | Zhejiang University; Shanghai AI Lab; The University of Hong Kong |
| Pseudocode | No | The paper describes the methodology using prose, mathematical equations (e.g., Eq. 1-7), and a high-level pipeline diagram (Figure 2), but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | All checkpoints for the different versions of OmniBind will be open-sourced. |
| Open Datasets | Yes | To construct the pseudo-paired data, we collect unpaired 3D point, audio, vision, and text data from the training set of existing datasets. For 3D data, we use the 800k 3D point clouds from Objaverse (Deitke et al., 2023). The audio and image data come from AudioSet (Gemmeke et al., 2017) and ImageNet (Deng et al., 2009) respectively. The text data sources from three kinds of datasets: 3D-text (Liu et al., 2024b), visual-text (Lin et al., 2014; Sharma et al., 2018) and audio-text (Kim et al., 2019; Drossos et al., 2020) datasets. |
| Dataset Splits | Yes | To construct the pseudo-paired data, we collect unpaired 3D point, audio, vision, and text data from the training set of existing datasets. Based on these unpaired unimodal data, we employ state-of-the-art audio-text (WavCaps (Mei et al., 2023)), image-text (EVA-CLIP-18B (Sun et al., 2024)), audio-image (ImageBind (Girdhar et al., 2023)) and 3D-image-text (Uni3D (Zhou et al., 2023)) models to retrieve the pseudo item pairs, as discussed in Sec. 3.1. The paper also evaluates on 14 benchmarks listed in Table 1, which are standard for these datasets, implying the use of predefined test/evaluation splits. |
| Hardware Specification | Yes | The entire training process can be completed with only several million unpaired data points using 4090 GPUs. |
| Software Dependencies | No | The paper does not explicitly list software dependencies with specific version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The temperature factors in contrastive losses are 0.03, and the λ in Eq. 7 is 3. |
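The pseudo-pair construction quoted above retrieves cross-modal pairs by nearest-neighbor search in a pretrained joint embedding space (e.g. ImageBind for audio-image). A minimal sketch of that retrieval step, using toy random embeddings in place of real encoder outputs (the embeddings and dimensions here are illustrative assumptions, not the paper's actual setup):

```python
import numpy as np

def retrieve_pseudo_pairs(query_emb, gallery_emb, top_k=1):
    """For each query embedding, return indices of the top-k most
    cosine-similar gallery embeddings. A hypothetical stand-in for
    OmniBind's pseudo-pair retrieval with pretrained encoders."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T  # (n_query, n_gallery) cosine similarities
    return np.argsort(-sims, axis=1)[:, :top_k]

# Toy example: "image" embeddings are slightly perturbed copies of the
# "audio" embeddings, so each audio item should retrieve its own image.
rng = np.random.default_rng(0)
audio = rng.normal(size=(5, 8))
image = audio + 0.01 * rng.normal(size=(5, 8))
pairs = retrieve_pseudo_pairs(audio, image)
```

In practice the gallery would be millions of items, so the exhaustive similarity matrix above would be replaced by an approximate nearest-neighbor index.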
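The reported experiment setup fixes the contrastive temperature at 0.03. A minimal sketch of a symmetric InfoNCE-style contrastive loss with that temperature follows; note that this is a generic formulation, not the paper's Eq. 7 (whose exact composition, including how λ = 3 weights the two routing objectives, is not reproduced here):

```python
import numpy as np

def info_nce(za, zb, temperature=0.03):
    """Symmetric InfoNCE loss over a batch of paired embeddings.
    temperature=0.03 matches the value reported in the paper; the
    overall objective shape is an assumption for illustration."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature  # (batch, batch) similarity logits
    labels = np.arange(len(za))       # matched pairs lie on the diagonal

    def ce(l):
        # numerically stable cross-entropy against diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(1)
z = rng.normal(size=(8, 16))
matched = info_nce(z, z)             # correct pairing: low loss
mismatched = info_nce(z, z[::-1])    # scrambled pairing: high loss
```

A lower temperature (0.03 vs. the common 0.07) sharpens the softmax over negatives, which penalizes hard negatives more strongly.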