OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces

Authors: Zehan Wang, Ziang Zhang, Minjie Hong, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Shengpeng Ji, Tao Jin, Hengshuang Zhao, Zhou Zhao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the versatility of OmniBind as an omni representation model, highlighting its great potential for diverse applications, such as any-query and composable multimodal understanding. We conduct quantitative experiments across 14 benchmarks covering 7 downstream tasks, as summarized in Tab. 1. Table 2: Cross-modal retrieval results. Table 3: Zero-shot classification results. Table 4: Ablation study of manual weights and the two weight-routing objectives: L_align and L_dec.
Researcher Affiliation | Collaboration | 1Zhejiang University; 2Shanghai AI Lab; 3The University of Hong Kong
Pseudocode | No | The paper describes the methodology using prose, mathematical equations (e.g., Eq. 1-7), and a high-level pipeline diagram (Figure 2), but does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | All checkpoints for the different versions of OmniBind will be open-sourced.
Open Datasets | Yes | To construct the pseudo-paired data, we collect unpaired 3D point, audio, vision, and text data from the training sets of existing datasets. For 3D data, we use the 800k 3D point clouds from Objaverse (Deitke et al., 2023). The audio and image data come from AudioSet (Gemmeke et al., 2017) and ImageNet (Deng et al., 2009) respectively. The text data comes from three kinds of datasets: 3D-text (Liu et al., 2024b), visual-text (Lin et al., 2014; Sharma et al., 2018), and audio-text (Kim et al., 2019; Drossos et al., 2020) datasets.
Dataset Splits | Yes | To construct the pseudo-paired data, we collect unpaired 3D point, audio, vision, and text data from the training sets of existing datasets. Based on these unpaired unimodal data, we employ state-of-the-art audio-text (WavCaps (Mei et al., 2023)), image-text (EVA-CLIP-18B (Sun et al., 2024)), audio-image (ImageBind (Girdhar et al., 2023)), and 3D-image-text (Uni3D (Zhou et al., 2023)) models to retrieve the pseudo item pairs, as discussed in Sec. 3.1. The paper also evaluates on 14 benchmarks listed in Table 1, which are standard for these datasets, implying the use of predefined test/evaluation splits.
Hardware Specification | Yes | The entire training process can be completed with only several million unpaired data points using 4090 GPUs.
Software Dependencies | No | The paper does not explicitly list software dependencies with specific version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | The temperature factors in contrastive losses are 0.03, and the λ in Eq. 7 is 3.
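The pseudo-pair construction quoted above (retrieving matched items across modalities with pretrained cross-modal models) can be sketched as a nearest-neighbor search over embedding similarities. This is a minimal illustration, not the paper's implementation: the function name and the cosine-similarity NumPy formulation are assumptions.

```python
import numpy as np

def retrieve_pseudo_pairs(src_emb, tgt_emb, top_k=1):
    """Form pseudo pairs between two unpaired modalities.

    For each source embedding (e.g., audio from AudioSet), retrieve the
    top-k most cosine-similar target embeddings (e.g., captions), as a
    pretrained cross-modal model like ImageBind or WavCaps would score them.
    src_emb: (Ns, d), tgt_emb: (Nt, d). Illustrative helper, not from the paper.
    """
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = s @ t.T                                  # (Ns, Nt) cosine similarities
    idx = np.argsort(-sim, axis=1)[:, :top_k]      # best targets per source
    return [(i, int(j)) for i in range(len(s)) for j in idx[i]]
```

In the paper's setting, the embeddings would come from the quoted pretrained retrievers (WavCaps, EVA-CLIP-18B, ImageBind, Uni3D) rather than being trained from scratch.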
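The experiment-setup row pins down two hyperparameters: a temperature of 0.03 in the contrastive losses and λ = 3 in Eq. 7. A minimal sketch of a symmetric temperature-scaled contrastive (InfoNCE-style) loss with those values follows; the NumPy implementation and function name are illustrative assumptions, not the paper's code, and the form of L_dec in Eq. 7 is not specified here.

```python
import numpy as np

TEMPERATURE = 0.03  # temperature factor in the contrastive losses (from the paper)
LAMBDA = 3.0        # lambda weighting the second objective in Eq. 7 (from the paper)

def contrastive_loss(x, y, tau=TEMPERATURE):
    """Symmetric InfoNCE loss over a batch of matched embeddings x, y (N, d).

    Row i of x is the positive for row i of y; all other rows are negatives.
    Illustrative sketch, not the paper's implementation.
    """
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    logits = (x @ y.T) / tau                       # (N, N) scaled similarities
    diag = np.arange(len(x))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()            # positives on the diagonal

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Eq. 7 then combines the alignment loss with a second objective, weighted by
# lambda = 3: total = l_align + LAMBDA * l_dec  (l_dec's form is not shown here).
```

With perfectly aligned batches, the low temperature makes the positive logits dominate and the loss approaches zero.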