OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
Authors: Zehan Wang, Ziang Zhang, Minjie Hong, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Shengpeng Ji, Tao Jin, Hengshuang Zhao, Zhou Zhao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the versatility of OmniBind as an omni representation model, highlighting its great potential for diverse applications, such as any-query and composable multimodal understanding. We conduct quantitative experiments across 14 benchmarks covering 7 downstream tasks, as summarized in Tab. 1. Table 2: Cross-modal retrieval results. Table 3: Zero-shot classification results. Table 4: Ablation study of manual weights and the two weight routing objectives: L_align and L_dec. |
| Researcher Affiliation | Collaboration | Zhejiang University; Shanghai AI Lab; The University of Hong Kong |
| Pseudocode | No | The paper describes the methodology using prose, mathematical equations (e.g., Eq. 1-7), and a high-level pipeline diagram (Figure 2), but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | All checkpoints for the different versions of OmniBind will be open-sourced. |
| Open Datasets | Yes | To construct the pseudo-paired data, we collect unpaired 3D point, audio, vision, and text data from the training set of existing datasets. For 3D data, we use the 800k 3D point clouds from Objaverse (Deitke et al., 2023). The audio and image data come from AudioSet (Gemmeke et al., 2017) and ImageNet (Deng et al., 2009) respectively. The text data sources from three kinds of datasets: 3D-text (Liu et al., 2024b), visual-text (Lin et al., 2014; Sharma et al., 2018) and audio-text (Kim et al., 2019; Drossos et al., 2020) datasets. |
| Dataset Splits | Yes | To construct the pseudo-paired data, we collect unpaired 3D point, audio, vision, and text data from the training set of existing datasets. Based on these unpaired unimodal data, we employ state-of-the-art audio-text (WavCaps (Mei et al., 2023)), image-text (EVA-CLIP-18B (Sun et al., 2024)), audio-image (ImageBind (Girdhar et al., 2023)) and 3D-image-text (Uni3D (Zhou et al., 2023)) models to retrieve the pseudo item pairs, as discussed in Sec. 3.1. The paper also evaluates on 14 benchmarks listed in Table 1, which are standard for these datasets, implying the use of predefined test/evaluation splits. |
| Hardware Specification | Yes | The entire training process can be completed with only several million unpaired data points using 4090 GPUs. |
| Software Dependencies | No | The paper does not explicitly list software dependencies with specific version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The temperature factors in contrastive losses are 0.03, and the λ in Eq. 7 is 3. |
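The pseudo-pair construction quoted above retrieves cross-modal pairs by nearest-neighbor search in a pretrained joint embedding space (e.g. ImageBind for audio-image). A minimal sketch of that retrieval step, using toy random embeddings in place of real encoder outputs (the embeddings and dimensions here are illustrative assumptions, not the paper's actual setup):

```python
import numpy as np

def retrieve_pseudo_pairs(query_emb, gallery_emb, top_k=1):
    """For each query embedding, return indices of the top-k most
    cosine-similar gallery embeddings. A hypothetical stand-in for
    OmniBind's pseudo-pair retrieval with pretrained encoders."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T  # (n_query, n_gallery) cosine similarities
    return np.argsort(-sims, axis=1)[:, :top_k]

# Toy example: "image" embeddings are slightly perturbed copies of the
# "audio" embeddings, so each audio item should retrieve its own image.
rng = np.random.default_rng(0)
audio = rng.normal(size=(5, 8))
image = audio + 0.01 * rng.normal(size=(5, 8))
pairs = retrieve_pseudo_pairs(audio, image)
```

In practice the gallery would be millions of items, so the exhaustive similarity matrix above would be replaced by an approximate nearest-neighbor index.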
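The reported experiment setup fixes the contrastive temperature at 0.03. A minimal sketch of a symmetric InfoNCE-style contrastive loss with that temperature follows; note that this is a generic formulation, not the paper's Eq. 7 (whose exact composition, including how λ = 3 weights the two routing objectives, is not reproduced here):

```python
import numpy as np

def info_nce(za, zb, temperature=0.03):
    """Symmetric InfoNCE loss over a batch of paired embeddings.
    temperature=0.03 matches the value reported in the paper; the
    overall objective shape is an assumption for illustration."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature  # (batch, batch) similarity logits
    labels = np.arange(len(za))       # matched pairs lie on the diagonal

    def ce(l):
        # numerically stable cross-entropy against diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(1)
z = rng.normal(size=(8, 16))
matched = info_nce(z, z)             # correct pairing: low loss
mismatched = info_nce(z, z[::-1])    # scrambled pairing: high loss
```

A lower temperature (0.03 vs. the common 0.07) sharpens the softmax over negatives, which penalizes hard negatives more strongly.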