MM-Mixing: Multi-Modal Mixing Alignment for 3D Understanding

Authors: Jiaze Wang, Yi Wang, Ziyu Guo, Renrui Zhang, Donghao Zhou, Guangyong Chen, Anfeng Liu, Pheng-Ann Heng

AAAI 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that MM-Mixing significantly improves baseline performance across various learning scenarios, including zero-shot 3D classification, linear-probing 3D classification, and cross-modal 3D shape retrieval. Notably, we improved the zero-shot classification accuracy on ScanObjectNN from 51.3% to 61.9%, and on Objaverse-LVIS from 46.8% to 51.4%. Our findings highlight the potential of multi-modal mixing-based alignment to significantly advance 3D object recognition and understanding while remaining straightforward to implement and integrate into existing frameworks. The paper includes sections such as 'Experiments', 'Experimental Setup', 'Zero-shot 3D Classification', 'Linear Probing 3D Classification', and 'Ablation Study', and presents detailed performance tables (Tables 1, 2, 3) and figures (Figures 1, 3, 4) showing quantitative results.
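For context on the zero-shot protocol this row refers to: in CLIP-style 3D pipelines of this kind, a test shape is classified by embedding it with the frozen 3D encoder and picking the class whose text-prompt embedding has the highest cosine similarity. A minimal numpy sketch, with the encoders mocked by random unit vectors (all dimensions and names here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Scale rows to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

D = 512  # illustrative embedding width
# Stand-ins for the frozen encoders' outputs: one text embedding per class
# prompt (e.g. "a point cloud of a {label}") and one embedding per test shape.
text_emb = l2_normalize(rng.normal(size=(40, D)))   # 40 classes, as in ModelNet40
shape_emb = l2_normalize(rng.normal(size=(8, D)))   # 8 test shapes

# Zero-shot prediction: cosine similarity, then argmax over class prompts.
logits = shape_emb @ text_emb.T        # (8, 40) similarity matrix
pred = logits.argmax(axis=1)           # predicted class index per shape

labels = rng.integers(0, 40, size=8)   # synthetic ground truth
acc = (pred == labels).mean()          # top-1 zero-shot accuracy
```

With real encoders, replacing the random matrices with actual embeddings is the entire change; the comparison step itself stays this simple.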
Researcher Affiliation | Collaboration | Jiaze Wang*2, Yi Wang*1, Ziyu Guo2, Renrui Zhang2, Donghao Zhou2, Guangyong Chen3, Anfeng Liu1, Pheng-Ann Heng2. 1 Central South University; 2 The Chinese University of Hong Kong; 3 Zhejiang Lab. Corresponding author: EMAIL. Corresponding author: EMAIL.
Pseudocode | No | The paper describes the methodology using text and mathematical formulas (Eqs. 1-4) within the 'Method' and 'MM-Mixing Framework' sections, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a link to source code, nor does it state that the code will be made available in supplementary materials or upon request.
Open Datasets | Yes | Pre-training datasets: the model is pre-trained using triplets generated from four key datasets: ShapeNetCore (Chang et al. 2015), 3D-FUTURE (Fu et al. 2021), ABO (Collins et al. 2022), and Objaverse (Deitke et al. 2023). Evaluation datasets: the Objaverse-LVIS dataset (Deitke et al. 2023), which is part of the evaluation; additionally, ModelNet40 (Wu et al. 2015) and the ScanObjectNN (Uy et al. 2019) dataset are included.
Dataset Splits | Yes | Specifically, the ShapeNet training set is composed entirely of triplets from the ShapeNetCore dataset, which includes 52,470 3D shapes along with their associated images and text descriptions. ModelNet40 (Wu et al. 2015) has a test split of 2,468 shapes. The ScanObjectNN (Uy et al. 2019) dataset provides multiple variants such as OBJ-BG, OBJ-ONLY, and PB-T50-RS.
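The linear-probing protocol mentioned in the results row complements the zero-shot one: the 3D encoder stays frozen and only a linear classifier is trained on its features for the target split. A minimal sketch using one-hot least-squares as the linear probe, on synthetic features (the class structure, sizes, and feature dimension are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)
num_classes, dim = 10, 64

def synth_features(n):
    """Synthetic stand-in for frozen-encoder features: class mean plus noise."""
    y = rng.integers(0, num_classes, size=n)
    means = rng.normal(size=(num_classes, dim))  # one mean vector per class
    X = 3.0 * means[y] + rng.normal(size=(n, dim))
    return X, y

X_train, y_train = synth_features(200)
X_test, y_test = synth_features(50)

# Linear probe: regress one-hot labels with the pseudo-inverse, W = X^+ Y.
Y = np.eye(num_classes)[y_train]          # (200, 10) one-hot targets
W = np.linalg.pinv(X_train) @ Y           # (64, 10) probe weights
acc = ((X_test @ W).argmax(axis=1) == y_test).mean()
```

Note the means are drawn independently inside each `synth_features` call, so this toy accuracy only illustrates the mechanics; a real probe (often logistic regression) would be fit on actual frozen features from the shared train/test splits above.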
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper does not list the ancillary software with version numbers (e.g., library or solver names with versions) needed to replicate the experiments.
Experiment Setup | No | Further implementation details for pre-training and evaluation are deferred to the Appendix; the main text does not contain specific hyperparameters or training configurations.