Duoduo CLIP: Efficient 3D Understanding with Multi-View Images
Authors: Han-Hung Lee, Yiming Zhang, Angel Chang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To compare our Duoduo CLIP to prior work, we conduct experiments showing shape classification on a collection of 3D assets (on Objaverse-LVIS in Sec. 4.2.1), real-world multi-view images (on MVImgNet in Sec. 4.2.2), and more fine-grained text-to-shape retrieval (Sec. 4.3). We also show image-to-shape retrieval results for out-of-distribution images (App. A.4.1) and concept mixing (App. A.4.2), where we retrieve shapes that maximize similarity to two input images. We first describe the experimental setup including implementation details and the datasets we use. |
| Researcher Affiliation | Academia | Han-Hung Lee1,* Yiming Zhang1,* Angel X. Chang1,2 (1Simon Fraser University, 2Canada CIFAR AI Chair, Amii) |
| Pseudocode | No | The paper describes methods using mathematical equations and text, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/3dlg-hcvc/DuoduoCLIP |
| Open Datasets | Yes | We follow OpenShape (Liu et al., 2023a) and train our model on a combination of four datasets consisting of Objaverse (Deitke et al., 2023), ABO (Collins et al., 2022), ShapeNet (Chang et al., 2015) and 3D-FUTURE (Fu et al., 2021). ... We leverage MVImgNet (Yu et al., 2023), a dataset comprising multi-view images for 220k real-world objects across 238 categories. ... ScanObjectNN (Uy et al., 2019) provides point clouds derived from objects in SceneNN (Hua et al., 2016) and ScanNet (Dai et al., 2017). |
| Dataset Splits | Yes | In total, the combined dataset contains 874k shapes, with 46k shapes within the LVIS subset of Objaverse which are used for evaluation. ... We filter those with at least 12 views, evenly sampled, resulting in 190k objects. ... we retain 66k objects for the training split and 16k for the validation set, spanning 180 categories. ... We evaluate on the test set, which contains 583 shapes across 15 categories... Additionally, we use the ScanNet validation set, consisting of 3825 objects across 17 classes... |
| Hardware Specification | Yes | Table 1 compares training time and GPUs used for our method and recent point cloud based methods: OpenShape (Liu et al., 2023a): 1× A100 (80GB), 300 hr; Uni3D (Zhou et al., 2024): 24× A100 (40GB), 20 hr; RECON++ (Qi et al., 2024): 8× A800 (80GB), 1 day; Ours (Full): 4× A40 (48GB), 14.3 hr; Ours (6 layers): 4× A5000 (24GB), 14.3 hr. |
| Software Dependencies | No | The pretrained CLIP model used for initialization as well as the contrastive target is the ViT-B/32 CLIP model with checkpoint laion2b_s34b_b79k from the open source implementation OpenCLIP (Ilharco et al., 2021). Although it mentions OpenCLIP, specific version numbers for software dependencies like Python, PyTorch, or CUDA are not provided. |
| Experiment Setup | Yes | All of our models are trained with 16-bit mixed precision and a batch size of 1600 for 80 epochs. We use a learning rate of 5e-5 with cosine annealing. At each training step we randomly sample 1 to 6 multi-views for a batch of objects. The pretrained CLIP model used for initialization as well as the contrastive target is the ViT-B/32 CLIP model with checkpoint laion2b_s34b_b79k from the open source implementation OpenCLIP (Ilharco et al., 2021). |
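The reported training recipe (base learning rate 5e-5 with cosine annealing over 80 epochs, and 1 to 6 randomly sampled views per object at each step) can be sketched with stdlib Python. This is a minimal illustration, not the authors' code: the exact annealing granularity (per-step vs. per-epoch) and the floor of the schedule are assumptions, and the function names are hypothetical.

```python
import math
import random

BASE_LR = 5e-5               # base learning rate from the paper
EPOCHS = 80                  # total training epochs from the paper
MIN_VIEWS, MAX_VIEWS = 1, 6  # views sampled per object at each step

def cosine_annealed_lr(epoch: int, base_lr: float = BASE_LR,
                       total: int = EPOCHS) -> float:
    """Standard cosine annealing (no restarts), decaying from base_lr to 0.

    Assumes per-epoch annealing with a zero floor; the paper does not
    specify either detail.
    """
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total))

def sample_num_views(rng: random.Random) -> int:
    """Draw the number of multi-views used for a training step."""
    return rng.randint(MIN_VIEWS, MAX_VIEWS)

rng = random.Random(0)
print(cosine_annealed_lr(0))       # 5e-05 at the start of training
print(cosine_annealed_lr(EPOCHS))  # decays to ~0 at the final epoch
print(sample_num_views(rng))       # an integer in [1, 6]
```

In a real run, the schedule value would be written into each optimizer parameter group every epoch (or step), e.g. via PyTorch's `CosineAnnealingLR`, and the sampled view count would determine how many rendered views are stacked per object in the batch.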