Duoduo CLIP: Efficient 3D Understanding with Multi-View Images
Authors: Han-Hung Lee, Yiming Zhang, Angel Chang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To compare our Duoduo CLIP to prior work, we conduct experiments showing shape classification on a collection of 3D assets (on Objaverse-LVIS in Sec. 4.2.1), real-world multi-view images (on MVImgNet in Sec. 4.2.2), and more fine-grained text-to-shape retrieval (Sec. 4.3). We also show image-to-shape retrieval results for out-of-distribution images (App. A.4.1) and concept mixing (App. A.4.2), where we retrieve shapes that maximize similarity to two input images. We first describe the experimental setup including implementation details and the datasets we use. |
| Researcher Affiliation | Academia | Han-Hung Lee1,* Yiming Zhang1,* Angel X. Chang1,2 (1Simon Fraser University, 2Canada CIFAR AI Chair, Amii) |
| Pseudocode | No | The paper describes methods using mathematical equations and text, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/3dlg-hcvc/DuoduoCLIP |
| Open Datasets | Yes | We follow OpenShape (Liu et al., 2023a) and train our model on a combination of four datasets consisting of Objaverse (Deitke et al., 2023), ABO (Collins et al., 2022), ShapeNet (Chang et al., 2015) and 3D-FUTURE (Fu et al., 2021). ... We leverage MVImgNet (Yu et al., 2023), a dataset comprising multi-view images for 220k real-world objects across 238 categories. ... ScanObjectNN (Uy et al., 2019) provides point clouds derived from objects in SceneNN (Hua et al., 2016) and ScanNet (Dai et al., 2017). |
| Dataset Splits | Yes | In total, the combined dataset contains 874k shapes, with 46k shapes within the LVIS subset of Objaverse which are used for evaluation. ... We filter those with at least 12 views, evenly sampled, resulting in 190k objects. ... we retain 66k objects for the training split and 16k for the validation set, spanning 180 categories. ... We evaluate on the test set, which contains 583 shapes across 15 categories... Additionally, we use the ScanNet validation set, consisting of 3825 objects across 17 classes... |
| Hardware Specification | Yes | Table 1 compares training time and GPUs used for our method and recent point cloud based methods: OpenShape (Liu et al., 2023a): 1× A100 (80GB), 300 hr; Uni3D (Zhou et al., 2024): 24× A100 (40GB), 20 hr; RECON++ (Qi et al., 2024): 8× A800 (80GB), 1 day; Ours (Full): 4× A40 (48GB), 14.3 hr; Ours (6 layers): 4× A5000 (24GB), 14.3 hr. |
| Software Dependencies | No | The pretrained CLIP model used for initialization as well as the contrastive target is the ViT-B/32 CLIP model with checkpoint laion2b_s34b_b79k from the open source implementation OpenCLIP (Ilharco et al., 2021). Although it mentions OpenCLIP, specific version numbers for software dependencies like Python, PyTorch, or CUDA are not provided. |
| Experiment Setup | Yes | All of our models are trained with 16-bit mixed precision and a batch size of 1600 for 80 epochs. We use a learning rate of 5e-5 with cosine annealing. At each training step we randomly sample 1 to 6 multi-views for a batch of objects. The pretrained CLIP model used for initialization as well as the contrastive target is the ViT-B/32 CLIP model with checkpoint laion2b_s34b_b79k from the open source implementation OpenCLIP (Ilharco et al., 2021). |
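The reported training recipe (base learning rate 5e-5 with cosine annealing over 80 epochs, and 1 to 6 randomly sampled views per object at each step) can be sketched with stdlib Python. This is a minimal illustration, not the authors' code: the exact annealing granularity (per-step vs. per-epoch) and the floor of the schedule are assumptions, and the function names are hypothetical.

```python
import math
import random

BASE_LR = 5e-5               # base learning rate from the paper
EPOCHS = 80                  # total training epochs from the paper
MIN_VIEWS, MAX_VIEWS = 1, 6  # views sampled per object at each step

def cosine_annealed_lr(epoch: int, base_lr: float = BASE_LR,
                       total: int = EPOCHS) -> float:
    """Standard cosine annealing (no restarts), decaying from base_lr to 0.

    Assumes per-epoch annealing with a zero floor; the paper does not
    specify either detail.
    """
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total))

def sample_num_views(rng: random.Random) -> int:
    """Draw the number of multi-views used for a training step."""
    return rng.randint(MIN_VIEWS, MAX_VIEWS)

rng = random.Random(0)
print(cosine_annealed_lr(0))       # 5e-05 at the start of training
print(cosine_annealed_lr(EPOCHS))  # decays to ~0 at the final epoch
print(sample_num_views(rng))       # an integer in [1, 6]
```

In a real run, the schedule value would be written into each optimizer parameter group every epoch (or step), e.g. via PyTorch's `CosineAnnealingLR`, and the sampled view count would determine how many rendered views are stacked per object in the batch.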