Point Cloud Mixture-of-Domain-Experts Model for 3D Self-supervised Learning

Authors: Yaohua Zha, Tao Dai, Hang Guo, Yanzi Wang, Bin Chen, Ke Chen, Shu-Tao Xia

IJCAI 2025

Reproducibility assessment (Variable / Result / LLM Response):
Research Type: Experimental. "Extensive experiments in downstream tasks demonstrate the superiority of our model." (Section 4, Experiments.) The Point-MoDE model is first pre-trained with the block-to-scene pretraining strategy on scene-domain point cloud data, then transferred directly to downstream tasks in different point cloud domains for fine-tuning. Reported results cover classification on real-scanned (ScanObjectNN) and synthetic (ModelNet40) point clouds (Table 1), part segmentation on ShapeNetPart (Table 2), object detection on ScanNet V2 (Table 3), semantic segmentation on S3DIS Area 5 (Table 4), and an ablation study (Section 4.4).
Researcher Affiliation: Academia. Yaohua Zha (1,2), Tao Dai (3), Hang Guo (1), Yanzi Wang (1), Bin Chen (4), Ke Chen (2), Shu-Tao Xia (2). (1) Tsinghua Shenzhen International Graduate School, Tsinghua University; (2) Institute of Perceptual Intelligence, Pengcheng Laboratory; (3) College of Computer Science and Software Engineering, Shenzhen University; (4) Harbin Institute of Technology, Shenzhen.
Pseudocode: No. The paper describes the methodology in Section 3 and illustrates the architecture in Figure 3, but it contains no explicitly labeled "Pseudocode" or "Algorithm" block and no structured, code-like procedure.
Open Source Code: No. The paper states, "Using these models pre-trained on the object domain allows us to incorporate object-level priors into our pipeline..." and "Since these models are open-source, we directly use the pre-trained weights provided by the official repository." This refers to third-party open-source models that the authors build upon (e.g., Point-MAE, Point-BERT, PointGPT), not to a release of the authors' own Point-MoDE code for the methodology described in this paper.
Open Datasets: Yes. ShapeNet [Chang et al., 2015]; SUN RGB-D [Song et al., 2015]; ScanNet V2 [Dai et al., 2017]; ScanObjectNN [Uy et al., 2019]; ModelNet40 [Wu et al., 2015]; ShapeNetPart [Chang et al., 2015]; and S3DIS [Armeni et al., 2016] for semantic segmentation.
Dataset Splits: Yes. "We validated our model using the indoor S3DIS [Armeni et al., 2016] dataset for semantic segmentation tasks. Specifically, we tested the model on Area 5 while training on the other areas and report the mean IoU (mIoU) and mean accuracy (mAcc)." For ModelNet40, 1K points are used as input with scale-and-translate data augmentation, and classification accuracy is reported with the standard voting mechanism. The segmentation head after the pre-trained encoder follows previous works [Pang et al., 2022; Zhang et al., 2022] for fair comparison.
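The Area-5 protocol quoted above (train on all areas, test on Area 5) can be sketched as follows. This is a minimal, stdlib-only illustration; the room identifiers (e.g. `Area_5_office_2`) follow the conventional S3DIS naming and are not taken from the paper.

```python
def split_s3dis(sample_ids, test_area="Area_5"):
    """Train on all S3DIS areas except `test_area`; test on `test_area` only."""
    train = [s for s in sample_ids if not s.startswith(test_area)]
    test = [s for s in sample_ids if s.startswith(test_area)]
    return train, test

rooms = ["Area_1_office_1", "Area_2_hallway_3", "Area_5_office_2", "Area_6_lounge_1"]
train_ids, test_ids = split_s3dis(rooms)
# train_ids -> ["Area_1_office_1", "Area_2_hallway_3", "Area_6_lounge_1"]
# test_ids  -> ["Area_5_office_2"]
```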
Hardware Specification: Yes. "We train the model for 200 epochs using 8 A100 GPUs."
Software Dependencies: No. The paper mentions the AdamW optimizer and architectural components such as a Transformer and a PointNet-based token embedding layer, but it does not specify any software libraries or frameworks with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x, Python 3.8).
Experiment Setup: Yes. "We use the AdamW optimizer with a base learning rate of 5e-4 and a weight decay of 0.1. We train the model for 200 epochs using 8 A100 GPUs. For the object point cloud masks, we set the mask ratio to 60% following previous work [Pang et al., 2022; Dong et al., 2023]. Simultaneously, we randomly select 32 point blocks from the scene point cloud, with each block containing 2K local points. We extracted 50K points for each of the 6.2K samples. For ScanObjectNN, we use 2K points as input. For ModelNet40, we use 1K points as input."
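The quoted setup can be sketched as follows. This is a minimal, stdlib-only illustration of the stated hyperparameters and the two sampling steps (60% object-token masking; 32 scene blocks of "2K" points each). Reading "2K" as 2048 and grouping blocks by nearest neighbours around a random centre are assumptions; the Point-MoDE architecture itself is not reproduced here.

```python
import random

# Hyperparameters quoted in the "Experiment Setup" row above.
CONFIG = {
    "optimizer": "AdamW",
    "base_lr": 5e-4,
    "weight_decay": 0.1,
    "epochs": 200,
    "mask_ratio": 0.60,        # object point cloud mask ratio
    "scene_blocks": 32,        # point blocks sampled per scene
    "points_per_block": 2048,  # "2K local points" (2048 is an assumption)
}

def random_mask(num_tokens, ratio=CONFIG["mask_ratio"], rng=random):
    """Boolean mask hiding `ratio` of the object point-patch tokens."""
    num_mask = int(num_tokens * ratio)
    mask = [False] * num_tokens
    for i in rng.sample(range(num_tokens), num_mask):
        mask[i] = True
    return mask

def sample_scene_blocks(scene_points, cfg=CONFIG, rng=random):
    """Sample local point blocks from a scene cloud: pick a random centre,
    then group its nearest points into one block. (Nearest-neighbour grouping
    is an assumption; the paper only says blocks are randomly selected.)"""
    blocks = []
    for _ in range(cfg["scene_blocks"]):
        cx, cy, cz = rng.choice(scene_points)
        nearest = sorted(
            scene_points,
            key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2,
        )[: cfg["points_per_block"]]
        blocks.append(nearest)
    return blocks
```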