Point-SAM: Promptable 3D Segmentation Model for Point Clouds
Authors: Yuchen Zhou, Jiayuan Gu, Tung Chiang, Fanbo Xiang, Hao Su
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our model outperforms state-of-the-art 3D segmentation models on several indoor and outdoor benchmarks and demonstrates a variety of applications, such as interactive 3D annotation and zero-shot 3D instance proposal. (Section 5, Experiments) We have conducted the experiments showing the strong zero-shot transferability and the superior efficiency of our method. |
| Researcher Affiliation | Collaboration | Yuchen Zhou1,3 Jiayuan Gu2 Tung Yen Chiang3 Fanbo Xiang1 Hao Su1,3 1Hillbot Inc. 2ShanghaiTech University 3UC San Diego |
| Pseudocode | No | The paper describes methods in narrative text and figures (e.g., Figure 2 for network architecture) but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We utilize synthetic datasets including the training split of PartNet (Mo et al., 2019), PartNet-Mobility (Xiang et al., 2020), and Fusion360 (Lambourne et al., 2021)... For scene-level datasets, we use the training split of ScanNet200 (Dai et al., 2017)... We use our data engine to generate pseudo labels for 20000 shapes from ShapeNet (Chang et al., 2015)... We evaluate on a heterogeneous collection of datasets... PartNet-Mobility (Xiang et al., 2020) and the real-world dataset ScanObjectNN (Uy et al., 2019)... For scene-level evaluation, we use S3DIS (Armeni et al., 2016) and KITTI-360 (Liao et al., 2022)... We compare with OpenMask3D (Takmaz et al., 2024) on Replica (Straub et al., 2019)... We use ShapeNetPart (Yi et al., 2016)... |
| Dataset Splits | Yes | We utilize synthetic datasets including the training split of PartNet (Mo et al., 2019)... For PartNet-Mobility, we hold out 3 categories (scissors, refrigerators, and doors) not included in ShapeNet, which are used for evaluation on unseen categories... For scene-level datasets, we use the training split of ScanNet200 (Dai et al., 2017) and augment it by splitting each scene into blocks. The augmented version is denoted as ScanNet-Block. Concretely, we use a 3m × 3m block with a stride of 1.5m. We use FPS to sample 32768 points per scene or block... For each object, we pre-sample 32,768 points before training and then perform online random sampling of 10,000 points from these 32,768 points for actual training. |
| Hardware Specification | Yes | The training batch size for Point-SAM, utilizing ViT-g as the encoder, is set to 4 per GPU with a gradient accumulation of 4, and it is trained on 8 NVIDIA H100 GPUs with a total batch size of 128. The ViT-l version can be trained across 2 NVIDIA A100 GPUs... We test the time and memory efficiency on a single NVIDIA RTX 4090 GPU using point clouds from KITTI-360. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify software versions for libraries like PyTorch, TensorFlow, or Python, nor CUDA versions. |
| Experiment Setup | Yes | Point-SAM is trained with the AdamW optimizer. We train Point-SAM for 100k iterations. The learning rate (lr) is set to 5e-5 after learning rate warmup. Initially, the lr is warmed up for 3k iterations, starting at 5e-8. A step-wise lr scheduler with a decay factor of 0.1 is then used, with lr reductions at 60k and 90k iterations. The weight decay is set to 0.1. The training batch size for Point-SAM, utilizing ViT-g as the encoder, is set to 4 per GPU with a gradient accumulation of 4, and it is trained on 8 NVIDIA H100 GPUs with a total batch size of 128... For training, we randomly sample 10,000 points as input. Besides, we normalize the input points to fit within a unit sphere centered at zero, to standardize the inputs. The number of patches L and the patch size K are set to 512 and 64 by default. We apply several data augmentation techniques during training. For each object, we pre-sample 32,768 points before training and then perform online random sampling of 10,000 points from these 32,768 points for actual training. We apply a random scale for the normalized points with a scale factor of [0.8, 1.0] and a random rotation along the y-axis from −180° to 180°. For object point clouds we also apply a random rotation perturbation to the x- and z-axes. The perturbation angles are sampled from a normal distribution with a standard deviation (sigma) of 0.06, and then these angles are clipped to the range [−0.18, 0.18]. |
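The ScanNet-Block augmentation quoted in the dataset-splits row (3m × 3m blocks with a 1.5m stride) can be sketched as follows. This is a minimal reimplementation from the description alone; the function name and the boundary handling (half-open blocks, skipping empty blocks) are our assumptions, not details from the paper:

```python
import numpy as np

def split_scene_into_blocks(points, block_size=3.0, stride=1.5):
    """Split a scene point cloud of shape (N, 3) into overlapping XY blocks.

    Hypothetical sketch of the ScanNet-Block augmentation: 3m x 3m
    blocks swept over the scene's XY extent with a 1.5m stride, so
    adjacent blocks overlap by half.
    """
    mins, maxs = points[:, :2].min(axis=0), points[:, :2].max(axis=0)
    blocks = []
    x = mins[0]
    while x < maxs[0]:
        y = mins[1]
        while y < maxs[1]:
            # Select points falling inside the current half-open block.
            mask = (
                (points[:, 0] >= x) & (points[:, 0] < x + block_size)
                & (points[:, 1] >= y) & (points[:, 1] < y + block_size)
            )
            if mask.any():
                blocks.append(points[mask])
            y += stride
        x += stride
    return blocks
```

Because the stride is half the block size, most points land in several blocks; the paper then applies FPS to sample 32,768 points per block.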
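The learning-rate schedule in the experiment-setup row (warmup from 5e-8 to 5e-5 over 3k iterations, then a step-wise ×0.1 decay at 60k and 90k of 100k total iterations) could look like the sketch below. The linear shape of the warmup is an assumption; the paper only reports the start value, the post-warmup value, and the milestone steps:

```python
def point_sam_lr(step, base_lr=5e-5, warmup_start=5e-8,
                 warmup_steps=3_000, milestones=(60_000, 90_000), gamma=0.1):
    """Return the learning rate at a given training step.

    Sketch of the reported schedule: linear warmup (shape assumed)
    from warmup_start to base_lr, then step decay by gamma at each
    milestone.
    """
    if step < warmup_steps:
        frac = step / warmup_steps
        return warmup_start + frac * (base_lr - warmup_start)
    lr = base_lr
    for m in milestones:
        if step >= m:
            lr *= gamma
    return lr
```

An equivalent schedule can be assembled in PyTorch from `LambdaLR` (warmup) chained with `MultiStepLR` (milestones 60k/90k, gamma 0.1).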
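The object-level augmentation recipe (random scale in [0.8, 1.0], a full random y-axis rotation, and small x/z rotation perturbations drawn from a normal distribution with sigma 0.06 and clipped to ±0.18 rad) might be implemented roughly as below. The rotation composition order and the interpretation of the perturbations as radians are assumptions:

```python
import numpy as np

def augment_object_points(points, rng):
    """Apply the described augmentations to an (N, 3) object point cloud.

    Hypothetical sketch: uniform scale, y-axis rotation, then clipped
    Gaussian rotation perturbations about the x- and z-axes.
    """
    pts = points * rng.uniform(0.8, 1.0)          # random scale in [0.8, 1.0]
    theta = rng.uniform(-np.pi, np.pi)            # y-axis rotation, +/-180 deg
    c, s = np.cos(theta), np.sin(theta)
    ry = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    # Small perturbations about x and z: N(0, 0.06), clipped to +/-0.18 rad.
    ax = np.clip(rng.normal(0.0, 0.06), -0.18, 0.18)
    az = np.clip(rng.normal(0.0, 0.06), -0.18, 0.18)
    cx, sx = np.cos(ax), np.sin(ax)
    rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    cz, sz = np.cos(az), np.sin(az)
    rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return pts @ (rz @ rx @ ry).T
```

Since rotations preserve norms, the augmented cloud still fits inside the unit sphere after the scale step, consistent with the normalization described above.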