SGFormer: Semantic-Geometry Fusion Transformer for Multi-modal 3D Panoptic Segmentation
Authors: Hongqi Yu, Sixian Chan, Xiaolong Zhou, Xiaoqin Zhang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Notably, SGFormer achieves state-of-the-art (SOTA) results on nuScenes and SemanticPOSS, as well as yielding competitive performance on SemanticKITTI. Moreover, SGFormer exhibits superior robustness compared to leading methods, marking an improvement of 2% to 10%. Table 1: Comparison of 3D panoptic segmentation on the nuScenes validation set, in which PQ% is the primary metric for comparison. The first- and second-best results are highlighted in bold and underline, respectively. Table 2: Comparison on the nuScenes test set. Table 3: Comparison on the SemanticKITTI validation set. Table 4: Comparison on the SemanticPOSS validation set. Table 5: Competitive results under different robustness settings. Table 6: Ablation study of network architecture. Table 7: Detailed ablation study for the ASCA. Table 8: Ablation study for the SGTransformer. |
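The evidence above names PQ% as the primary comparison metric. A minimal sketch of the standard panoptic quality (PQ) definition (from Kirillov et al.'s panoptic segmentation formulation, not SGFormer-specific code) may help when reproducing the reported numbers; the function name and signature are illustrative assumptions:

```python
# Sketch of the standard panoptic quality (PQ) metric: predicted and
# ground-truth segments that match with IoU > 0.5 count as true positives,
# and PQ averages the matched IoUs while penalizing unmatched segments.

def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = sum(IoU over TP) / (|TP| + 0.5 * |FP| + 0.5 * |FN|).

    matched_ious: IoU values of true-positive segment pairs (each > 0.5).
    num_fp:       unmatched predicted segments (false positives).
    num_fn:       unmatched ground-truth segments (false negatives).
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom > 0 else 0.0
```

For example, two perfect matches with no spurious or missed segments give PQ = 1.0, while one 0.8-IoU match plus one false positive and one false negative gives PQ = 0.8 / 2 = 0.4.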
| Researcher Affiliation | Academia | Hongqi Yu¹, Sixian Chan²*, Xiaolong Zhou³, Xiaoqin Zhang¹*. ¹Key Laboratory of Intelligent Informatics for Safety and Emergency of Zhejiang Province, Wenzhou University, China; ²College of Computer Science and Technology, Zhejiang University of Technology, China; ³College of Electrical and Information Engineering, Quzhou University, China |
| Pseudocode | No | The paper describes the methodology using descriptive text and mathematical formulations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Datasets. nuScenes (Fong et al. 2022) is a large-scale benchmark containing 1000 scenes. SemanticKITTI (Behley et al. 2019) is an outdoor dataset consisting of 22 sequences. SemanticPOSS (Pan et al. 2020) is a challenging benchmark including 2988 scenes across 6 sequences. ... Table 9: Comparison (mIoU) of different baselines with ASCA (*) on Cityscapes (val), PASCAL VOC (val), and CamVid (test). ... Table 10: Comparison on out-of-distribution generalization. ... from the Robo3D (Kong et al. 2023) benchmark. |
| Dataset Splits | Yes | Results on nuScenes. As shown in Table 1, SGFormer outperforms state-of-the-art methods with higher panoptic segmentation performance on the nuScenes val set. Specifically, our method surpasses the recent LCPS (Zhang et al. 2023) by 1.1% on PQ and 0.8% on mIoU. Moreover, in Table 2, our SGFormer achieves better results than Panoptic-PHNet (Li et al. 2022) and further surpasses LCPS on all metrics. These results demonstrate that SGFormer can better distinguish objects through semantic-geometry fusion, significantly advancing 3D panoptic segmentation. Results on SemanticKITTI and SemanticPOSS. ... Additionally, on SemanticPOSS, which features much smaller and sparser point clouds, SGFormer surpasses existing methods across almost all metrics in Table 4. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | Implementation Details. The specific details are provided in the supplementary material. In ASCA, the groups g = 8 for alignment. In SGTransformer, we set δ to 0.1 and use two fusion layers, each layer with four self-attention and one cross-attention equipped with 128 input channels. In terms of loss weights, we set λhm = 100, λo = 10 and λc = 1. |
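The hyperparameters quoted above can be collected into a single configuration sketch for reimplementation. The structure, field names, and loss-term interpretation (heatmap, offset, classification) below are assumptions; only the values (g = 8, δ = 0.1, two fusion layers with four self-attention and one cross-attention blocks, 128 channels, λhm = 100, λo = 10, λc = 1) come from the paper:

```python
# Hedged sketch of SGFormer's reported hyperparameters, gathered into one
# config dict. Key names are illustrative; values are from the paper.
SGFORMER_CONFIG = {
    "asca": {"groups": 8},               # g = 8 groups for alignment
    "sg_transformer": {
        "delta": 0.1,                    # δ in the SGTransformer
        "fusion_layers": 2,              # two fusion layers
        "self_attn_per_layer": 4,        # four self-attention blocks each
        "cross_attn_per_layer": 1,       # one cross-attention block each
        "channels": 128,                 # input channels
    },
    "loss_weights": {"hm": 100.0, "o": 10.0, "c": 1.0},
}

def total_loss(l_hm, l_o, l_c, weights=SGFORMER_CONFIG["loss_weights"]):
    """Weighted sum of the three loss terms (term names are assumptions)."""
    return weights["hm"] * l_hm + weights["o"] * l_o + weights["c"] * l_c
```

With unit-valued terms, the weighting yields 100 + 10 + 1 = 111, which makes the relative emphasis on the λhm-weighted term explicit.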