SPFormer: Enhancing Vision Transformer with Superpixel Representation
Authors: Jieru Mei, Liang-Chieh Chen, Alan Yuille, Cihang Xie
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation of SPFormer on the ImageNet dataset demonstrates its superior efficiency and performance over the DeiT baseline under varying configurations, as shown in Tab. 1. Specifically, SPFormer-S, which employs the standard ViT configuration with 196 tokens, exceeds the performance of DeiT-S by 1.1%, achieving a top-1 accuracy of 81.0% compared to 79.9% for DeiT-S. Furthermore, SPFormer-T outperforms DeiT-T by 1.4%, recording 73.6% versus 72.2%. Table 2: Ablation study on the design choices in SPFormer. Table 3: Semantic segmentation on ADE20K val split. Table 4: Semantic segmentation on Pascal Context val split. Table 5: Evaluation of superpixel quality in a zero-shot setting on Pascal VOC 2012 and Pascal-Parts-58 datasets, using 196 patches/superpixels. Table 6: Quantitative evaluation of SPFormer's robustness to rotation, comparing performance at different angles. |
| Researcher Affiliation | Collaboration | Jieru Mei (EMAIL), Department of Computer Science, Johns Hopkins University; Liang-Chieh Chen (EMAIL), ByteDance; Alan Yuille (EMAIL), Department of Computer Science, Johns Hopkins University; Cihang Xie (EMAIL), Department of Computer Science and Engineering, University of California, Santa Cruz |
| Pseudocode | No | The paper describes the mechanisms of SCA and iterative feature refinement using equations (Eq. 1, 3, 4) and textual descriptions, but no clearly labeled "Pseudocode" or "Algorithm" block is present. |
| Open Source Code | No | The paper does not provide an explicit statement or a direct link to the source code for the methodology described in this paper. Footnote 1 refers to official code for a different paper (SViT), not SPFormer. |
| Open Datasets | Yes | Our evaluation of SPFormer on the ImageNet dataset demonstrates its superior efficiency and performance over the DeiT baseline under varying configurations, as shown in Tab. 1. All models train on the ImageNet dataset (Russakovsky et al., 2015) for 300 epochs. Furthermore, we assess the generalizability of our superpixel representation using the COCO dataset (Lin et al., 2014). We evaluate SPFormer on the ADE20K (Zhou et al., 2017) and Pascal Context (Mottaghi et al., 2014) datasets. This test involved a quantitative analysis on both object and part levels using the Pascal VOC 2012 dataset (Everingham et al., 2015) and Pascal-Part-58 (Zhao et al., 2019). |
| Dataset Splits | Yes | All models train on the ImageNet dataset (Russakovsky et al., 2015) for 300 epochs. We evaluate SPFormer on the ADE20K (Zhou et al., 2017) and Pascal Context (Mottaghi et al., 2014) datasets. As shown in Tab. 3 and Tab. 4, the performance gains in mIoU are noteworthy when using ImageNet-pretrained models: 4.2% improvement on ADE20K and 2.8% on Pascal Context. Table 3: Semantic segmentation on ADE20K val split. Table 4: Semantic segmentation on Pascal Context val split. |
| Hardware Specification | No | The paper mentions "We thank the Center for AI Safety for supporting our computing needs." but does not specify any details about the hardware used for running the experiments (e.g., GPU models, CPU types, or memory). |
| Software Dependencies | No | The paper mentions using the "AdamW optimizer" and "Layer Scale technique" but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | Adhering to the protocols established in DeiT (Touvron et al., 2021a), we implement robust data augmentations, use the AdamW optimizer, and follow a cosine decay learning rate schedule. All models train on the ImageNet dataset (Russakovsky et al., 2015) for 300 epochs. During SPFormer-B/16 training, significant overfitting challenges arose. Increasing the Stochastic Depth (Huang et al., 2016) rate from 0.1 to 0.6 effectively addressed these issues. |
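The Experiment Setup row quotes a DeiT-style recipe: AdamW, a cosine decay learning-rate schedule, and 300 training epochs. A minimal sketch of such a schedule is shown below; the base learning rate, minimum learning rate, and warmup length are assumptions (typical DeiT defaults), since the report does not state them.

```python
import math

# Assumed hyperparameters (DeiT-style defaults, not given in the report).
BASE_LR = 5e-4
MIN_LR = 1e-5
WARMUP_EPOCHS = 5
TOTAL_EPOCHS = 300  # stated in the report


def cosine_lr(epoch: int) -> float:
    """Learning rate at a given epoch: linear warmup, then cosine decay.

    Warmup ramps from BASE_LR / WARMUP_EPOCHS up to BASE_LR, after which
    the rate follows a half-cosine from BASE_LR down to MIN_LR at the
    final epoch.
    """
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop this value would be written into each parameter group of a `torch.optim.AdamW` optimizer once per epoch; the stochastic depth rate mentioned for SPFormer-B/16 (raised from 0.1 to 0.6) is a separate per-layer drop probability applied inside the model, not part of the schedule above.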