PolaFormer: Polarity-aware Linear Attention for Vision Transformers
Authors: Weikang Meng, Yadan Luo, Xin Li, Dongmei Jiang, Zheng Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that the proposed PolaFormer improves performance on various vision tasks, enhancing both expressiveness and efficiency by up to 4.6%. In this section, we evaluate our PolaFormer model on three tasks: image classification on ImageNet-1K (Deng et al., 2009), object detection and instance segmentation on COCO (Lin et al., 2014), and semantic segmentation on ADE20K (Zhou et al., 2019), comparing its performance with previous efficient vision models. Additionally, we assess PolaFormer on the Long Range Arena (LRA) task (Tay et al., 2021) to compare against other linear attention models. |
| Researcher Affiliation | Academia | ¹ Harbin Institute of Technology, Shenzhen, China; ² Pengcheng Laboratory, China; ³ UQMM Lab, University of Queensland, Australia |
| Pseudocode | No | The paper describes the methodology using mathematical formulations and descriptive text, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present any structured, code-like procedures. |
| Open Source Code | Yes | Code is available at https://github.com/ZacharyMeng/PolaFormer. |
| Open Datasets | Yes | We evaluate our PolaFormer model on three tasks: image classification on ImageNet-1K (Deng et al., 2009), object detection and instance segmentation on COCO (Lin et al., 2014), and semantic segmentation on ADE20K (Zhou et al., 2019), comparing its performance with previous efficient vision models. Additionally, we assess PolaFormer on the Long Range Arena (LRA) task (Tay et al., 2021) to compare against other linear attention models. |
| Dataset Splits | Yes | The ImageNet-1K (Deng et al., 2009) dataset is the widely used dataset for image classification tasks, containing 1,000 categories and over 1.2 million training images. We further validate the effectiveness of the proposed approach across various vision tasks, including the object detection task on the COCO dataset (Lin et al., 2014), which contains over 118K training images and 5K validation images. |
| Hardware Specification | Yes | The models were pretrained on 8 NVIDIA A800 GPUs and fine-tuned on 8 NVIDIA RTX A6000 and 8 NVIDIA RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions several software components and projects like the 'AdamW optimizer', 'Swin Transformer implementation made by Microsoft', 'mmcv-detection (Contributors, 2018) project', 'mmcv-segmentation (Contributors, 2018) project', and 'Skyformer (Chen et al., 2021)'. However, it does not provide specific version numbers for these software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | In this task, we use the AdamW optimizer (Loshchilov & Hutter, 2019) to train all of our models for 400 epochs, including 20 epochs for linear warm-up. The basic learning rate is set to 1e-3 for a 1024 batch size. Additionally, we use a weight decay of 5e-2. For the PVT model, we select from RetinaNet and Mask R-CNN as detectors, with the schedule set to 1×. For the Swin model, we choose between Mask R-CNN and Cascade Mask R-CNN as detectors, where models using Mask R-CNN are experimented with under both 1× and 3× schedule settings, while models using Cascade Mask R-CNN are trained under the 3× schedule. The training epoch is set to 12 per 1× schedule, and we use the AdamW optimizer with a learning rate of 1e-4 and a weight decay of 1e-4. For ListOps and Text Classification, we set the batch size to 32 with a 1e-4 learning rate. For Pathfinder, we set the batch size to 128 with a 5e-4 learning rate. For Image Classification, we set the batch size to 256 with a 1e-4 learning rate. For the Retrieval sub-task, we set the batch size to 16 with a 2e-4 learning rate. All models are trained from scratch using the AdamW optimizer. |
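The classification recipe quoted in the Experiment Setup row (AdamW, 400 epochs with 20 linear warm-up epochs, base learning rate 1e-3 at batch size 1024, weight decay 5e-2) can be sketched as a per-epoch learning-rate function. This is an illustrative reconstruction, not the authors' code: the paper specifies only the linear warm-up, so the cosine decay after warm-up is an assumption based on common practice.

```python
import math

def lr_at_epoch(epoch, total_epochs=400, warmup_epochs=20,
                base_lr=1e-3, min_lr=0.0):
    """Learning rate for a given epoch: linear warm-up for the first
    `warmup_epochs`, then an *assumed* cosine decay to `min_lr`."""
    if epoch < warmup_epochs:
        # Linear ramp from 0 up to base_lr over the warm-up phase.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay is an assumption; the paper states only the warm-up.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With the quoted linear-scaling convention, the base rate of 1e-3 is tied to a batch size of 1024; a different batch size would typically rescale `base_lr` proportionally.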
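For context on the attention family this report concerns: linear attention replaces the softmax with a kernel feature map so the key-value summary can be precomputed, reducing cost from O(N²d) to O(Nd²). The sketch below uses a generic elu(x)+1 feature map, which is a common baseline choice; it is not PolaFormer's polarity-aware map, whose construction is specific to the paper.

```python
import numpy as np

def feature_map(x):
    # Generic positive feature map elu(x) + 1 (a common baseline);
    # PolaFormer's actual polarity-aware map differs from this.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """O(N d^2) attention: phi(Q) @ (phi(K)^T V), row-normalized,
    equivalent to kernelized attention without materializing N x N."""
    q, k = feature_map(Q), feature_map(K)
    kv = k.T @ V            # (d, d_v) summary over all keys/values
    z = q @ k.sum(axis=0)   # (N,) per-query normalizer
    return (q @ kv) / (z[:, None] + eps)
```

The associativity trick is the whole point: `q @ (k.T @ V)` gives the same result as normalizing the full `q @ k.T` attention matrix, but never builds it.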