UNIP: Rethinking Pre-trained Attention Patterns for Infrared Semantic Segmentation

Authors: Tao Zhang, Jinyong Wen, Zhen Chen, Kun Ding, Shiming Xiang, Chunhong Pan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this study, we first benchmark the infrared semantic segmentation performance of various pre-training methods and reveal several phenomena distinct from the RGB domain. Next, our layerwise analysis of pre-trained attention maps uncovers that... Experimental results show that UNIP outperforms various pre-training methods by up to 13.5% in average mIoU on three infrared segmentation tasks, evaluated using fine-tuning and linear probing metrics.
Researcher Affiliation | Academia | 1 MAIS, Institute of Automation, Chinese Academy of Sciences, China; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences, China; 3 School of Automation Science and Electrical Engineering, Beihang University, China
Pseudocode | No | The paper describes methods and processes in text and with diagrams (e.g., Figure 1, Figure 8), but does not contain any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Our code is available at https://github.com/casiatao/UNIP. ... Source Code. The source code of our work is available at this link. Researchers can access and utilize our code to reproduce the experimental results in this paper. The source code and pre-trained model weights will be made publicly available.
Open Datasets | Yes | The evaluation is conducted on three infrared semantic segmentation datasets: SODA (Li et al., 2021a), MFNet-T (Ha et al., 2017), and SCUT-Seg (Xiong et al., 2021). ... Additionally, RGB datasets like ImageNet-1K (Deng et al., 2009) and ADE20K (Zhou et al., 2017) are also used for comparison. ... To alleviate the distribution shift and reduce texture bias when distilling RGB pre-trained models for infrared tasks, we develop InfMix, a mixed dataset for distillation. ... It comprises 859,375 images from both RGB and infrared modalities, constructed through four steps. (1) ... so we collect a large and unlabelled infrared pre-training dataset called InfPre. It consists of 541,088 images from 23 infrared-related datasets. ... (2) A subset of ImageNet-1K (Deng et al., 2009) is used... (3) The training set of COCO (Lin et al., 2014)...
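The InfMix construction quoted above boils down to pooling image lists from infrared (InfPre) and RGB (ImageNet-1K subset, COCO) sources into one distillation set. A hypothetical sketch of that pooling step, with illustrative paths and counts rather than the authors' actual pipeline:

```python
# Hypothetical sketch of assembling a mixed RGB/infrared distillation list
# in the spirit of the InfMix description; directory names and sample
# counts are illustrative assumptions, not the paper's pipeline.
import random


def build_mix(infrared, imagenet_subset, coco_train, seed=0):
    """Concatenate infrared and RGB image paths, then shuffle so that
    each distillation batch interleaves both modalities."""
    rng = random.Random(seed)
    mixed = list(infrared) + list(imagenet_subset) + list(coco_train)
    rng.shuffle(mixed)
    return mixed


mix = build_mix(
    infrared=[f"infpre/{i}.png" for i in range(5)],
    imagenet_subset=[f"in1k/{i}.jpg" for i in range(3)],
    coco_train=[f"coco/{i}.jpg" for i in range(2)],
)
# All source images survive the merge; only the order is randomized.
```

Fixing the shuffle seed keeps the mixed ordering reproducible across runs, which matters when comparing distillation checkpoints.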
Dataset Splits | Yes | SODA (Li et al., 2021a). This dataset ... comprises 1,168 training images and 1,000 test images... MFNet (Ha et al., 2017). ... It is divided into 784 training images, 392 validation images, and 393 test images... SCUT-Seg (Xiong et al., 2021). This dataset includes 1,345 training images and 665 test images... ADE20K (Zhou et al., 2017). It consists of 20,210 training images and 2,000 test images... ImageNet-1K (Deng et al., 2009). ... consisting of ... roughly 1.2 million training images, 50,000 validation images, and 100,000 test images.
Hardware Specification | Yes | All experiments are conducted using the PyTorch toolkit (Paszke et al., 2019) on 8 NVIDIA RTX 3090 GPUs.
Software Dependencies | No | All experiments are conducted using the PyTorch toolkit (Paszke et al., 2019) on 8 NVIDIA RTX 3090 GPUs. The models are trained for 100 epochs using MMSegmentation (Contributors, 2020). While PyTorch and MMSegmentation are mentioned, specific version numbers for these software components (other than the year of their respective papers) are not provided.
Experiment Setup | Yes | Table 12: Settings of semantic segmentation (Hyperparameters: Input resolution, Training epochs, Training iterations, Peak learning rate, Batch size, Optimizer, Weight decay, Optimizer momentum, Learning rate schedule, Minimal learning rate, Warmup steps). Table 14: Settings of pre-training (Hyperparameters: Input resolution, Training epochs, Warmup epochs, Optimizer, Base learning rate, Weight decay, Optimizer momentum, Batch size, Learning rate schedule, Augmentation). In C.1: "models are trained for 100 epochs... keep the learning rate constant and sweep the layerwise decay rate across {0.5, 0.65, 0.75, 0.85, 1.0}". In C.4: "lr = base_lr × batch_size / 256".
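The two learning-rate rules quoted from C.1 and C.4 can be sketched directly. This is a minimal illustration of the linear scaling rule and a common form of layerwise decay; the 12-layer depth is an assumption for a ViT-Base-style backbone, not a value stated in the quotes:

```python
# Sketch of the quoted learning-rate rules. The scaling rule and decay
# sweep values come from the quoted settings; the per-layer schedule and
# num_layers=12 are common-practice assumptions for illustration.

def scaled_lr(base_lr: float, batch_size: int) -> float:
    """Linear scaling rule from C.4: lr = base_lr * batch_size / 256."""
    return base_lr * batch_size / 256


def layerwise_lrs(peak_lr: float, decay: float, num_layers: int = 12):
    """Layer i (0 = embedding, num_layers = top block) trains with
    peak_lr * decay ** (num_layers - i): deeper layers get larger LRs."""
    return [peak_lr * decay ** (num_layers - i) for i in range(num_layers + 1)]


peak = scaled_lr(base_lr=1e-3, batch_size=512)  # 1e-3 * 512 / 256 = 2e-3
for decay in (0.5, 0.65, 0.75, 0.85, 1.0):      # the sweep quoted in C.1
    lrs = layerwise_lrs(peak, decay)
    # decay = 1.0 reduces to a uniform LR across all layers.
```

A decay of 1.0 recovers plain fine-tuning (one LR everywhere), so the quoted sweep effectively compares uniform training against progressively stronger freezing of early layers.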