Wavelet-Driven Masked Image Modeling: A Path to Efficient Visual Representation
Authors: Wenzhao Xiang, Chang Liu, Hongyang Yu, Xilin Chen
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method achieves comparable or superior performance across various downstream tasks while exhibiting higher training efficiency. [...] Main Results Image Classification. We evaluate our WaMIM against existing MIM models, analyzing both pre-training efficiency and top-1 fine-tuning accuracy. The results are illustrated in Figure 1c and summarized in Table 1. [...] Ablation studies In this part, we conduct ablation experiments to assess the impact of key components in our approach and validate the design choices. Tables 4a–4f show the WaMIM ablation experimental results with ViT-B and Swin-B on ImageNet-1K. |
| Researcher Affiliation | Academia | Wenzhao Xiang1,2,3, Chang Liu4, Hongyang Yu2*, Xilin Chen1,3 — 1Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; 2Peng Cheng Laboratory; 3University of Chinese Academy of Sciences; 4Department of Electronic Engineering, Shanghai Jiao Tong University. EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology textually and with mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states "* indicates results reproduced using the official code." referring to other methods but does not provide an explicit statement or a link for the code implementation of their own proposed method, WaMIM. |
| Open Datasets | Yes | We perform pre-training on the ImageNet-1K dataset (Russakovsky et al. 2015) without any ground-truth labels. [...] We start by fine-tuning Mask R-CNN with the COCO (Lin et al. 2014) train2017 dataset split [...] We further evaluate our method on semantic segmentation tasks using the ADE20k dataset (Zhou et al. 2017) |
| Dataset Splits | Yes | The input images are segmented into patches of size p = 16 for ViT and p = 4 for Swin, and are randomly masked with a default ratio of r = 0.75. [...] We start by fine-tuning Mask R-CNN with the COCO (Lin et al. 2014) train2017 dataset split, followed by an evaluation of its performance on the val2017 split [...] We adapt the pre-trained ViT models for semantic segmentation using UperNet (Xiao et al. 2018) as the segmentor. We perform end-to-end fine-tuning on the ADE20k dataset (Zhou et al. 2017) for 160k iterations with an input resolution of 512×512, and evaluate performance on the validation set using the mIoU metric. |
| Hardware Specification | Yes | To ensure a fair comparison, we compute the pre-training efficiency of each method on identical hardware, utilizing a single Tesla A100-40G GPU, CUDA 11.7, and PyTorch 1.13. |
| Software Dependencies | Yes | To ensure a fair comparison, we compute the pre-training efficiency of each method on identical hardware, utilizing a single Tesla A100-40G GPU, CUDA 11.7, and PyTorch 1.13. |
| Experiment Setup | Yes | We perform pre-training on the ImageNet-1K dataset (Russakovsky et al. 2015) without any ground-truth labels. We use the columnar ViT (Dosovitskiy et al. 2021) and pyramidal Swin (Liu et al. 2021) architectures for the encoder, with an input size of 224×224. The input images are segmented into patches of size p = 16 for ViT and p = 4 for Swin, and are randomly masked with a default ratio of r = 0.75. Basic data augmentations, including random cropping and horizontal flipping, are applied. For each architecture, we construct four reconstruction targets that vary from high-frequency, low-level to low-frequency, high-level. We perform a 5-level wavelet decomposition on the input image and select wavelet coefficients from levels 2 to 5 as the reconstruction targets, resulting in scales of {56², 28², 14², 7²}. In the Swin architecture, we use output features from stages {2, 4, 22, 24} for prediction. Each feature's decoder consists of a transformer block with an embedding dimension of 128 and 4 attention heads. For ViT, the chosen layers are {3, 6, 9, 12}, with each decoder comprising a transformer block with an embedding dimension of 256 and 8 attention heads. Loss weights are set to {0.8, 0.9, 1.1, 1.2}. The Haar wavelet basis is employed for the wavelet transform, and the wavelet coefficients used for reconstruction are normalized. |
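The target construction described in the Experiment Setup row can be illustrated with a minimal sketch: a 5-level 2D Haar decomposition of a 224×224 input, keeping the normalized detail coefficients from levels 2–5, which indeed yields the {56², 28², 14², 7²} scales quoted above. This is not the authors' code (which is not released); the function names and the per-level normalization details are illustrative assumptions.

```python
import numpy as np

def haar_dwt2(x):
    """One level of 2D Haar DWT.

    Returns the half-resolution approximation LL and the three
    detail bands (LH, HL, HH)."""
    # Haar analysis along rows: pairwise average and difference.
    a = (x[0::2, :] + x[1::2, :]) / np.sqrt(2)
    d = (x[0::2, :] - x[1::2, :]) / np.sqrt(2)
    # Same along columns.
    ll = (a[:, 0::2] + a[:, 1::2]) / np.sqrt(2)
    lh = (a[:, 0::2] - a[:, 1::2]) / np.sqrt(2)
    hl = (d[:, 0::2] + d[:, 1::2]) / np.sqrt(2)
    hh = (d[:, 0::2] - d[:, 1::2]) / np.sqrt(2)
    return ll, (lh, hl, hh)

def multiscale_targets(img, levels=5, keep=(2, 3, 4, 5)):
    """Recursively decompose `img` and keep the detail coefficients
    from the selected levels as (normalized) reconstruction targets."""
    targets = {}
    ll = img
    for lvl in range(1, levels + 1):
        ll, details = haar_dwt2(ll)
        if lvl in keep:
            stack = np.stack(details)  # shape (3, H/2^lvl, W/2^lvl)
            # Normalize the coefficients, as stated in the paper
            # (the exact normalization scheme is an assumption here).
            stack = (stack - stack.mean()) / (stack.std() + 1e-6)
            targets[lvl] = stack
    return targets

img = np.random.rand(224, 224).astype(np.float32)
tgts = multiscale_targets(img)
print({lvl: t.shape[1] for lvl, t in tgts.items()})
# levels 2..5 give spatial sizes 56, 28, 14, 7
```

Each kept level would then be matched against the prediction from the corresponding encoder stage (stages {2, 4, 22, 24} for Swin, layers {3, 6, 9, 12} for ViT) and weighted by the per-level loss weights {0.8, 0.9, 1.1, 1.2}.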