Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization

Authors: Han Guo, Ramtin Hosseini, Ruiyi Zhang, Sai Ashish Somayajula, Ranak Roy Chowdhury, Rajesh K. Gupta, Pengtao Xie

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning. Compared to existing methods, it demonstrates remarkable improvements across diverse datasets and tasks, showcasing its adaptability and efficiency. Our code is available at https://github.com/Alexiland/MLO-MAE
Researcher Affiliation Academia Han Guo (UC San Diego), Ramtin Hosseini (UC San Diego), Ruiyi Zhang (UC San Diego), Sai Ashish Somayajula (UC San Diego), Ranak Roy Chowdhury (UC San Diego), Rajesh K. Gupta (UC San Diego), Pengtao Xie (UC San Diego)
Pseudocode Yes Algorithm 1 MLO-MAE Optimization Algorithm
Open Source Code Yes Our code is available at https://github.com/Alexiland/MLO-MAE
Open Datasets Yes Our approach outperforms a range of leading-edge methods in learning representations, as evidenced across various datasets such as CIFAR-10, CIFAR-100, and ImageNet-1K. Our method showcases remarkable transfer learning abilities in fine-grained classification, semantic segmentation, and object detection tasks, demonstrated on datasets including CUB-200-2011, Stanford Cars, iNaturalist 2019, ADE20K, and MS-COCO.
Dataset Splits Yes We randomly split the training set of ImageNet by an 80/20 ratio into a new training set and a new validation set. We use the new training set in Stage I and Stage II for training, while using the new validation set in Stage III for training the masking network.
Hardware Specification Yes All experiments were conducted on Nvidia A100 GPUs.
Software Dependencies No The paper mentions using PyTorch, the AdamW optimizer, and MMSegmentation, but does not provide specific version numbers for these software components.
Experiment Setup Yes In our method, the masking network is structured with multiple layers. Initially, there is a linear layer, whose input size is the product of the number of patches (196 for ImageNet and 256 for CIFAR) and the embedding dimension (we used the patch embedding method in ViT, with a dimension of 768), and whose hidden size is 512. This is followed by a ReLU layer. Next, there is another linear layer, which takes an input size of 512 and produces an output size equal to the number of patches. Finally, a sigmoid activation function is applied to the output to generate probabilities in the range of 0 to 1. Implementation details are described in Appendix B.2. Following MAE (He et al., 2022), an asymmetric ViT (Dosovitskiy et al., 2020) encoder-decoder architecture was used for mask reconstruction. Recognizing the constraints of computational resources, we primarily employed ViT-B (Dosovitskiy et al., 2020) as the image encoder, ensuring a balance between efficiency and performance. The classification head consists of a single linear layer; it is intentionally kept simple to focus the evaluation on the effectiveness of the learned representations. The patch size was set to 2 for CIFAR-10 and CIFAR-100, and 16 for ImageNet. For all experiments, unless otherwise specified, we used the default mask ratio of 75% as suggested in MAE (He et al., 2022). The number of unrolling steps in the algorithm for solving the MLO problem was set to 2. We employed the AdamW optimizer (Loshchilov & Hutter, 2017) with β values of 0.9 and 0.95 for optimizing all parameters. The learning rates were set per component: 1e-4 for the image encoder, and 4e-5 for both the classification head and the masking network. We used a batch size of 256. For training, we set the number of epochs to 50 for ImageNet and 200 for the CIFAR datasets. All experiments were conducted on Nvidia A100 GPUs.
Further information on our experimental settings can be found in Appendix B.
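The masking-network architecture described in the Experiment Setup row (a linear layer over flattened patch embeddings, ReLU, a second linear layer, then a sigmoid producing per-patch masking probabilities) can be sketched in PyTorch. This is a minimal sketch based only on the layer sizes stated above; the class and argument names, the flattening step, and the optimizer wiring are assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn


class MaskingNetwork(nn.Module):
    """Sketch of the masking network from the setup description:
    Linear(num_patches * embed_dim -> 512) -> ReLU ->
    Linear(512 -> num_patches) -> Sigmoid.
    Defaults follow the ImageNet configuration (196 patches, dim 768)."""

    def __init__(self, num_patches: int = 196, embed_dim: int = 768,
                 hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_patches * embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_patches),
            nn.Sigmoid(),  # per-patch masking probabilities in [0, 1]
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, embed_dim); flatten
        # the patch and embedding dimensions before the first linear layer
        return self.net(patch_embeddings.flatten(1))


mask_net = MaskingNetwork()
# AdamW settings as reported: betas (0.9, 0.95), lr 4e-5 for the masking network
optimizer = torch.optim.AdamW(mask_net.parameters(), lr=4e-5, betas=(0.9, 0.95))

probs = mask_net(torch.randn(2, 196, 768))  # shape (2, 196)
```

For CIFAR, the same sketch would use `num_patches=256` (patch size 2), per the setup text.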