Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization
Authors: Han Guo, Ramtin Hosseini, Ruiyi Zhang, Sai Ashish Somayajula, Ranak Roy Chowdhury, Rajesh K. Gupta, Pengtao Xie
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning. Compared to existing methods, it demonstrates remarkable improvements across diverse datasets and tasks, showcasing its adaptability and efficiency. Our code is available at https://github.com/Alexiland/MLO-MAE |
| Researcher Affiliation | Academia | Han Guo EMAIL UC San Diego Ramtin Hosseini EMAIL UC San Diego Ruiyi Zhang EMAIL UC San Diego Sai Ashish Somayajula EMAIL UC San Diego Ranak Roy Chowdhury EMAIL UC San Diego Rajesh K. Gupta EMAIL UC San Diego Pengtao Xie EMAIL UC San Diego |
| Pseudocode | Yes | Algorithm 1 MLO-MAE Optimization Algorithm |
| Open Source Code | Yes | Our code is available at https://github.com/Alexiland/MLO-MAE |
| Open Datasets | Yes | Our approach outperforms a range of leading-edge methods in learning representations, as evidenced across various datasets such as CIFAR-10, CIFAR-100, and ImageNet-1K. Our method showcases remarkable transfer learning abilities in fine-grained classification, semantic segmentation, and object detection tasks, demonstrated on datasets including CUB-200-2011, Stanford Cars, iNaturalist 2019, ADE20K, and MS-COCO. |
| Dataset Splits | Yes | We randomly split the training set of ImageNet by a ratio of 80/20 to form the new training set and the new validation set. We use the same new training set in Stage I and Stage II for training, while we use the new validation set in Stage III for training the masking network. |
| Hardware Specification | Yes | All experiments were conducted on Nvidia A100 GPUs. |
| Software Dependencies | No | The paper mentions using PyTorch code and AdamW optimizer, and MMSegmentation, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | In our method, the masking network is structured with multiple layers. Initially, there is a linear layer, where the input size is determined by the product of the number of patches (196 for ImageNet and 256 for CIFAR) and the embedding dimension (we used the patch embedding method in ViT, with a dimension of 768), and it has a hidden size of 512. This is followed by a ReLU layer. Next, there is another linear layer, which takes an input size of 512 and produces an output size equivalent to the number of patches. Finally, a sigmoid activation function is applied to the output to generate probabilities in the range of 0 to 1. Implementation details are described in Appendix B.2. Following MAE (He et al., 2022), an asymmetric ViT (Dosovitskiy et al., 2020) encoder-decoder architecture was used for mask reconstruction. Recognizing the constraints of computational resources, we primarily employed the ViT-B (Dosovitskiy et al., 2020) as the image encoder, ensuring a balance between efficiency and performance. The classification head consists of a single linear layer. It is intentionally made simple to focus on evaluating the effectiveness of the learned representations. The patch size was set to 2 for CIFAR-10 and CIFAR-100, and 16 for ImageNet. For all experiments, unless otherwise specified, we used the default mask ratio of 75% as suggested in MAE (He et al., 2022). The number of unrolling steps in the algorithm for solving the MLO problem was set to 2. We employed the AdamW optimizer (Loshchilov & Hutter, 2017) with β values of 0.9 and 0.95 for optimizing all parameters. The learning rates were set specifically for different components: 1e-4 for the image encoder, and 4e-5 for both the classification head and the masking network. We used a batch size of 256. For training, we set the epoch number to 50 for the ImageNet dataset and to 200 for the CIFAR datasets. All experiments were conducted on Nvidia A100 GPUs. Further information on our experimental settings can be found in Appendix B. |
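The masking network described in the Experiment Setup row (linear layer over flattened patch embeddings, ReLU, a second linear layer back to the number of patches, then a sigmoid) can be sketched in PyTorch. This is a minimal illustration assuming the paper's stated dimensions for ImageNet (196 patches, embedding dimension 768, hidden size 512); the class name `MaskingNetwork` and the input/output shapes are our assumptions, not the authors' released code (see their repository for the actual implementation).

```python
import torch
import torch.nn as nn

class MaskingNetwork(nn.Module):
    """Hypothetical sketch of the MLO-MAE masking network.

    Maps flattened ViT patch embeddings to a per-patch masking
    probability in (0, 1), following the layer sizes quoted in the
    paper's setup: Linear(num_patches * embed_dim -> 512), ReLU,
    Linear(512 -> num_patches), Sigmoid.
    """

    def __init__(self, num_patches: int = 196, embed_dim: int = 768,
                 hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_patches * embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_patches),
            nn.Sigmoid(),  # per-patch probabilities in (0, 1)
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, embed_dim)
        flat = patch_embeddings.flatten(start_dim=1)
        return self.net(flat)

# Example: probabilities for a batch of two ImageNet-sized inputs.
probs = MaskingNetwork()(torch.randn(2, 196, 768))
print(probs.shape)  # torch.Size([2, 196])
```

The reported optimizer settings would then correspond to something like `torch.optim.AdamW(model.parameters(), lr=4e-5, betas=(0.9, 0.95))` for the masking network, with `lr=1e-4` for the image encoder.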