Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization
Authors: Han Guo, Ramtin Hosseini, Ruiyi Zhang, Sai Ashish Somayajula, Ranak Roy Chowdhury, Rajesh K. Gupta, Pengtao Xie
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning. Compared to existing methods, it demonstrates remarkable improvements across diverse datasets and tasks, showcasing its adaptability and efficiency. Our code is available at https://github.com/Alexiland/MLO-MAE |
| Researcher Affiliation | Academia | Han Guo EMAIL UC San Diego Ramtin Hosseini EMAIL UC San Diego Ruiyi Zhang EMAIL UC San Diego Sai Ashish Somayajula EMAIL UC San Diego Ranak Roy Chowdhury EMAIL UC San Diego Rajesh K. Gupta EMAIL UC San Diego Pengtao Xie EMAIL UC San Diego |
| Pseudocode | Yes | Algorithm 1 MLO-MAE Optimization Algorithm |
| Open Source Code | Yes | Our code is available at https://github.com/Alexiland/MLO-MAE |
| Open Datasets | Yes | Our approach outperforms a range of leading-edge methods in learning representations, as evidenced across various datasets such as CIFAR-10, CIFAR-100, and ImageNet-1K. Our method showcases remarkable transfer learning abilities in fine-grained classification, semantic segmentation, and object detection tasks, demonstrated on datasets including CUB-200-2011, Stanford Cars, iNaturalist 2019, ADE20K, and MS-COCO. |
| Dataset Splits | Yes | We randomly split the training set of ImageNet by a ratio of 80/20 to form the new training set and the new validation set. We use the same new training set in Stage I and Stage II for training, while we use the new validation set in Stage III for training the masking network. |
| Hardware Specification | Yes | All experiments were conducted on Nvidia A100 GPUs. |
| Software Dependencies | No | The paper mentions using PyTorch code and AdamW optimizer, and MMSegmentation, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | In our method, the masking network is structured with multiple layers. Initially, there is a linear layer, where the input size is determined by the product of the number of patches (196 for ImageNet and 256 for CIFAR) and the embedding dimension (we used the patch embedding method in ViT, with a dimension of 768), and it has a hidden size of 512. This is followed by a ReLU layer. Next, there is another linear layer, which takes an input size of 512 and produces an output size equivalent to the number of patches. Finally, a sigmoid activation function is applied to the output to generate probabilities in the range of 0 to 1. Implementation details are described in Appendix B.2. Following MAE (He et al., 2022), an asymmetric ViT (Dosovitskiy et al., 2020) encoder-decoder architecture was used for mask reconstruction. Recognizing the constraints of computational resources, we primarily employed the ViT-B (Dosovitskiy et al., 2020) as the image encoder, ensuring a balance between efficiency and performance. The classification head consists of a single linear layer. It is intentionally made simple to focus on evaluating the effectiveness of the learned representations. The patch size was set to 2 for CIFAR-10 and CIFAR-100, and 16 for ImageNet. For all experiments, unless otherwise specified, we used the default mask ratio of 75% as suggested in MAE (He et al., 2022). The number of unrolling steps in the algorithm for solving the MLO problem was set to 2. We employed the AdamW optimizer (Loshchilov & Hutter, 2017) with β values of 0.9 and 0.95 for optimizing all parameters. The learning rates were set specifically for different components: 1e-4 for the image encoder, and 4e-5 for both the classification head and the masking network. We used a batch size of 256. For training, we set the epoch number to 50 for the ImageNet dataset and to 200 for the CIFAR datasets. All experiments were conducted on Nvidia A100 GPUs. Further information on our experimental settings can be found in Appendix B. |
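The masking network described in the Experiment Setup row (linear layer over flattened patch embeddings, ReLU, a second linear layer back to the number of patches, then a sigmoid) can be sketched in PyTorch. This is a minimal illustration assuming the paper's stated dimensions for ImageNet (196 patches, embedding dimension 768, hidden size 512); the class name `MaskingNetwork` and the input/output shapes are our assumptions, not the authors' released code (see their repository for the actual implementation).

```python
import torch
import torch.nn as nn

class MaskingNetwork(nn.Module):
    """Hypothetical sketch of the MLO-MAE masking network.

    Maps flattened ViT patch embeddings to a per-patch masking
    probability in (0, 1), following the layer sizes quoted in the
    paper's setup: Linear(num_patches * embed_dim -> 512), ReLU,
    Linear(512 -> num_patches), Sigmoid.
    """

    def __init__(self, num_patches: int = 196, embed_dim: int = 768,
                 hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_patches * embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_patches),
            nn.Sigmoid(),  # per-patch probabilities in (0, 1)
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, embed_dim)
        flat = patch_embeddings.flatten(start_dim=1)
        return self.net(flat)

# Example: probabilities for a batch of two ImageNet-sized inputs.
probs = MaskingNetwork()(torch.randn(2, 196, 768))
print(probs.shape)  # torch.Size([2, 196])
```

The reported optimizer settings would then correspond to something like `torch.optim.AdamW(model.parameters(), lr=4e-5, betas=(0.9, 0.95))` for the masking network, with `lr=1e-4` for the image encoder.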