Morphing Tokens Draw Strong Masked Image Models

Authors: Taekyung Kim, Byeongho Heo, Dongyoon Han

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on ImageNet-1K and ADE20K demonstrate DTM's superiority, surpassing complex state-of-the-art MIM methods. Furthermore, the evaluation of transfer learning on downstream tasks like iNaturalist, along with extensive empirical studies, supports DTM's effectiveness.
Researcher Affiliation | Industry | Taekyung Kim, Byeongho Heo, Dongyoon Han — NAVER AI Lab
Pseudocode | Yes | Algorithm 1: Token Morphing Function (φ_R)
 1: input: token representations {v_i}_{i=1}^N, iteration k, scheduler R = {r_p}_{p=1}^k
 2: define n ← N
 3: define v_i^0 ← v_i for i ∈ [1, N]
 4: for p ∈ {1, ..., k} do  # k-iterative morphing
 5:   M^p ← BipartiteMatching(v^p, n)
 6:   M^p_{ij} ← M^p_{ij} / Σ_{j'=1}^n M^p_{ij'} for all i, j  # Normalize
 7:   v^{p+1}_i ← Σ_{j=1}^n M^p_{ij} v^p_j for i ∈ [1, n − r_p]  # Morph matched tokens
 8:   n ← n − r_p
 9: return M = Π_{p=1}^k M^p
10: function BipartiteMatching(v^p, n)  # Standard bipartite matching algorithm
11:   (S^p_1, S^p_2) ← random_split([1, 2, ..., n])  # Split for bipartite matching
12:   sim ← [Sim(v^p_i, v^p_j) for (i, j) ∈ S^p_1 × S^p_2]  # Measure similarity
13:   σ ← sort(sim, order=descending)[r_p]  # Threshold for top-r_p similarity
14:   M^p_{ij} ← 1; M^p ← M^p \ M^p_{·j} s.t. Sim(v^p_i, v^p_j) ≥ σ, (i, j) ∈ S^p_1 × S^p_2
15:   return M^p
16: end function
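The morphing loop of Algorithm 1 can be sketched in Python. This is a simplified illustration, not the authors' implementation: the function names, the greedy cosine-similarity matching, and the rule of averaging each matched pair are assumptions made for clarity.

```python
import numpy as np

def bipartite_match(v, r, rng):
    # Randomly split the n current tokens into two sets and return the
    # r most similar (source, target) pairs across the split.
    n = v.shape[0]
    idx = rng.permutation(n)
    a, b = idx[: n // 2], idx[n // 2:]
    vn = v / np.linalg.norm(v, axis=1, keepdims=True)
    sim = vn[a] @ vn[b].T                     # pairwise cosine similarity
    best = sim.argmax(axis=1)                 # best match in b for each a-token
    top = np.argsort(-sim.max(axis=1))[:r]    # keep the top-r most similar pairs
    return [(a[i], b[best[i]]) for i in top]

def token_morphing(v, schedule, seed=0):
    # Sketch of Algorithm 1: at each iteration p, merge the r_p most similar
    # token pairs by averaging, shrinking N tokens to N - sum(schedule).
    rng = np.random.default_rng(seed)
    v = v.copy()
    for r in schedule:
        drop = []
        for src, dst in bipartite_match(v, r, rng):
            v[dst] = 0.5 * (v[src] + v[dst])  # morph: average the matched pair
            drop.append(src)
        v = np.delete(v, drop, axis=0)        # matched sources are absorbed
    return v

tokens = np.random.default_rng(1).standard_normal((8, 4))
morphed = token_morphing(tokens, schedule=[2, 2])
print(morphed.shape)  # (4, 4)
```

Each iteration removes exactly r_p tokens, so a schedule R = {r_p} controls how aggressively the token count is reduced, matching the scheduler role in the pseudocode.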
Open Source Code | Yes | Code is available at https://github.com/naver-ai/dtm.
Open Datasets | Yes | Experiments on ImageNet-1K and ADE20K demonstrate DTM's superiority... The effectiveness of our method is supported by accelerated fine-tuning trends after DTM pre-training, which highlights how spatially consistent targets are crucial. Our method shows further transferability on the iNaturalist (Van Horn et al., 2018) and fine-grained visual classification datasets (Van Horn et al., 2015; Krizhevsky, 2009; Khosla et al., 2011).
Dataset Splits | Yes | Fine-tuning on ImageNet-1K. We fine-tune our pre-trained models on ImageNet-1K (Russakovsky et al., 2015) by default, following the standard protocol (He et al., 2022; Peng et al., 2022). Fine-tuning on ADE20K. Table K summarizes the fine-tuning recipe of ViT/16 for the semantic segmentation task on ADE20K (Zhou et al., 2017). Transfer learning. We follow the fine-tuning recipes for DTM to conduct transfer learning to iNaturalist datasets... and FGVC datasets...
Hardware Specification | Yes | The model is fine-tuned using 8 V100-32GB GPUs.
Software Dependencies | No | We train our framework with ViT-S/16, ViT-B/16, and ViT-L/16 for 300 epochs using AdamW with momentum (0.9, 0.98) and a batch size of 1024. ... We adopt commonly used values for RandAugment, Mixup, CutMix, and label smoothing.
Experiment Setup | Yes | Table I reports the implementation details for pre-training. We train our framework with ViT-S/16, ViT-B/16, and ViT-L/16 for 300 epochs using AdamW with momentum (0.9, 0.98) and a batch size of 1024. We use a learning rate of 1.5 × 10⁻⁴ with cosine decay and a 10-epoch warmup. ... Fine-tuning on ImageNet-1K. We fine-tune our pre-trained models on ImageNet-1K (Russakovsky et al., 2015) by default, following the standard protocol (He et al., 2022; Peng et al., 2022). Specifically, pre-trained ViT-S/-B/-L are fine-tuned for 300, 100, and 50 epochs, respectively. Optimization is performed with AdamW using a weight decay of 0.05. We use a layer-wise learning rate decay of 0.6 for ViT-S and ViT-B and 0.8 for ViT-L. The learning rate is set to 5 × 10⁻⁴ with a linear warmup of 10 epochs for ViT-S and ViT-B and 5 epochs for ViT-L. We adopt commonly used values for RandAugment, Mixup, CutMix, and label smoothing.
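The layer-wise learning-rate decay in the fine-tuning recipe can be sketched as follows. The helper name and the convention that the topmost block receives the full base rate are assumptions for illustration; the paper only specifies the base rate and decay factor.

```python
def layerwise_lrs(base_lr, num_blocks, decay):
    # Per-block learning rates under layer-wise decay: the top (last)
    # transformer block gets base_lr; each earlier block is scaled by
    # an extra factor of `decay`, so early layers change more slowly.
    return [base_lr * decay ** (num_blocks - i) for i in range(num_blocks + 1)]

# ViT-B/16 fine-tuning values from the recipe: 12 blocks, lr 5e-4, decay 0.6
lrs = layerwise_lrs(5e-4, 12, 0.6)
print(len(lrs))  # 13 rates (embedding level plus 12 blocks)
```

With decay 0.6, the earliest parameters are fine-tuned at 0.6¹² ≈ 0.002 of the base rate, which is why the pre-trained low-level features are largely preserved during fine-tuning.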