Masked Generative Nested Transformers with Decode Time Scaling

Authors: Sahil Goyal, Debapriya Tula, Gagan Jain, Pradeep Shenoy, Prateek Jain, Sujoy Paul

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We rigorously experiment with ImageNet 256x256, ImageNet 128x128, UCF101, and Kinetics600 to showcase the efficacy of the proposed method for image/video generation and frame prediction. Our experiments show that with almost 3x less compute than the baseline, our model obtains competitive performance. (Section 5, Experiments and Results)
Researcher Affiliation | Collaboration | 1Google DeepMind, 2University of California, Los Angeles. Correspondence to: Sahil Goyal <EMAIL>, Sujoy Paul <EMAIL>.
Pseudocode | Yes | Algorithm 1: MaGNeTS Decoding Algorithm
Open Source Code | No | The paper does not explicitly state that the authors are releasing their code for the methodology described in this paper, nor does it provide a link to a code repository. It mentions using pretrained tokenizers from other works but not their own code.
Open Datasets | Yes | Datasets. We evaluate our model on ImageNet 256x256 and ImageNet 128x128 (Deng et al., 2009) for image generation, UCF101 (Soomro et al., 2012) for video generation, and Kinetics600 (Carreira et al., 2018) for frame prediction (5-frame conditioning).
Dataset Splits | No | The paper evaluates on well-known datasets (ImageNet, UCF101, Kinetics600) that have standard splits, but it does not explicitly state the training/validation/test splits used, nor whether the standard splits were followed, with explicit percentages or counts. The text only says: 'We train our model for 270 epochs for all the experiments.' and 'We drop input class condition labels for 10% of the training batches in image generation'.
Hardware Specification | Yes | All experiments are run on a single A100 GPU. ... We implement MaGNeTS on a single TPUv5 chip.
Software Dependencies | No | The paper mentions using a 'BERT model (Devlin et al., 2019) as a transformer backbone' and 'pretrained tokenizers from MaskGIT (Chang et al., 2022) ... and MAGVIT (Yu et al., 2023a)'. However, it does not provide specific version numbers for these or any other software libraries or frameworks used.
Experiment Setup | Yes | We train our model for 270 epochs for all the experiments. ... We drop input class condition labels for 10% of the training batches in image generation. ... We mention the details of sampling hyperparameters in Appendix B. ... Table 9: Best Sampling Hyperparameters. ... We use bias=0.5 and scale=0.8 for all experiments.
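The decoding procedure referenced above (Algorithm 1) builds on MaskGIT-style iterative parallel decoding, with the paper's "decode time scaling" idea of spending less compute on early steps. The sketch below is a minimal illustration of that general pattern, not a reproduction of the paper's Algorithm 1: the `predict` interface, the `capacities` schedule selecting a nested sub-model per step, the cosine unmasking schedule, and the `MASK` sentinel are all simplifying assumptions made here for demonstration.

```python
import math

MASK = -1  # hypothetical sentinel for a masked token position


def cosine_schedule(step, total_steps, num_tokens):
    """MaskGIT-style cosine schedule: tokens still masked after this step."""
    frac = math.cos(math.pi / 2 * (step + 1) / total_steps)
    return int(num_tokens * frac)


def nested_decode(predict, num_tokens=16, total_steps=4,
                  capacities=(0.25, 0.5, 0.75, 1.0)):
    """Iterative parallel decoding with a growing (nested) model capacity.

    `predict(tokens, capacity)` is assumed to return a (token, confidence)
    pair for every position; `capacity` notionally selects which nested
    sub-model runs that step (small models early, full model late).
    """
    tokens = [MASK] * num_tokens
    for step in range(total_steps):
        capacity = capacities[min(step, len(capacities) - 1)]
        preds = predict(tokens, capacity)
        # Keep already-committed tokens (infinite confidence), fill the rest.
        cand = [(p if t == MASK else (t, float("inf")))
                for t, p in zip(tokens, preds)]
        # Re-mask the least confident positions for the next iteration.
        keep_masked = cosine_schedule(step, total_steps, num_tokens)
        order = sorted(range(num_tokens), key=lambda i: cand[i][1])
        tokens = [cand[i][0] for i in range(num_tokens)]
        for i in order[:keep_masked]:
            tokens[i] = MASK
    return tokens  # fully unmasked: cosine schedule hits 0 on the last step


def dummy_predict(tokens, capacity):
    """Stand-in model: always predicts token 7 with position-varying confidence."""
    return [(7, (i * 31 % 17) / 17.0) for i in range(len(tokens))]


decoded = nested_decode(dummy_predict)
```

On the final step the cosine schedule returns zero masked tokens, so the loop always terminates with a complete sequence regardless of the capacity schedule chosen.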