Efficient Object-Centric Representation Learning using Masked Generative Modeling

Authors: Akihiro Nakano, Masahiro Suzuki, Yutaka Matsuo

TMLR 2025

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental
LLM response: "Experiments show that MOGENT significantly improves computational efficiency, accelerating the generation process by up to 67x and 17x compared to autoregressive and diffusion-based models, respectively. Importantly, this efficiency is attained while maintaining strong or competitive performance on object segmentation and compositional generation tasks."
Researcher Affiliation: Academia
LLM response: "Akihiro Nakano EMAIL, Graduate School of Engineering, The University of Tokyo"
Pseudocode: Yes
LLM response: "For inference, we use the iterative parallel decoding scheme of MaskGIT. We start with a blank canvas with all tokens masked out and iterate the following procedure for T steps: (1) Predict the probabilities for all masked tokens at step t, z̄_t = z ⊙ m_t. (2) Sample a token based on the predicted probabilities. (3) Compute the number of tokens to mask using the mask scheduler function. (4) Decide which tokens to unmask for the next iteration, z_t, using the schedule from (3) and the log probabilities from (1) as confidence scores."
Open Source Code: No
LLM response: "The paper does not contain an explicit statement about releasing the source code for MOGENT, nor does it provide a direct link to a code repository for its method."
Open Datasets: Yes
LLM response: "We evaluate on four datasets with distinct characteristics: 3D Shapes dataset (Burgess & Kim, 2018), CLEVR dataset (Johnson et al., 2017), CLEVRTex dataset (Karazija et al., 2021), and CelebA dataset (Liu et al., 2015)."
Dataset Splits: No
LLM response: "The paper mentions that the '3D Shapes dataset consists of 400K training images' but does not provide explicit training/validation/test splits for all datasets, nor does it cite predefined splits for all of them."
Hardware Specification: Yes
LLM response: "All metrics were computed on a single NVIDIA Tesla V100 GPU, with a batch size of 64 for training and 1 for testing."
Software Dependencies: No
LLM response: "The paper mentions using the 'Adam optimizer (Kingma, 2015)' but does not specify software dependencies such as Python, PyTorch, or TensorFlow with version numbers."
Experiment Setup: Yes
LLM response: "The hyperparameters used for our experiments are reported in Table 8 and Table 9. We used a fixed learning rate of 3e-4 for the DVAE and a learning rate of 1e-4 with linear warmup for stable learning."
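
The MaskGIT-style iterative parallel decoding quoted under Pseudocode above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the model interface `predict_probs`, the vocabulary size, the placeholder mask token id 0, and the cosine mask scheduler are assumptions (cosine is MaskGIT's default scheduler, but the report does not state which scheduler MOGENT uses).

```python
import math
import numpy as np

def cosine_schedule(r: float) -> float:
    # Fraction of tokens still masked at decoding progress r in [0, 1].
    return math.cos(math.pi / 2 * r)

def iterative_parallel_decode(predict_probs, num_tokens, T=8, rng=None):
    """Sketch of iterative parallel decoding over a 1-D token canvas.

    `predict_probs(tokens, mask)` stands in for the transformer forward
    pass: it returns a (num_tokens, vocab_size) array of per-token
    probability distributions.
    """
    rng = rng or np.random.default_rng(0)
    tokens = np.zeros(num_tokens, dtype=np.int64)    # 0 = placeholder for masked
    mask = np.ones(num_tokens, dtype=bool)           # blank canvas: all masked
    for t in range(T):
        probs = predict_probs(tokens, mask)                        # (1) predict
        sampled = np.array([rng.choice(probs.shape[1], p=p)
                            for p in probs])                       # (2) sample
        conf = np.log(probs[np.arange(num_tokens), sampled])       # log-prob confidence
        conf[~mask] = np.inf            # already-decoded tokens are never re-masked
        n_mask = int(cosine_schedule((t + 1) / T) * num_tokens)    # (3) schedule
        order = np.argsort(conf)        # least-confident tokens stay masked
        new_mask = np.zeros(num_tokens, dtype=bool)
        new_mask[order[:n_mask]] = True
        unmask_now = mask & ~new_mask                              # (4) unmask
        tokens[unmask_now] = sampled[unmask_now]
        mask = new_mask
    return tokens
```

Because the scheduler reaches zero at the final step, every position is decoded after T iterations, which is what yields the constant number of forward passes independent of sequence length.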
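
The quoted setup pairs a fixed learning rate for the DVAE with linear warmup elsewhere. A minimal sketch of a linear warmup schedule, assuming a hypothetical warmup length (the report quotes the peak rates but not the number of warmup steps):

```python
def warmup_lr(step: int, base_lr: float = 1e-4, warmup_steps: int = 10_000) -> float:
    """Linearly ramp the learning rate from 0 to base_lr over warmup_steps,
    then hold it constant. warmup_steps is an assumed placeholder value."""
    return base_lr * min(1.0, step / warmup_steps)
```

Halfway through warmup the rate is half the peak, e.g. `warmup_lr(5_000)` returns 5e-5 under these assumed defaults; ramping up from zero like this is a common way to stabilize early transformer training, matching the report's "linear warmup for stable learning".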