Efficient Object-Centric Representation Learning using Masked Generative Modeling
Authors: Akihiro Nakano, Masahiro Suzuki, Yutaka Matsuo
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that MOGENT significantly improves computational efficiency, accelerating generation by up to 67x over autoregressive models and 17x over diffusion-based models. Importantly, this efficiency is attained while maintaining strong or competitive performance on object segmentation and compositional generation tasks. |
| Researcher Affiliation | Academia | Akihiro Nakano, Graduate School of Engineering, The University of Tokyo |
| Pseudocode | Yes | For inference, we use the iterative parallel decoding scheme of MaskGIT. We start with a blank canvas with all tokens masked out and repeat the following procedure for T steps: (1) predict the probabilities for all masked tokens at step t; (2) sample a token at each masked position from the predicted probabilities; (3) compute the number of tokens to mask for the next step using the mask scheduler function; (4) decide which tokens to unmask for the next iteration using the schedule from (3), with the log probabilities from (1) as confidence scores. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for MOGENT, nor does it provide a direct link to a code repository for their methodology. |
| Open Datasets | Yes | We evaluate on four datasets with distinct characteristics: 3D Shapes dataset (Burgess & Kim, 2018), CLEVR dataset (Johnson et al., 2017), CLEVRTex dataset (Karazija et al., 2021), and CelebA dataset (Liu et al., 2015). |
| Dataset Splits | No | The paper mentions '3D Shapes dataset consists of 400K training images', but does not provide explicit details for training, validation, and test splits for all datasets, nor does it cite predefined splits used for all of them. |
| Hardware Specification | Yes | All metrics were computed on a single NVIDIA Tesla V100 GPU, with batch size of 64 for training and 1 for test. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer (Kingma, 2015)' but does not specify software dependencies like Python, PyTorch, or TensorFlow with their respective version numbers. |
| Experiment Setup | Yes | The hyperparameters used for our experiments are reported in Table 8 and Table 9. We used a fixed learning rate of 3e-4 for the DVAE and a learning rate of 1e-4 with linear warmup for stable learning. |
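The four-step decoding loop quoted in the Pseudocode row above can be sketched in NumPy. This is a minimal illustration of MaskGIT-style iterative parallel decoding, not the authors' implementation; the `predict_logits` callable, the vocabulary size, and the cosine mask schedule are assumptions for the sketch.

```python
import math
import numpy as np

MASK = -1  # sentinel id for a masked token (assumed, not from the paper)

def cosine_schedule(t, T):
    """Fraction of tokens left masked after step t (MaskGIT's cosine schedule)."""
    return math.cos(math.pi / 2 * (t + 1) / T)

def iterative_decode(predict_logits, num_tokens, T, rng=None):
    """Sketch of MaskGIT-style parallel decoding.

    predict_logits: callable(tokens, mask) -> (num_tokens, vocab_size) logits
    for every position; a hypothetical stand-in for the trained transformer.
    """
    rng = rng or np.random.default_rng(0)
    tokens = np.full(num_tokens, MASK)                 # blank canvas, all masked
    for t in range(T):
        mask = tokens == MASK
        logits = predict_logits(tokens, mask)          # (1) predict probabilities
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        sampled = np.array([rng.choice(len(p), p=p) for p in probs])  # (2) sample
        conf = np.log(probs[np.arange(num_tokens), sampled])
        conf = np.where(mask, conf, np.inf)            # never re-mask decoded tokens
        n_mask = int(math.floor(cosine_schedule(t, T) * num_tokens))  # (3) schedule
        candidate = np.where(mask, sampled, tokens)
        if n_mask > 0:                                 # (4) re-mask lowest confidence
            candidate[np.argsort(conf)[:n_mask]] = MASK
        tokens = candidate
    return tokens
```

Because `cosine_schedule(T - 1, T)` is (numerically) zero, the final iteration unmasks every remaining token, so the loop always terminates with a fully decoded sequence.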