MaskBit: Embedding-free Image Generation via Bit Tokens

Authors: Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we undertake a systematic step-by-step study to elucidate the architectural design and training process necessary to create a modernized VQGAN model, referred to as VQGAN+. We provide a detailed ablation of key components in the VQGAN design, and propose several changes to them, including model and discriminator architecture, perceptual loss, and training recipe. ... We evaluate the proposed MaskBit on class-conditional image generation. ... Tab. 1 summarizes the generation results on ImageNet 256x256.
Researcher Affiliation | Collaboration | Mark Weber (Technical University of Munich, MCML), Lijun Yu (Carnegie Mellon University), Qihang Yu (ByteDance), Xueqing Deng (ByteDance), Xiaohui Shen (ByteDance), Daniel Cremers (Technical University of Munich, MCML), Liang-Chieh Chen (ByteDance)
Pseudocode | No | The paper describes methods in prose and provides architectural diagrams (Figure 2, Figure 4, Figure 6), but does not contain any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | The code for this project is available at https://github.com/markweberdev/maskbit.
Open Datasets | Yes | ImageNet: ImageNet (Deng et al., 2009) is one of the most popular benchmarks in computer vision. It has been used to benchmark image classification, class-conditional image generation, and more. License: custom, non-commercial (https://image-net.org/accessagreement). Dataset website: https://image-net.org/ ... We present additional reconstruction results from the bit flipping analysis. ... Furthermore, we repeat this experiment in a zero-shot manner on COCO (Lin et al., 2014) with the same model only trained on ImageNet.
Dataset Splits | Yes | We follow standard practices to train and evaluate the network on ImageNet (Deng et al., 2009). ... The reconstruction FID (rFID) is computed against the validation split of ImageNet at a resolution of 256. ... Specifically, the network generates a total of 50,000 samples for the 1,000 ImageNet (Deng et al., 2009) classes.
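The 50,000-sample budget quoted above follows the standard class-conditional evaluation protocol: samples are allocated evenly, 50 per class. A minimal sketch of that allocation (variable names are illustrative, not from the MaskBit codebase):

```python
# Standard ImageNet class-conditional evaluation protocol:
# generate a fixed total number of samples, spread evenly over all classes.
NUM_CLASSES = 1000
TOTAL_SAMPLES = 50_000

samples_per_class = TOTAL_SAMPLES // NUM_CLASSES  # 50 per class

# Class labels to condition the generator on: 50 copies of each class id.
labels = [c for c in range(NUM_CLASSES) for _ in range(samples_per_class)]
assert len(labels) == TOTAL_SAMPLES
```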
Hardware Specification | Yes | We use 32 A100 GPUs for training Stage-I models. ... Stage-II models are trained with 64 A100 GPUs and take 4.2 days for the longest schedule (1.35M iterations).
Software Dependencies | No | The paper mentions 'Optimizer: AdamW (Loshchilov & Hutter, 2019)' but does not specify version numbers for any software libraries (e.g., PyTorch, TensorFlow) or programming languages.
Experiment Setup | Yes | B.1 Stage-I ... Base channels: 128 ... Discriminator loss weight: 0.02 ... Perceptual loss weight: 0.1 ... Optimizer: AdamW (Loshchilov & Hutter, 2019) ... Base LR: 1e-4 ... Training iterations: 1,350,000 ... Total batch size: 256. B.2 Stage-II ... Hidden dimension: 1024 ... Attention heads: 16 ... MLP dimension: 4096 ... Dropout: 0.1 ... Class label dropout: 0.1 ... Label smoothing: 0.1 ... Base LR: 1e-4 ... Training iterations: 1,350,000 ... Total batch size: 1024
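For reference, the hyperparameters quoted above can be gathered into a plain Python config. Only values stated in the source are included (elided "..." entries are left out), and the dictionary layout and key names are illustrative, not taken from the released MaskBit code:

```python
# Hyperparameters as reported in Appendix B of the paper.
# Stage-I: VQGAN+ tokenizer training.
STAGE_I = {
    "base_channels": 128,
    "discriminator_loss_weight": 0.02,
    "perceptual_loss_weight": 0.1,
    "optimizer": "AdamW",
    "base_lr": 1e-4,
    "training_iterations": 1_350_000,
    "total_batch_size": 256,
}

# Stage-II: masked transformer training on bit tokens.
STAGE_II = {
    "hidden_dimension": 1024,
    "attention_heads": 16,
    "mlp_dimension": 4096,
    "dropout": 0.1,
    "class_label_dropout": 0.1,
    "label_smoothing": 0.1,
    "base_lr": 1e-4,
    "training_iterations": 1_350_000,
    "total_batch_size": 1024,
}
```

Note that both stages share the same base learning rate and iteration count; only the batch size and architecture-specific settings differ.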