Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis

Authors: Jinbin Bai, Tian Ye, Wei Chow, Enxin Song, Qing-Guo Chen, Xiangtai Li, Zhen Dong, Lei Zhu, Shuicheng YAN

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments validate Meissonic's capabilities, demonstrating its potential as a new standard in text-to-image synthesis. We evaluate Meissonic using various qualitative and quantitative metrics, including HPS, MPS, GenEval benchmarks, and GPT-4o assessments, demonstrating its superior performance and efficiency.
Researcher Affiliation | Collaboration | 1 National University of Singapore, 2 Skywork AI, 3 HKUST(GZ), 4 HKUST, 5 UC Berkeley, 6 ZJU
Pseudocode | No | The paper describes methods and architectures but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/viiika/Meissonic
Open Datasets | Yes | We curated the deduplicated LAION-2B dataset by filtering out images with aesthetic scores below 4.5, watermark probabilities exceeding 50%, and other criteria outlined in Kolors (2024). This meticulous selection resulted in approximately 200 million images, which were employed for training at a resolution of 256×256 in this initial stage. Additionally, we incorporate 1.2 million synthetic image-text pairs with refined captions exceeding 50 words, primarily derived from publicly available high-quality synthetic datasets... For image editing tasks, we benchmarked Meissonic against state-of-the-art models using the EMU-Edit dataset (Sheynin et al., 2024).
Dataset Splits | No | The paper describes training on various datasets in stages, mentioning the number of images or image-text pairs used for each stage (e.g., "approximately 200 million images", "around 10 million image-text pairs", "approximately 6 million samples"), but it does not explicitly provide training/validation/test splits for these datasets. For the EMU-Edit dataset, it states "we randomly sampled 500 examples per benchmark for testing", but this is for specific benchmark evaluation, not for the main model training splits.
Hardware Specification | Yes | Meissonic is trained in approximately 48 H100 GPU days... FP16 Tensor Core throughput of the A100 is 312 TFLOPS and of the H100 is 756.5 TFLOPS; GPU hours are adjusted from 48 H100 days based on this rate. Inference time is assessed using an A100 GPU with fp16 models. On A6000 GPUs (48 GB), the execution of MagicBrush (Zhang et al., 2024a) took approximately 36 hours for SD1.5 and 60 hours for SDXL.
Software Dependencies | No | The paper mentions using specific models (e.g., the CLIP-ViT-H-14 text encoder) and methods (e.g., classifier-free guidance, cross-entropy loss, the training strategy from LLaMA, Touvron et al. (2023)), but it does not specify software libraries or frameworks with the version numbers required for reproduction.
Experiment Setup | Yes | First, we train Meissonic-256 with a batch size of 2,048 for 100,000 steps. Second, we continue training Meissonic-512 with a batch size of 512 for an additional 100,000 steps. Third, we continue training Meissonic with a batch size of 256 for 42,000 steps at a resolution of 1024×1024. All experiments are carried out with a fixed learning rate of 1×10⁻⁴. All inferences in this paper are performed with CFG = 9 and 48 steps.
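The dataset-curation criteria quoted in the "Open Datasets" row (drop images with aesthetic score below 4.5 or watermark probability above 50%) can be sketched as a simple metadata filter. This is an illustrative sketch only, not the authors' pipeline; the record field names (`aesthetic_score`, `watermark_prob`, `url`) are assumptions.

```python
# Hypothetical sketch of the LAION-style filtering criteria quoted above.
# Field names are assumed, not taken from the paper's code.

def keep_image(record: dict) -> bool:
    """Keep an image only if it clears both quality thresholds."""
    return (
        record.get("aesthetic_score", 0.0) >= 4.5     # drop aesthetic < 4.5
        and record.get("watermark_prob", 1.0) <= 0.5  # drop watermark > 50%
    )

records = [
    {"url": "a.jpg", "aesthetic_score": 5.2, "watermark_prob": 0.10},
    {"url": "b.jpg", "aesthetic_score": 3.9, "watermark_prob": 0.10},  # low aesthetic
    {"url": "c.jpg", "aesthetic_score": 6.0, "watermark_prob": 0.80},  # likely watermark
]
kept = [r["url"] for r in records if keep_image(r)]
print(kept)  # ['a.jpg']
```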
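The GPU-hour adjustment mentioned in the "Hardware Specification" row is a straightforward throughput rescaling: H100 training time is converted to A100-equivalent time by the ratio of FP16 Tensor Core TFLOPS, using the figures quoted in the paper (312 for A100, 756.5 for H100). A minimal sketch of that arithmetic:

```python
# Rescale H100 GPU days to A100-equivalent GPU days by the FP16 Tensor Core
# throughput ratio quoted in the paper.
H100_TFLOPS = 756.5
A100_TFLOPS = 312.0
h100_gpu_days = 48

a100_equiv_days = h100_gpu_days * H100_TFLOPS / A100_TFLOPS
print(f"{a100_equiv_days:.1f} A100-equivalent GPU days")  # ~116.4
```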
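The three-stage schedule quoted in the "Experiment Setup" row can be written out as a small configuration table; summing batch size times steps per stage gives the total number of training samples seen. The structure and field names below are my own illustration, not the authors' training code (the paper does not name the third stage, so "Meissonic-1024" is an assumed label).

```python
# Illustrative config for the three-stage schedule quoted above.
# Stage names and field names are assumptions for readability.
TRAINING_STAGES = [
    {"name": "Meissonic-256",  "resolution": 256,  "batch_size": 2048, "steps": 100_000},
    {"name": "Meissonic-512",  "resolution": 512,  "batch_size": 512,  "steps": 100_000},
    {"name": "Meissonic-1024", "resolution": 1024, "batch_size": 256,  "steps": 42_000},
]
LEARNING_RATE = 1e-4   # fixed across all stages
CFG_SCALE = 9          # classifier-free guidance scale at inference
INFERENCE_STEPS = 48   # sampling steps at inference

total_samples = sum(s["batch_size"] * s["steps"] for s in TRAINING_STAGES)
print(f"{total_samples:,} samples seen across all stages")  # 266,752,000
```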