MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

Authors: Wanggui He, Siming Fu, Mushui Liu, Xierui Wang, Wenyi Xiao, Fangxun Shu, Yi Wang, Lei Zhang, Zhelun Yu, Haoyuan Li, Ziwei Huang, Leilei Gan, Hao Jiang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The framework's performance is verified across an array of evaluative measures, i.e., the MS-COCO benchmark, T2I-CompBench, and human evaluation.

Performance Comparisons and Analysis

Evaluation Benchmarks. We select three benchmarks for comparison, including the MSCOCO dataset (Lin et al. 2014) and T2I-CompBench (Huang et al. 2023).

MSCOCO Benchmark. We use the Frechet Inception Distance (FID) to evaluate the quality of synthesized images. As shown in Tab. 2, our proposed MARS, with only 7B trainable parameters, scores 6.92 on FID, which is a notable achievement. Compared to the auto-regressive counterpart Parti, we use fewer parameters (14B vs. 20B) and smaller data sizes (0.2B vs. 4.8B) while achieving competitive performance (6.92 vs. 7.22). Against the diffusion model SDv1.5, we achieve superior performance (6.92 vs. 9.22) with a smaller training budget (587 vs. 6,250 A100 GPU days).

T2I-CompBench Performance. The empirical data presented in Tab. 1 delineate the superior performance of our proposed MARS on the T2I-CompBench benchmark, underscoring its proficiency in attribute binding, delineation of object relationships, and the synthesis of intricate compositions.

Ablation Study: A Closer Look at SemVIE. During Stage-I training, we aimed to optimize the alignment of the visual and linguistic modalities by employing both text-to-image (text2image) and image-to-text (image2text) pre-training tasks. However, the shared-parameter design led to the logit-drift problem, as described in Chameleon and Unified-IO-2. This issue arises from the intrinsic disparities between modalities and is evidenced by detrimental outcomes, including a 1.89-point degradation in FID, as shown in Tab. 3. To mitigate this, we introduced a specialized Visual Expert so that the text and visual modalities do not share parameters. This approach effectively alleviated the problem, as observed in our training results (Fig. 5). The introduction of the Visual Expert underscores the necessity for specialized architectures adept at managing the inherent challenges of multi-modal data integration.
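For reference, the FID metric cited in these results compares Gaussian statistics of Inception features extracted from real and generated images. A minimal sketch of the computation, assuming the feature means and covariances have already been estimated (`frechet_inception_distance` is a hypothetical helper name, not code from the paper):

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(mu1, sigma1, mu2, sigma2):
    """FID between two Gaussians fitted to Inception features:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    diff = mu1 - mu2
    # Matrix square root of the covariance product; discard tiny
    # imaginary components introduced by numerical error.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```

Identical statistics give an FID of 0; lower scores mean the generated-image feature distribution is closer to the real one, which is why MARS's 6.92 beats SDv1.5's 9.22.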
Researcher Affiliation | Collaboration | Wanggui He1*, Siming Fu1*, Mushui Liu2*, Xierui Wang2, Wenyi Xiao2, Fangxun Shu1, Yi Wang2, Lei Zhang2, Zhelun Yu3, Haoyuan Li1, Ziwei Huang2, Leilei Gan2, Hao Jiang1. 1 Alibaba Group, 2 Zhejiang University, 3 Fudan University.
Pseudocode | No | The paper describes the methodology with mathematical equations and textual explanations of the architecture (SemVIE, Attention-MoE, FFN-MoE) and training strategy, but does not include any explicitly labeled pseudocode or algorithm blocks.
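Although the paper gives no pseudocode, the FFN-MoE idea it describes — separate feed-forward experts for text and image tokens so the two modalities share no FFN parameters — can be sketched as a toy NumPy illustration. All sizes, names, and the ReLU MLP shape below are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16  # hypothetical hidden and FFN widths

def make_ffn():
    # One two-layer feed-forward expert with its own parameters.
    w1 = rng.normal(size=(D, H))
    w2 = rng.normal(size=(H, D))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2  # ReLU MLP

# Separate experts: text and visual tokens never share FFN weights.
text_ffn, visual_ffn = make_ffn(), make_ffn()

def moe_ffn(tokens, is_image):
    """Route each token to the expert of its modality."""
    out = np.empty_like(tokens)
    out[~is_image] = text_ffn(tokens[~is_image])
    out[is_image] = visual_ffn(tokens[is_image])
    return out
```

Because routing is decided purely by modality, the "logit drift" caused by forcing one set of FFN weights to serve both token distributions is avoided, which is the motivation the ablation study gives for the Visual Expert.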
Open Source Code | No | The paper does not contain an explicit statement about releasing code, nor does it provide any links to a code repository or mention code in supplementary materials.
Open Datasets | Yes | Evaluation Benchmarks. We select three benchmarks for comparison, including the MSCOCO dataset (Lin et al. 2014) and T2I-CompBench (Huang et al. 2023).
Dataset Splits | No | The paper mentions using the MSCOCO dataset and T2I-CompBench for evaluation, but it does not specify explicit training, validation, or test splits for its own experiments, nor does it refer to standard splits for these benchmarks. It only reports total sample counts for the training stages: "an extensive dataset of approximately 200 million text-image pairs", "50 million pairs of text and corresponding images", and "Ten million triplet (low-resolution image, caption, high-resolution image) samples were used to train the cascaded super-resolution model." These are overall dataset sizes for training phases, not explicit train/val/test splits.
Hardware Specification | Yes | Notably, MARS requires only 9% of the GPU days needed by SD1.5, yet it achieves remarkable results across a variety of benchmarks, illustrating its training efficiency and potential for swift deployment in various applications. Remarkably, with a mere 587 A100 GPU days, equating to only 9% of the training duration required by Stable Diffusion v1.5, MARS demonstrates its superiority over existing large-scale text-to-image (T2I) models.
Software Dependencies | No | The paper mentions using AdamW as an optimizer and DeepSpeed's ZeRO-3 optimization, but it does not provide specific version numbers for any software dependencies, libraries, or programming languages.
Experiment Setup | Yes | Experiment Details. We employ AdamW (Loshchilov and Hutter 2017) as the optimizer, with a beta parameter of 0.95 and weight decay set at 0.1. The peak learning rate is established at 1e-4, and a warm-up strategy is employed with a ratio of 0.01. For images with a resolution of 256×256 pixels, the batch size per GPU is set at 64, while for 512×512 pixel images it is set at 24, leading to total batch sizes of 4096 and 1536, respectively. The training utilized DeepSpeed's ZeRO-3 (Rajbhandari et al. 2020) optimization. The training epochs for Stage-I, Stage-II, and Stage-III of the model are configured to 1, 2, and 1 epochs, respectively.
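The reported optimizer settings can be sketched as a learning-rate schedule. The paper specifies only the peak LR (1e-4) and the warm-up ratio (0.01); the post-warm-up linear decay below is an assumption added to make the sketch complete:

```python
def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_ratio=0.01):
    """Linear warm-up to peak_lr over the first warmup_ratio of training.
    The decay-to-zero after warm-up is a hypothetical choice; the paper
    does not state the decay shape."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up linearly so early AdamW updates are small.
        return peak_lr * (step + 1) / warmup_steps
    # Assumed linear decay from peak down to zero at the final step.
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / remaining)
```

In practice this schedule would be paired with AdamW (beta2 = 0.95, weight decay 0.1 per the paper) and the per-resolution batch sizes listed above.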