Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation

Authors: Gao Peng, Le Zhuo, Dongyang Liu, Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xie, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, Tong He, He, Junjun He, Yu Qiao, Hongsheng Li

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental 3 EXPERIMENTS 3.1 VALIDATING FLAG-DIT ON IMAGENET Training Setups We perform experiments on label-conditioned 256×256 and 512×512 ImageNet (Deng et al., 2009) generation to validate the advantages of Flag-DiT over DiT (Peebles & Xie, 2023b). We train a specialized version of Flag-DiT, i.e., Flag-DiT-D, which adopts the original DDPM formulation (Ho et al., 2020; Nichol & Dhariwal, 2021) in DiT to enable a fair comparison with the original DiT. Table 1: Full comparison between Flag-DiT-D and Flag-DiT with other models on ImageNet 256×256 and 512×512 label-conditional generation.
Researcher Affiliation Academia 1Shanghai AI Laboratory 2CUHK MMLab 3CPII under InnoHK 4Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Pseudocode No The paper describes algorithms and formulations using mathematical equations and text (e.g., Equations 1-7), and architectural diagrams (Figure 1, Figure 2), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes All code and checkpoints of Lumina-T2X are released at GitHub to further foster creativity, transparency, and diversity in the generative AI community.
Open Datasets Yes We perform experiments on label-conditioned 256×256 and 512×512 ImageNet (Deng et al., 2009) generation to validate the advantages of Flag-DiT over DiT (Peebles & Xie, 2023b). Lumina-T2V is independently trained on a subset of the Panda-70M dataset (Chen et al., 2024c) and the collected Pexel dataset, comprising 15 million and 40,000 videos, respectively. We employ the LVIS subset of the Objaverse (Deitke et al., 2023) dataset, which includes approximately 40K 3D objects. For a fair and reproducible comparison against other competing methods, we use the benchmark LJSpeech dataset (Ito, 2017).
Dataset Splits No The paper mentions several datasets (ImageNet, Panda-70M, Pexel, Objaverse, LJSpeech) and their usage for training, but it does not explicitly provide details about specific training, validation, or test splits for any of these datasets. For instance, for LJSpeech, it mentions 'LJSpeech consists of 13,100 audio clips of 22050 Hz from a female speaker for about 24 hours in total,' but no split information is given.
Hardware Specification Yes Table 3: Training throughput as measured with ImageNet on a single 8×A100 machine. For the low-resolution stage, we trained the Lumina-T2MV model with a batch size of 64 for 100K iterations, while for the high-resolution stage, we trained the Lumina-T2MV model with a batch size of 16 for 180K iterations. The training is conducted on 16 NVIDIA A100 GPUs, each with 80GB of memory. The Lumina-T2Speech has been trained for 200,000 steps using 1 NVIDIA 4090 GPU with a batch size of 64 sentences.
Software Dependencies No The paper does not explicitly list specific software dependencies with version numbers (e.g., PyTorch 1.9, CUDA 11.1). It mentions using an 'adam optimizer' but not a specific library version.
Experiment Setup Yes Table 4: We compare the training setups of Lumina-T2I with PixArt-α. Lumina-T2I is trained purely on 14 million filtered high-quality (HQ) image-text pairs, whereas PixArt-α benefits from an additional 11 million high-quality natural image-text pairs. Remarkably, despite having 8.3 times more parameters, Lumina-T2I only incurs 35% of the computational costs compared to PixArt-α 0.6B. For the low-resolution stage, we trained the Lumina-T2MV model with a batch size of 64 for 100K iterations, while for the high-resolution stage, we trained the Lumina-T2MV model with a batch size of 16 for 180K iterations. The Lumina-T2Speech has been trained for 200,000 steps using 1 NVIDIA 4090 GPU with a batch size of 64 sentences. The Adam optimizer is used with β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹.
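The Adam hyperparameters quoted above (β1 = 0.9, β2 = 0.98, ϵ = 10⁻⁹) can be made concrete with a minimal sketch of a single scalar Adam update step. This is an illustrative reimplementation of the standard Adam rule (Kingma & Ba) with the paper's reported coefficients plugged in; the learning rate and function names here are assumptions for illustration, not taken from the paper.

```python
import math

# Adam coefficients as reported in the paper's training setup.
BETA1, BETA2, EPS = 0.9, 0.98, 1e-9

def adam_step(param, grad, m, v, t, lr=1e-4):
    """One scalar Adam update. `lr` is illustrative; the paper's quote
    does not specify a learning rate alongside these coefficients."""
    m = BETA1 * m + (1 - BETA1) * grad          # first-moment (mean) estimate
    v = BETA2 * v + (1 - BETA2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - BETA1 ** t)                # bias correction, step t >= 1
    v_hat = v / (1 - BETA2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + EPS)
    return param, m, v

# Example: one step from param=1.0 with gradient 0.5.
p, m, v = adam_step(1.0, 0.5, 0.0, 0.0, t=1, lr=0.1)
```

Note that β2 = 0.98 (rather than the more common 0.999) shortens the second-moment averaging window, a choice often seen in large-scale Transformer training.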