PT-T2I/V: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Image/Video Task

Authors: Jing Wang, Ao Ma, Jiasong Feng, Dawei Leng, Yuhui Yin, Xiaodan Liang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that PT-DiT achieves competitive performance while reducing computational complexity in image and video generation tasks (e.g., a 59% reduction compared to DiT and a 34% reduction compared to PixArt-α). The paper includes sections such as '4 EXPERIMENT', '4.1 EXPERIMENTAL SETUP', '4.2 QUALITATIVE ANALYSIS', '4.3 QUANTITATIVE ANALYSIS', '4.4 ALGORITHMIC EFFICIENCY COMPARISON', and '4.5 ABLATION STUDY'.
Researcher Affiliation | Collaboration | 1 Shenzhen Campus of Sun Yat-Sen University, 2 360 AI Research, 3 Peng Cheng Laboratory, 4 Guangdong Key Laboratory of Big Data Analysis and Processing. Authors are affiliated with Sun Yat-Sen University (academic) and 360 AI Research (industry).
Pseudocode | No | The paper describes the architecture and mechanisms using textual descriptions and mathematical formulas (e.g., Equations 1 and 2), along with a block diagram (Figure 4), but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | The visual exhibition and code are available at https://360cvgroup.github.io/Qihoo-T2X/. We will open-source both our models and code to support the advancement of efficient diffusion transformers.
Open Datasets | Yes | Ablation Study: We conduct ablation experiments using a class-conditional version of PT-DiT/S-Class (32M) on the ImageNet (Deng et al., 2009) benchmark at 256 resolution. We conduct experiments to quantitatively evaluate PT-T2I using zero-shot FID-30K on the MS-COCO (Lin et al., 2014) 256×256 validation dataset. We evaluate PT-T2V on two standard video generation benchmarks, MSR-VTT (Xu et al., 2016) and UCF-101 (Soomro et al., 2012), at a resolution of 256. We collect a total of 50M data points for the training set, including 32M images with an aesthetic score of 5.5 or higher from LAION (Schuhmann et al., 2022). The WebVid-10M (Bain et al., 2021) dataset is employed as the 256-resolution video training data. We utilize a subset of approximately 40k samples from G-Objaverse (Qiu et al., 2024).
Dataset Splits | Yes | We conduct experiments to quantitatively evaluate PT-T2I using zero-shot FID-30K on the MS-COCO (Lin et al., 2014) 256×256 validation dataset. We conduct ablation experiments using a class-conditional version of PT-DiT/S-Class (32M) on the ImageNet (Deng et al., 2009) benchmark at 256 resolution. For these well-known datasets, use of the 'validation dataset' or 'benchmark' implies a standard, predefined split.
Hardware Specification | Yes | Experimental tests indicate that we can train the PT-DiT/XL (1.1B) model for images at a resolution of 2048×2048 or for video at a resolution of 512×512×288 on the 64GB Ascend 910B (Huawei, 2024).
Software Dependencies | No | The paper mentions using the T5 large language model as the text encoder and the AdamW optimizer, but does not provide specific version numbers for these or any other software dependencies such as libraries or frameworks (e.g., PyTorch, CUDA).
Experiment Setup | Yes | Detailed hyper-parameter settings and the model configurations for various PT-DiT scales are provided in Appendix A.2. We train the models for 400,000 iterations with a batch size of 256, while maintaining an exponential moving average (EMA) of the model weights. During inference, we set the denoising step count to 50 and use classifier-free guidance (cfg=6.0). The training objective for PT-T2I/V is v-prediction, with an extracted text token length of 120. Table 3 lists Resolution, Data, Learning Rate, Batch Size, and Iteration for the training setups, and Table 4 details model configurations including Layers, Hidden Dim, Head Number, and Param (M).
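The two reusable pieces of the setup above, EMA weight tracking during training and classifier-free guidance at inference, follow standard diffusion practice. A minimal sketch (not the paper's implementation; the model call is omitted, and the EMA decay of 0.9999 is an assumed value the paper does not state):

```python
import numpy as np

# Reported inference settings from the paper.
CFG_SCALE = 6.0       # classifier-free guidance scale
DENOISE_STEPS = 50    # number of denoising steps

# Assumed decay value; the paper reports using EMA but not the decay rate.
EMA_DECAY = 0.9999

def classifier_free_guidance(v_cond, v_uncond, scale=CFG_SCALE):
    """Standard CFG combination of conditional and unconditional
    v-predictions: v_uncond + scale * (v_cond - v_uncond)."""
    return v_uncond + scale * (v_cond - v_uncond)

def ema_update(ema_weights, weights, decay=EMA_DECAY):
    """One EMA step over a dict of parameter arrays, as maintained
    alongside training."""
    return {k: decay * ema_weights[k] + (1.0 - decay) * weights[k]
            for k in weights}
```

At scale 6.0 the guided prediction extrapolates well past the conditional one, which matches the relatively strong guidance the paper reports using at inference.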