SANA: Efficient High-Resolution Text-to-Image Synthesis with Linear Diffusion Transformers

Authors: Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use five mainstream evaluation metrics to evaluate the performance of our Sana, namely FID, CLIP Score, GenEval (Ghosh et al., 2024), DPG-Bench (Hu et al., 2024), and ImageReward (Xu et al., 2024), comparing it with SOTA methods. FID and CLIP Score are evaluated on the MJHQ-30K (Li et al., 2024a) dataset, which contains 30K images from Midjourney.
Researcher Affiliation | Collaboration | NVIDIA, MIT, Tsinghua University
Pseudocode | Yes | Algorithm 1: Flow-DPM-Solver (modified from DPM-Solver++)
Require: initial value x_T, time steps {t_i}_{i=0}^M, data prediction model x_θ, velocity prediction model v_θ, timestep shift factor s
1: Denote h_i := λ_{t_i} − λ_{t_{i−1}} for i = 1, ..., M
2: σ_{t_i} ← s·σ_{t_i} / (1 + (s−1)·σ_{t_i}), α_{t_i} ← 1 − σ_{t_i}  ▷ hyper-parameter and time-step transformation
3: x_θ(x_{t_i}, t_i) ← x_{t_i} − σ_{t_i}·v_θ(x_{t_i}, t_i)  ▷ model output transformation
4: x_{t_0} ← x_T; initialize an empty buffer Q
5: Q ←buffer x_θ(x_{t_0}, t_0)
6: x_{t_1} ← (σ_{t_1}/σ_{t_0})·x_{t_0} − α_{t_1}·(e^{−h_1} − 1)·x_θ(x_{t_0}, t_0)
7: Q ←buffer x_θ(x_{t_1}, t_1)
8: for i = 2 to M do
9:   r_i ← h_{i−1}/h_i
10:  D_i ← (1 + 1/(2r_i))·x_θ(x_{t_{i−1}}, t_{i−1}) − (1/(2r_i))·x_θ(x_{t_{i−2}}, t_{i−2})
11:  x_{t_i} ← (σ_{t_i}/σ_{t_{i−1}})·x_{t_{i−1}} − α_{t_i}·(e^{−h_i} − 1)·D_i
12:  if i < M then
13:    Q ←buffer x_θ(x_{t_i}, t_i)
14:  end if
15: end for
16: return x_{t_M}
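The multistep update in Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name, the use of σ_t = s·t/(1+(s−1)·t) as the shifted flow-matching schedule, and the dummy velocity model are assumptions for the sake of a runnable example.

```python
import numpy as np

def flow_dpm_solver(x_T, timesteps, v_theta, shift=3.0):
    """Sketch of a second-order multistep Flow-DPM-Solver step loop.

    x_T       : initial noise sample (array)
    timesteps : strictly decreasing times t_0 > ... > t_M, in (0, 1)
    v_theta   : velocity model v_theta(x, t)
    shift     : timestep shift factor s
    """
    s = shift
    # Shifted flow-matching schedule (assumed form):
    # sigma_t = s*t / (1 + (s-1)*t), alpha_t = 1 - sigma_t
    sigma = s * timesteps / (1.0 + (s - 1.0) * timesteps)
    alpha = 1.0 - sigma
    lam = np.log(alpha / sigma)  # half-log-SNR, lambda_t = log(alpha_t / sigma_t)

    def x_pred(x, i):
        # Data prediction from velocity: x_theta = x - sigma_t * v_theta(x, t)
        return x - sigma[i] * v_theta(x, timesteps[i])

    M = len(timesteps) - 1
    x = x_T
    buf = [x_pred(x, 0)]  # buffer Q of past data predictions
    # First step (first-order, line 6 of Algorithm 1)
    h1 = lam[1] - lam[0]
    x = (sigma[1] / sigma[0]) * x - alpha[1] * (np.exp(-h1) - 1.0) * buf[-1]
    buf.append(x_pred(x, 1))
    # Second-order multistep updates (lines 8-15)
    for i in range(2, M + 1):
        h_i = lam[i] - lam[i - 1]
        r = (lam[i - 1] - lam[i - 2]) / h_i
        D = (1.0 + 1.0 / (2.0 * r)) * buf[-1] - (1.0 / (2.0 * r)) * buf[-2]
        x = (sigma[i] / sigma[i - 1]) * x - alpha[i] * (np.exp(-h_i) - 1.0) * D
        if i < M:
            buf.append(x_pred(x, i))
    return x
```

With a trained velocity model plugged in for `v_theta`, this mirrors the buffered data-prediction updates of the pseudocode; here the correctness of the control flow, not the sample quality, is what the sketch demonstrates.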
Open Source Code | No | Code and model will be publicly released.
Open Datasets | Yes | FID and CLIP Score are evaluated on the MJHQ-30K (Li et al., 2024a) dataset, which contains 30K images from Midjourney.
Dataset Splits | No | The paper evaluates on the MJHQ-30K dataset but does not explicitly describe the training, validation, or test splits used for this dataset. It mentions training steps and resolutions but not data partitioning.
Hardware Specification | Yes | Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024x1024 resolution image. It takes only 0.37s to generate a 1024x1024 resolution image on a consumer-grade 4090 GPU, providing a powerful foundation model for real-time image generation. The speed is tested on one A100 GPU with FP16 precision.
Software Dependencies | No | The paper mentions using Triton (Tillet et al., 2019) and CUDA C++ for kernel implementation but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We train all the models with the same training setting with 52K iterations. [We use a] multi-stage training strategy to improve training stability, which involves finetuning our AE-F32C32 on 1024x1024 images. We discover a useful trick that further accelerates model convergence by initializing a small learnable scale factor (e.g., 0.01) and multiplying it by the text embedding. This adaptation occurs within merely 10K training steps, using a total batch size of 1024.