SANA: Efficient High-Resolution Text-to-Image Synthesis with Linear Diffusion Transformers
Authors: Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use five mainstream evaluation metrics to evaluate the performance of our Sana, namely FID, CLIP Score, GenEval (Ghosh et al., 2024), DPG-Bench (Hu et al., 2024), and ImageReward (Xu et al., 2024), comparing it with SOTA methods. FID and CLIP Score are evaluated on the MJHQ-30K (Li et al., 2024a) dataset, which contains 30K images from Midjourney. |
| Researcher Affiliation | Collaboration | 1NVIDIA 2MIT 3Tsinghua University |
| Pseudocode | Yes | Algorithm 1 Flow-DPM-Solver (Modified from DPM-Solver++)<br>Require: initial value x_T, time steps {t_i}_{i=0}^{M}, data prediction model x_θ, velocity prediction model v_θ, timestep shift factor s<br>1: Denote h_i := λ_{t_i} − λ_{t_{i−1}} for i = 1, …, M<br>2: σ_{t_i} = s·σ_{t_i} / (1 + (s−1)·σ_{t_i}), α_{t_i} = 1 − σ_{t_i} ▹ Hyper-parameter and time-step transformation<br>3: x_θ(x_{t_i}, t_i) = x_{t_i} − σ_{t_i}·v_θ(x_{t_i}, t_i) ▹ Model output transformation<br>4: x_{t_0} ← x_T. Initialize an empty buffer Q.<br>5: Q ← buffer x_θ(x_{t_0}, t_0)<br>6: x_{t_1} ← (σ_{t_1}/σ_{t_0})·x_{t_0} − α_{t_1}·(e^{−h_1} − 1)·x_θ(x_{t_0}, t_0)<br>7: Q ← buffer x_θ(x_{t_1}, t_1)<br>8: for i = 2 to M do<br>9: r_i ← h_{i−1}/h_i<br>10: D_i ← (1 + 1/(2r_i))·x_θ(x_{t_{i−1}}, t_{i−1}) − (1/(2r_i))·x_θ(x_{t_{i−2}}, t_{i−2})<br>11: x_{t_i} ← (σ_{t_i}/σ_{t_{i−1}})·x_{t_{i−1}} − α_{t_i}·(e^{−h_i} − 1)·D_i<br>12: if i < M then<br>13: Q ← buffer x_θ(x_{t_i}, t_i)<br>14: end if<br>15: end for<br>16: return x_{t_M} |
| Open Source Code | No | Code and model will be publicly released. |
| Open Datasets | Yes | FID and CLIP Score are evaluated on the MJHQ-30K (Li et al., 2024a) dataset, which contains 30K images from Midjourney. |
| Dataset Splits | No | The paper evaluates on the MJHQ-30K dataset but does not explicitly describe the training, validation, or test splits used for this dataset. It mentions training steps and resolutions but not data partitioning. |
| Hardware Specification | Yes | Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. It takes only 0.37s to generate a 1024×1024 resolution image on a consumer-grade 4090 GPU, providing a powerful foundation model for real-time image generation. The speed is tested on one A100 GPU with FP16 precision. |
| Software Dependencies | No | The paper mentions using Triton (Tillet et al., 2019) and CUDA C++ for kernel implementation but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We train all the models with the same training setting with 52K iterations. ... a multi-stage training strategy to improve training stability, which involves finetuning our AE-F32C32 on 1024×1024 images ... we discover a useful trick that further accelerates model convergence by initializing a small learnable scale factor (e.g., 0.01) and multiplying it by the text embedding. This adaptation occurs within merely 10K training steps, using a total batch size of 1024. |
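The Flow-DPM-Solver pseudocode quoted in the Pseudocode row can be sketched in NumPy as below. This is a minimal illustration, not the authors' released implementation: the uniform sigma schedule, the `eps` clamp used to avoid `log(0)` at the endpoints, and the `v_theta(x, sigma)` call signature are all assumptions layered on top of Algorithm 1.

```python
import numpy as np

def flow_dpm_solver(x_T, v_theta, num_steps=20, shift=3.0):
    """Second-order multistep sampler sketched from SANA's Algorithm 1
    (Flow-DPM-Solver, modified from DPM-Solver++).  `v_theta(x, sigma)` is
    assumed to predict velocity under the rectified-flow parameterization
    x_t = (1 - sigma) * x_0 + sigma * noise."""
    # Uniform sigmas from 1 -> 0, then the timestep-shift transform
    # sigma' = s*sigma / (1 + (s-1)*sigma)  (step 2 of the algorithm).
    sigmas = np.linspace(1.0, 0.0, num_steps + 1)
    sigmas = shift * sigmas / (1.0 + (shift - 1.0) * sigmas)
    alphas = 1.0 - sigmas                      # alpha_t = 1 - sigma_t

    eps = 1e-8                                 # clamp to avoid log(0) at the endpoints
    lambdas = np.log(np.maximum(alphas, eps)) - np.log(np.maximum(sigmas, eps))

    def x0_pred(x, i):
        # Model-output transformation (step 3): x0 = x_t - sigma_t * v_theta.
        return x - sigmas[i] * v_theta(x, sigmas[i])

    x = x_T
    buffer = [x0_pred(x, 0)]                   # Q <- buffer x_theta(x_{t0}, t0)
    # First-order warm-up step (step 6).
    h = lambdas[1] - lambdas[0]
    x = (sigmas[1] / sigmas[0]) * x - alphas[1] * (np.exp(-h) - 1.0) * buffer[-1]
    buffer.append(x0_pred(x, 1))

    for i in range(2, num_steps + 1):
        h_prev = lambdas[i - 1] - lambdas[i - 2]
        h = lambdas[i] - lambdas[i - 1]
        r = h_prev / h
        # Second-order multistep combination (step 10).
        D = (1.0 + 1.0 / (2.0 * r)) * buffer[-1] - (1.0 / (2.0 * r)) * buffer[-2]
        x = (sigmas[i] / sigmas[i - 1]) * x - alphas[i] * (np.exp(-h) - 1.0) * D
        if i < num_steps:                      # steps 12-14
            buffer.append(x0_pred(x, i))
    return x
```

As a sanity check on a straight-line toy trajectory x_t = (1 − σ)·x_0 + σ·ε with x_0 = 3, ε = 5 (so the velocity is the constant ε − x_0 = 2), the sampler recovers x_0: `flow_dpm_solver(np.array([5.0]), lambda x, s: np.array([2.0]))` returns approximately `[3.0]`.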
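The convergence trick quoted in the Experiment Setup row (a small learnable scale factor, initialized to e.g. 0.01, multiplied onto the text embedding) can be sketched as a tiny PyTorch module. The module name and where it sits in the network are assumptions; the paper only specifies the scalar and its initialization.

```python
import torch
import torch.nn as nn

class ScaledTextEmbedding(nn.Module):
    """Sketch of SANA's convergence trick: multiply the text embedding by a
    small learnable scalar initialized to 0.01.  The class name and wiring
    are illustrative assumptions, not from the paper."""

    def __init__(self, init_scale: float = 0.01):
        super().__init__()
        # Single learnable scalar, trained jointly with the rest of the model.
        self.scale = nn.Parameter(torch.tensor(init_scale))

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Down-weights the text conditioning early in training; the optimizer
        # grows the scale as the model learns to use the text signal.
        return self.scale * text_emb
```

Starting near zero keeps the (initially noisy) text conditioning from destabilizing early training, while still letting gradient descent raise its influence within the quoted 10K adaptation steps.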