SANA: Efficient High-Resolution Text-to-Image Synthesis with Linear Diffusion Transformers

Authors: Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use five mainstream evaluation metrics to evaluate the performance of our Sana, namely FID, CLIP Score, GenEval (Ghosh et al., 2024), DPG-Bench (Hu et al., 2024), and ImageReward (Xu et al., 2024), comparing it with SOTA methods. FID and CLIP Score are evaluated on the MJHQ-30K (Li et al., 2024a) dataset, which contains 30K images from Midjourney.
Researcher Affiliation | Collaboration | NVIDIA, MIT, Tsinghua University
Pseudocode | Yes | Algorithm 1: Flow-DPM-Solver (modified from DPM-Solver++)
Require: initial value x_T, time steps {t_i}_{i=0}^M, data prediction model x_θ, velocity prediction model v_θ, timestep shift factor s
1: Denote h_i := λ_{t_i} − λ_{t_{i−1}} for i = 1, ..., M
2: σ_{t_i} ← s·σ_{t_i} / (1 + (s−1)·σ_{t_i}), α_{t_i} ← 1 − σ_{t_i}  ▷ hyper-parameter and time-step transformation
3: x_θ(x_{t_i}, t_i) ← x_{t_i} − σ_{t_i}·v_θ(x_{t_i}, t_i)  ▷ model output transformation
4: x_{t_0} ← x_T; initialize an empty buffer Q
5: Q ←buffer x_θ(x_{t_0}, t_0)
6: x_{t_1} ← (σ_{t_1}/σ_{t_0})·x_{t_0} − α_{t_1}·(e^{−h_1} − 1)·x_θ(x_{t_0}, t_0)
7: Q ←buffer x_θ(x_{t_1}, t_1)
8: for i = 2 to M do
9:   r_i ← h_{i−1}/h_i
10:  D_i ← (1 + 1/(2r_i))·x_θ(x_{t_{i−1}}, t_{i−1}) − (1/(2r_i))·x_θ(x_{t_{i−2}}, t_{i−2})
11:  x_{t_i} ← (σ_{t_i}/σ_{t_{i−1}})·x_{t_{i−1}} − α_{t_i}·(e^{−h_i} − 1)·D_i
12:  if i < M then
13:    Q ←buffer x_θ(x_{t_i}, t_i)
14:  end if
15: end for
16: return x_{t_M}
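The multistep update in Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name, the use of σ_t = s·t/(1+(s−1)·t) as the shifted flow-matching schedule, and the dummy velocity model are assumptions for the sake of a runnable example.

```python
import numpy as np

def flow_dpm_solver(x_T, timesteps, v_theta, shift=3.0):
    """Sketch of a second-order multistep Flow-DPM-Solver step loop.

    x_T       : initial noise sample (array)
    timesteps : strictly decreasing times t_0 > ... > t_M, in (0, 1)
    v_theta   : velocity model v_theta(x, t)
    shift     : timestep shift factor s
    """
    s = shift
    # Shifted flow-matching schedule (assumed form):
    # sigma_t = s*t / (1 + (s-1)*t), alpha_t = 1 - sigma_t
    sigma = s * timesteps / (1.0 + (s - 1.0) * timesteps)
    alpha = 1.0 - sigma
    lam = np.log(alpha / sigma)  # half-log-SNR, lambda_t = log(alpha_t / sigma_t)

    def x_pred(x, i):
        # Data prediction from velocity: x_theta = x - sigma_t * v_theta(x, t)
        return x - sigma[i] * v_theta(x, timesteps[i])

    M = len(timesteps) - 1
    x = x_T
    buf = [x_pred(x, 0)]  # buffer Q of past data predictions
    # First step (first-order, line 6 of Algorithm 1)
    h1 = lam[1] - lam[0]
    x = (sigma[1] / sigma[0]) * x - alpha[1] * (np.exp(-h1) - 1.0) * buf[-1]
    buf.append(x_pred(x, 1))
    # Second-order multistep updates (lines 8-15)
    for i in range(2, M + 1):
        h_i = lam[i] - lam[i - 1]
        r = (lam[i - 1] - lam[i - 2]) / h_i
        D = (1.0 + 1.0 / (2.0 * r)) * buf[-1] - (1.0 / (2.0 * r)) * buf[-2]
        x = (sigma[i] / sigma[i - 1]) * x - alpha[i] * (np.exp(-h_i) - 1.0) * D
        if i < M:
            buf.append(x_pred(x, i))
    return x
```

With a trained velocity model plugged in for `v_theta`, this mirrors the buffered data-prediction updates of the pseudocode; here the correctness of the control flow, not the sample quality, is what the sketch demonstrates.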
Open Source Code | No | Code and model will be publicly released.
Open Datasets | Yes | FID and CLIP Score are evaluated on the MJHQ-30K (Li et al., 2024a) dataset, which contains 30K images from Midjourney.
Dataset Splits | No | The paper evaluates on the MJHQ-30K dataset but does not explicitly describe the training, validation, or test splits used for this dataset. It mentions training steps and resolutions but not data partitioning.
Hardware Specification | Yes | Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024x1024 resolution image. It takes only 0.37s to generate a 1024x1024 resolution image on a consumer-grade 4090 GPU, providing a powerful foundation model for real-time image generation. The speed is tested on one A100 GPU with FP16 precision.
Software Dependencies | No | The paper mentions using Triton (Tillet et al., 2019) and CUDA C++ for kernel implementation but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We train all the models with the same training setting with 52K iterations. [We use a] multi-stage training strategy to improve training stability, which involves finetuning our AE-F32C32 on 1024x1024 images. We discover a useful trick that further accelerates model convergence by initializing a small learnable scale factor (e.g., 0.01) and multiplying it by the text embedding. This adaptation occurs within merely 10K training steps, using a total batch size of 1024.