Visual Generation Without Guidance

Authors: Huayu Chen, Kai Jiang, Kaiwen Zheng, Jianfei Chen, Hang Su, Jun Zhu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments across five distinct visual models demonstrate the effectiveness and versatility of GFT. Across domains of diffusion, autoregressive, and masked-prediction modeling, GFT consistently achieves comparable or even lower FID scores, with similar diversity-fidelity trade-offs compared with CFG baselines, all while being guidance-free.
Researcher Affiliation | Collaboration | ¹Department of Computer Science & Technology, Tsinghua University; ²Shengshu, Beijing, China. Correspondence to: Jun Zhu <EMAIL>.
Pseudocode | Yes | Algorithm 1: Guidance-Free Training (Diffusion)
Open Source Code | Yes | Code: https://github.com/thu-ml/GFT
Open Datasets | Yes | We train C2I models on ImageNet 256×256 (Deng et al., 2009). For T2I models, we use a subset of LAION-Aesthetics 5+ (Schuhmann et al., 2022), consisting of 18 million image-text pairs. Our codebases are directly modified from the official CFG implementation of each respective baseline, keeping most hyperparameters consistent with CFG training. We use official OpenAI evaluation scripts to evaluate our C2I models. For T2I models, we evaluate our model on zero-shot COCO 2014 (Lin et al., 2014).
Dataset Splits | Yes | For evaluation, following GigaGAN (Kang et al., 2023) and DMD (Yin et al., 2024), we generate images using 30K prompts from the COCO 2014 (Lin et al., 2014) validation set, downsample them to 256 × 256, and compare with 40,504 real images from the same validation set.
Hardware Specification | Yes | We use 8 × 80GB H100 GPU cards (Table 1 caption). We employ a mix of H100, A100, and A800 GPU cards for experimentation (Appendix D).
Software Dependencies | No | The paper mentions software such as DPM-Solver++ (Lu et al., 2022) and refers to official codebases for baselines, but does not specify version numbers for any libraries or programming languages.
Experiment Setup | Yes | For all models, we keep training hyperparameters and other design choices consistent with their official codebases if not otherwise stated. We employ a mix of H100, A100, and A800 GPU cards for experimentation. DiT: We mainly apply GFT to fine-tune DiT-XL/2 (28 epochs, 2% of pretraining epochs) and train DiT-B/2 from scratch (80 epochs, following the original DiT paper's settings (Peebles & Xie, 2023)). Since the DiT-B/2 pretraining checkpoint is not publicly available, we reproduce its pretraining experiment. For all experiments, we use a batch size of 256 and a learning rate of 1e-4. For DiT-XL/2 fine-tuning experiments, we employ a cosine-decay learning rate scheduler. ... (and similar details for VAR, LlamaGen, MAR, and Stable Diffusion 1.5 in Appendix D)
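For context on what "guidance-free" removes: classifier-free guidance (CFG) forms each sampling step as a linear extrapolation from an unconditional prediction toward a conditional one, which costs two model forward passes. A minimal sketch of that standard combination rule, using made-up per-dimension noise predictions as stand-ins for real model outputs (the numbers are illustrative only, not from the paper):

```python
def cfg_combine(eps_uncond, eps_cond, w):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one with guidance scale w.
    # w = 1 recovers the plain conditional model. GFT aims to match
    # guided-sample quality with a single forward pass, so no such
    # combination is needed at inference time.
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy per-dimension noise predictions (illustrative values only):
eps_u = [0.10, -0.20, 0.05]
eps_c = [0.30, -0.10, 0.00]
print(cfg_combine(eps_u, eps_c, 1.5))  # w > 1 pushes past the conditional output
```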
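The evaluation row above reports FID against COCO validation images. For intuition, FID is the Fréchet distance between Gaussian fits of feature statistics; the closed form d² = |μ₁−μ₂|² + Tr(Σ₁+Σ₂−2(Σ₁Σ₂)^½) reduces in one dimension to the sketch below. Real FID uses Inception feature means and full covariance matrices; the statistics here are made up for illustration:

```python
import math

def fid_1d(mu1, var1, mu2, var2):
    # Frechet distance between two 1-D Gaussians. Actual FID applies
    # the multivariate version of this formula to Inception-network
    # feature statistics of generated vs. real images.
    return (mu1 - mu2) ** 2 + var1 + var2 - 2 * math.sqrt(var1 * var2)

# Illustrative statistics only (not from the paper):
print(fid_1d(0.0, 1.0, 0.5, 1.2))  # small distance for similar Gaussians
```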
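The experiment-setup row mentions a base learning rate of 1e-4 with a cosine-decay scheduler for DiT-XL/2 fine-tuning. A sketch of a standard cosine-decay schedule under those numbers; the paper does not spell out warmup or a minimum LR, so decaying to zero is an assumption here:

```python
import math

def cosine_decay_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    # Standard cosine-decay schedule: starts at base_lr and follows a
    # half cosine down to min_lr over total_steps. base_lr=1e-4 matches
    # the reported setting; min_lr=0.0 is an assumption.
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(cosine_decay_lr(0, 1000))     # base_lr at the start
print(cosine_decay_lr(500, 1000))   # roughly half of base_lr at the midpoint
print(cosine_decay_lr(1000, 1000))  # min_lr at the end
```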