Visual Generation Without Guidance
Authors: Huayu Chen, Kai Jiang, Kaiwen Zheng, Jianfei Chen, Hang Su, Jun Zhu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments across five distinct visual models demonstrate the effectiveness and versatility of GFT. Across domains of diffusion, autoregressive, and masked-prediction modeling, GFT consistently achieves comparable or even lower FID scores, with similar diversity-fidelity trade-offs compared with CFG baselines, all while being guidance-free. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science & Technology, Tsinghua University 2Shengshu, Beijing, China. Correspondence to: Jun Zhu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Guidance-Free Training (Diffusion) |
| Open Source Code | Yes | Code: https://github.com/thu-ml/GFT. |
| Open Datasets | Yes | We train C2I models on ImageNet 256×256 (Deng et al., 2009). For T2I models, we use a subset of the LAION-Aesthetic 5+ (Schuhmann et al., 2022), consisting of 18 million image-text pairs. Our codebases are directly modified from the official CFG implementation of each respective baseline, keeping most hyperparameters consistent with CFG training. We use official OpenAI evaluation scripts to evaluate our C2I models. For T2I models, we evaluate our model on zero-shot COCO 2014 (Lin et al., 2014). |
| Dataset Splits | Yes | For evaluation, following GigaGAN (Kang et al., 2023) and DMD (Yin et al., 2024), we generate images using 30K prompts from the COCO 2014 (Lin et al., 2014) validation set, downsample them to 256 × 256, and compare with 40,504 real images from the same validation set. |
| Hardware Specification | Yes | We use 8 × 80GB H100 GPU cards. (Table 1 caption) We employ a mix of H100, A100 and A800 GPU cards for experimentation. (Appendix D) |
| Software Dependencies | No | The paper mentions software like "DPM-Solver++ (Lu et al., 2022)" and refers to official codebases for baselines, but does not specify version numbers for any libraries or programming languages. |
| Experiment Setup | Yes | For all models, we keep training hyperparameters and other design choices consistent with their official codebases if not otherwise stated. We employ a mix of H100, A100 and A800 GPU cards for experimentation. DiT. We mainly apply GFT to fine-tune DiT-XL/2 (28 epochs, 2% of pretraining epochs) and train DiT-B/2 from scratch (80 epochs, following the original DiT paper's settings (Peebles & Xie, 2023)). Since the DiT-B/2 pretraining checkpoint is not publicly available, we reproduce its pretraining experiment. For all experiments, we use a batch size of 256 and a learning rate of 1e-4. For DiT-XL/2 fine-tuning experiments, we employ a cosine-decay learning rate scheduler. ... (and similar details for VAR, LlamaGen, MAR, and Stable Diffusion 1.5 in Appendix D) |
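The cosine-decay learning-rate schedule quoted in the Experiment Setup row (peak learning rate 1e-4 for the DiT-XL/2 fine-tuning runs) can be sketched as a standalone schedule function. The total step count and minimum learning rate below are illustrative assumptions; the excerpt does not state them.

```python
import math

def cosine_decay_lr(step, total_steps, peak_lr=1e-4, min_lr=0.0):
    """Cosine-decay schedule: starts at peak_lr and decays to min_lr.

    peak_lr=1e-4 matches the learning rate reported in the paper excerpt;
    total_steps and min_lr are hypothetical values for illustration.
    """
    progress = min(step / total_steps, 1.0)  # fraction of training completed
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# The schedule returns the peak rate at step 0, half the peak at the
# midpoint, and min_lr at the final step.
```

In practice a framework scheduler (e.g. a cosine-annealing scheduler in the training library of choice) would implement the same curve; this sketch only makes the reported hyperparameters concrete.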