Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment
Authors: Huayu Chen, Hang Su, Peize Sun, Jun Zhu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that CCA can significantly enhance the guidance-free performance of all tested models with just one epoch of fine-tuning (~1% of pretraining epochs) on the pretraining dataset, on par with guided sampling methods. This largely removes the need for guided sampling in AR visual generation and cuts the sampling cost by half. |
| Researcher Affiliation | Collaboration | Huayu Chen¹, Hang Su¹, Peize Sun², Jun Zhu¹,³ — ¹Department of Computer Science & Technology, Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; ²The University of Hong Kong; ³Shengshu Technology, Beijing |
| Pseudocode | Yes | Pseudo code in Appendix D. |
| Open Source Code | Yes | Code and models: https://github.com/thu-ml/CCA. We submit our source code in the supplementary material. Code and model weights are publicly accessible. |
| Open Datasets | Yes | Though both are class-conditioned models pretrained on ImageNet, LlamaGen and VAR feature distinctively different tokenizer and architecture designs. We leverage CCA to finetune multiple LlamaGen and VAR models of various sizes on the standard ImageNet dataset. |
| Dataset Splits | Yes | We leverage CCA to finetune multiple LlamaGen and VAR models of various sizes on the standard ImageNet dataset. |
| Hardware Specification | Yes | We use a mix of NVIDIA H100, NVIDIA A100, and NVIDIA A40 GPU cards for training. |
| Software Dependencies | No | The paper does not explicitly mention specific version numbers for software libraries or dependencies. It refers to specific models (LlamaGen and VAR) but not the underlying software stack with versions. |
| Experiment Setup | Yes | The training scheme and hyperparameters are mostly consistent with the pretraining phase. We report performance numbers after only one training epoch and find this to be sufficient for ideal performance. We fix β = 0.02 in Eq. 12 and select a suitable λ for each model. Image resolutions are 384×384 for LlamaGen and 256×256 for VAR. Following the original work, we resize LlamaGen samples to 256×256 whenever required for evaluation. Table 4 reports hyperparameters for the chosen models in Figure 1 and Figure 6. All models are fine-tuned for 1 epoch on the ImageNet dataset. Batch size 256, learning rate 1e-5 or 2e-5. |
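The reported setup (β = 0.02 in Eq. 12, a per-model λ, batch size 256, learning rate 1e-5 or 2e-5, one epoch) can be sketched as a loss-plus-config snippet. Note the caveat: the paper's exact Eq. 12 is not reproduced in this summary, so the loss below is only a generic DPO-style contrastive-alignment form (matched vs. mismatched condition, scored against a frozen reference model); the function name `cca_style_loss` and its exact margin expression are assumptions, not the authors' implementation.

```python
import math

# Hyperparameters reported in the summary above (see Table 4 of the paper).
BETA = 0.02            # beta fixed in Eq. 12
BATCH_SIZE = 256
LEARNING_RATE = 1e-5   # 1e-5 or 2e-5 depending on the model
EPOCHS = 1             # one epoch of fine-tuning on ImageNet (~1% of pretraining)

def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def cca_style_loss(logp_pos: float, logp_pos_ref: float,
                   logp_neg: float, logp_neg_ref: float,
                   beta: float = BETA, lam: float = 1.0) -> float:
    """Illustrative DPO-style contrastive alignment loss (NOT the paper's Eq. 12).

    logp_pos / logp_neg: log-likelihood of an image sequence under its
    MATCHED / MISMATCHED class condition, from the model being fine-tuned.
    *_ref: the same quantities under the frozen pretrained reference model.
    lam plays the role of the per-model weight lambda selected in the paper.
    """
    # Reward margin relative to the reference model: push up the matched
    # condition, push down the mismatched one.
    margin = (logp_pos - logp_pos_ref) - lam * (logp_neg - logp_neg_ref)
    return -math.log(_sigmoid(beta * margin))
```

As a sanity check, enlarging the matched-vs-mismatched margin should decrease the loss, which is the qualitative behavior any contrastive alignment objective of this shape exhibits.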