Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment

Authors: Huayu Chen, Hang Su, Peize Sun, Jun Zhu

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | Experimental results show that CCA can significantly enhance the guidance-free performance of all tested models with just one epoch of fine-tuning (~1% of pretraining epochs) on the pretraining dataset, on par with guided sampling methods. This largely removes the need for guided sampling in AR visual generation and cuts the sampling cost by half.
Researcher Affiliation | Collaboration | Huayu Chen1, Hang Su1, Peize Sun2, Jun Zhu1,3. 1Department of Computer Science & Technology, Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; 2The University of Hong Kong; 3Shengshu Technology, Beijing.
Pseudocode | Yes | Pseudocode is provided in Appendix D.
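The actual pseudocode lives in the paper's Appendix D. As an illustrative sketch only (the function name, signature, and the exact pairing of terms below are assumptions, not the paper's Eq. 12), a DPO-style condition-contrastive objective contrasts a matched (image, condition) pair against a mismatched one, each measured as a log-likelihood ratio between the fine-tuned model and a frozen reference model:

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def cca_style_loss(logp_pos: float, logp_ref_pos: float,
                   logp_neg: float, logp_ref_neg: float,
                   beta: float = 0.02, lam: float = 1.0) -> float:
    """Toy scalar sketch of a condition-contrastive objective.

    Pulls up the likelihood of a matched (image, condition) pair and
    pushes down a mismatched pair, both relative to a frozen reference
    model. Illustration only; not the paper's actual Eq. 12.
    """
    r_pos = beta * (logp_pos - logp_ref_pos)  # matched-pair log-ratio
    r_neg = beta * (logp_neg - logp_ref_neg)  # mismatched-pair log-ratio
    # lam weights the positive term; beta scales the log-ratios, matching
    # the roles of the lambda and beta = 0.02 hyperparameters reported below.
    return -(lam * math.log(sigmoid(r_pos)) + math.log(sigmoid(-r_neg)))
```

At initialization (model equals reference, all log-ratios zero) the loss is (λ + 1)·log 2, and it decreases as matched pairs gain likelihood and mismatched pairs lose it, which is the intended contrastive behavior.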
Open Source Code | Yes | Code and models: https://github.com/thu-ml/CCA. We submit our source code in the supplementary material. Code and model weights are publicly accessible.
Open Datasets | Yes | Though both are class-conditioned models pretrained on ImageNet, LlamaGen and VAR feature distinctly different tokenizer and architecture designs. We leverage CCA to fine-tune multiple LlamaGen and VAR models of various sizes on the standard ImageNet dataset.
Dataset Splits | Yes | We leverage CCA to fine-tune multiple LlamaGen and VAR models of various sizes on the standard ImageNet dataset.
Hardware Specification | Yes | We use a mix of NVIDIA H100, A100, and A40 GPUs for training.
Software Dependencies | No | The paper does not explicitly mention specific version numbers for software libraries or dependencies. It refers to specific models (LlamaGen and VAR) but not the underlying software stack with versions.
Experiment Setup | Yes | The training scheme and hyperparameters are mostly consistent with the pretraining phase. We report performance numbers after only one training epoch and find this to be sufficient for ideal performance. We fix β = 0.02 in Eq. 12 and select a suitable λ for each model. Image resolutions are 384×384 for LlamaGen and 256×256 for VAR. Following the original work, we resize LlamaGen samples to 256×256 whenever required for evaluation. Table 4 reports hyperparameters for the chosen models in Figure 1 and Figure 6. All models are fine-tuned for 1 epoch on the ImageNet dataset with batch size 256 and learning rate 1e-5 or 2e-5.
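The reported setup can be collected into a single config sketch (field names are my own, not the repository's; values are taken from the review above):

```python
# CCA fine-tuning hyperparameters as reported in the paper's setup.
# Dict layout and key names are illustrative assumptions.
CCA_FINETUNE = {
    "epochs": 1,                 # ~1% of pretraining epochs
    "batch_size": 256,
    "learning_rate": (1e-5, 2e-5),  # one of these per model
    "beta": 0.02,                # fixed across models (Eq. 12)
    "lambda": "selected per model",
    "resolution": {"LlamaGen": 384, "VAR": 256},  # square, in pixels
    "eval_resize": 256,          # LlamaGen samples resized for evaluation
}
```

Keeping β fixed while tuning only λ per model keeps the fine-tuning search space small, consistent with the claim that one epoch suffices.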