Continuous Visual Autoregressive Generation via Score Maximization
Authors: Chenze Shao, Fandong Meng, Jie Zhou
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the ImageNet 256×256 benchmark (Deng et al., 2009) show that our approach achieves stronger visual generation quality than the traditional autoregressive Transformer that uses a discrete tokenizer. Compared to diffusion-based methods, our approach exhibits substantially higher inference efficiency, as it does not require multiple denoising iterations to recover the target distribution. |
| Researcher Affiliation | Industry | 1Pattern Recognition Center, WeChat AI, Tencent Inc. Correspondence to: Chenze Shao <EMAIL>, Fandong Meng <EMAIL>, Jie Zhou <EMAIL>. |
| Pseudocode | No | The paper describes the model architecture and components like the MLP generator using prose and mathematical equations, but it does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | Source code: https://github.com/shaochenze/EAR. |
| Open Datasets | Yes | Experiments on the ImageNet 256×256 benchmark (Deng et al., 2009) |
| Dataset Splits | No | The paper uses the ImageNet benchmark and refers to 'the evaluation suite of Dhariwal & Nichol (2021)', which implies standard practice, but it does not explicitly specify the training, validation, or test splits used for the experiments. |
| Hardware Specification | Yes | The inference time is measured on a single A100 GPU. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer but does not provide specific version numbers for any software libraries, programming languages, or other dependencies. |
| Experiment Setup | Yes | The random noise for the MLP generator has a size of dnoise = 64, independently drawn from a uniform distribution [-0.5, 0.5] at each time step. We by default set α = 1 to calculate the energy loss. We train our model for a total of 800 epochs, where the first 750 epochs use the standard energy loss and the last 50 epochs reduce the temperature τtrain to 0.99. The inference temperature τinfer is set to 0.7. Our models are optimized by the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9, β2 = 0.95. The batch size is 2048. The learning rate is 8e-4 and the constant learning rate schedule is applied with linear warmup of 100 epochs. We use a weight decay of 0.02, gradient clipping of 3.0, and dropout of 0.1 during training. |
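The reported setup combines a few concrete numerical recipes: a constant learning-rate schedule with 100-epoch linear warmup from a base rate of 8e-4, and per-step noise inputs of size dnoise = 64 drawn uniformly from [-0.5, 0.5]. A minimal sketch of those two pieces is shown below; the function names `lr_at_epoch` and `sample_noise` are hypothetical helpers, not from the paper's codebase, and only the constants are taken from the reported setup.

```python
import random

def lr_at_epoch(epoch, base_lr=8e-4, warmup_epochs=100):
    """Constant LR schedule with linear warmup, per the reported setup:
    ramp linearly to base_lr over the first `warmup_epochs` epochs,
    then hold constant."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

def sample_noise(d_noise=64, rng=None):
    """Noise input for the MLP generator: d_noise values drawn
    independently from the uniform distribution [-0.5, 0.5]."""
    rng = rng or random.Random()
    return [rng.uniform(-0.5, 0.5) for _ in range(d_noise)]

print(lr_at_epoch(0))      # early warmup: small fraction of 8e-4
print(lr_at_epoch(100))    # constant phase: 8e-4
print(len(sample_noise())) # 64 noise values per time step
```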