Visual Autoregressive Modeling for Image Super-Resolution

Authors: Yunpeng Qu, Kun Yuan, Jinhua Hao, Kai Zhao, Qizhi Xie, Ming Sun, Chao Zhou

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Quantitative and qualitative results show that VARSR is capable of generating high-fidelity and high-realism images with more efficiency than diffusion-based methods. Our codes are released at https://github.com/quyp2000/VARSR." ... "We train VARSR on our large-scale dataset with negative samples using Real-ESRGAN's degradation pipeline (Wang et al., 2021) to synthesize LR-HR image pairs."
Researcher Affiliation | Collaboration | "1Tsinghua University, Beijing, China; 2Kuaishou Technology, Beijing, China. Correspondence to: Kun Yuan <EMAIL>."
Pseudocode | No | The paper describes the methodology using textual explanations and architectural diagrams (e.g., Fig. 1, Fig. 2, Fig. 3) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Our codes are released at https://github.com/quyp2000/VARSR."
Open Datasets | Yes | "We collect a new large-scale dataset with 4 million high-quality and high-resolution images across over 3k categories, ensuring rich details and clear semantics." ... "We collect billions of images from public datasets (e.g., LAION (Schuhmann et al., 2022), DataComp (Gadre et al., 2023)) and internal datasets." ... "We create the synthetic validation set DIV2K-VAL by randomly cropping 3k patches from the DIV2K (Agustsson & Timofte, 2017) validation set, and for real-world evaluation, DRealSR (Cai et al., 2019) and RealSR (Wei et al., 2020) are center-cropped." ... "We sample 50k low-quality images from various manually annotated image quality assessment (IQA) datasets (e.g., KonIQ-10k (Hosu et al., 2020), CLIVE (Ghadiyaram & Bovik, 2016)) and the image aesthetics assessment (IAA) dataset AVA (Murray et al., 2012) as negative samples added to our database."
Dataset Splits | Yes | "We train VARSR on our large-scale dataset with negative samples using Real-ESRGAN's degradation pipeline (Wang et al., 2021) to synthesize LR-HR image pairs. Both synthetic and real-world datasets are utilized for a comprehensive evaluation. We create the synthetic validation set DIV2K-VAL by randomly cropping 3k patches from the DIV2K (Agustsson & Timofte, 2017) validation set, and for real-world evaluation, DRealSR (Cai et al., 2019) and RealSR (Wei et al., 2020) are center-cropped. Following (Wang et al., 2024a), all HR images have a resolution of 512×512, and LR images are 128×128. In training, HR images are divided into high and low-quality classes, with positive and negative embeddings cp, cn for control."
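The pairing described in this row (512×512 HR patches with 128×128 LR inputs, i.e. a 4× factor) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `random_crop` and `degrade` are hypothetical helpers, and Real-ESRGAN's actual degradation process is not reproduced here.

```python
import random

# Patch sizes quoted in the review: 512x512 HR targets, 128x128 LR inputs (4x SR).
HR_SIZE = 512
LR_SIZE = 128

def random_crop(h: int, w: int, size: int = HR_SIZE) -> tuple:
    """Pick a valid top-left corner for a size x size crop in an h x w image."""
    if h < size or w < size:
        raise ValueError("image smaller than crop size")
    top = random.randint(0, h - size)
    left = random.randint(0, w - size)
    return top, left

def degrade(hr_patch):
    """Placeholder for Real-ESRGAN's degradation pipeline followed by
    4x downsampling to LR_SIZE; the real pipeline is in the released code."""
    raise NotImplementedError
```

The crop helper only chooses coordinates; cropping and degradation would be applied to the actual image tensors in a real data loader.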
Hardware Specification | Yes | "Experiments are performed on 32 NVIDIA V100 GPUs."
Software Dependencies | No | The paper mentions using an "AdamW (Loshchilov & Hutter, 2017) optimizer" and a "GPT-2 style (Radford et al., 2019) transformer" but does not provide specific version numbers for programming languages or libraries such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | "We utilize an AdamW (Loshchilov & Hutter, 2017) optimizer with batch size = 128, weight decay = 5e-2, and learning rate = 5e-5. VQVAE, C2I pretraining, and ISR finetuning run for 10k, 40k, and 20k iterations, respectively. The loss balancing coefficient λ is 2.0, and the dropout ratio pd is 0.1. The guidance scale λs linearly increases to 6.0 as the scale increases."
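The quoted hyperparameters can be collected in a minimal sketch. The `config` dictionary only restates values from the row above; the `guidance_scale` schedule is an assumption (linear ramp starting from 1.0, since the paper's quoted text gives only the endpoint of 6.0).

```python
# Training configuration as quoted in the review (a summary sketch, not the
# authors' actual configuration file).
config = {
    "optimizer": "AdamW",          # Loshchilov & Hutter, 2017
    "batch_size": 128,
    "weight_decay": 5e-2,
    "learning_rate": 5e-5,
    "iterations": {"vqvae": 10_000, "c2i_pretrain": 40_000, "isr_finetune": 20_000},
    "loss_lambda": 2.0,            # loss balancing coefficient
    "dropout_pd": 0.1,
}

def guidance_scale(scale_idx: int, num_scales: int, max_scale: float = 6.0) -> float:
    """Linearly increase the guidance scale lambda_s to max_scale across
    autoregressive scales. The starting value of 1.0 is an assumption; only
    the 6.0 endpoint is quoted from the paper."""
    if num_scales <= 1:
        return max_scale
    return 1.0 + (max_scale - 1.0) * scale_idx / (num_scales - 1)
```

In a real training script these values would feed e.g. `torch.optim.AdamW`, but the sketch stays framework-free so it only asserts what the review quotes.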