ControlAR: Controllable Image Generation with Autoregressive Models

Authors: Zongming Li, Tianheng Cheng, Shoufa Chen, Peize Sun, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate the controllability of the proposed ControlAR for autoregressive control-to-image generation across diverse inputs, including edges, depths, and segmentation masks. Furthermore, both quantitative and qualitative results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models, e.g., ControlNet++. The paper includes a dedicated section '4 EXPERIMENTS' with subsections '4.1 EXPERIMENTAL SETUP', '4.2 EXPERIMENTAL RESULTS', and '4.3 ABLATION STUDIES', detailing the use of datasets, metrics such as F1-Score, FID, RMSE, SSIM, and mIoU, and comparing performance against other methods.
Researcher Affiliation Collaboration 1) School of EIC, Huazhong University of Science and Technology; 2) Department of Computer Science, The University of Hong Kong; 3) vivo AI Lab
Pseudocode No The paper describes methods and architectures using text and diagrams (e.g., Figure 3), but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code No The code, models, and demo will soon be available at https://github.com/hustvl/ControlAR.
Open Datasets Yes For the former, we follow ControlNet (Zhang et al., 2023a) to extract the canny edges and depth maps of the images in ImageNet (Deng et al., 2009) for training. In T2I experiments, we train controllable generation for segmentation masks, canny edges, hed edges, lineart edges, and depth maps. For segmentation masks, we use ADE20K (Zhou et al., 2017) and COCO-Stuff (Caesar et al., 2018) as training data... Furthermore, we use a subset of LAION-Aesthetics (Schuhmann et al., 2022), MultiGen-20M (Qin et al., 2023b), as the training data for canny edge, hed edge, lineart edge, and depth map controllable generation.
Dataset Splits Yes The quantity of images from all datasets utilized in our experiment is detailed in Tab. 7. We utilize ImageNet-1K (Deng et al., 2009) as the training dataset for class-to-image controllable generation, encompassing a total of 1,000 classes. Table 7 (details of the datasets, training / evaluation samples): ImageNet-1K: 1,281,188 / 50,000; ADE20K: 20,210 / 2,000; COCO-Stuff: 118,287 / 500; MultiGen-20M: 2,810,616 / 5,000.
Hardware Specification Yes We use 8 NVIDIA A100 80GB GPUs to complete text-to-image controllable generation experiments based on LlamaGen-XL (Sun et al., 2024).
Software Dependencies No The paper mentions several models and optimizers (e.g., 'AdamW optimizer (Kingma, 2014)', 'LlamaGen', 'AiM', 'T5 encoder', 'MiniGPT-4', 'Mask2Former', 'DeepLabv3') but does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup Yes The learning rate is set to 1e-4 and 8e-4 for training LlamaGen and AiM, respectively. We use an image size of 256×256, with a batch size of 256 for canny edge and depth maps. In T2I experiments, we mainly use LlamaGen-XL... We employ the AdamW optimizer with a learning rate of 5e-5 and resize both input and control images to 512×512 for comparison with other methods. The batch size settings and GPU hours during training can be found in Tab. 8.
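The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch for anyone attempting a reproduction. The dictionary layout and key names below are illustrative assumptions, not the authors' actual code or config format; only the numeric values come from the paper.

```python
# Hyperparameters reported in the ControlAR paper, gathered into plain
# dictionaries. Structure and key names are illustrative assumptions.

# Class-to-image (C2I) controllable generation on ImageNet.
C2I_CONFIG = {
    "learning_rate": {"LlamaGen": 1e-4, "AiM": 8e-4},
    "image_size": (256, 256),
    "batch_size": 256,  # for canny edge and depth maps
}

# Text-to-image (T2I) controllable generation, mainly with LlamaGen-XL.
T2I_CONFIG = {
    "model": "LlamaGen-XL",
    "optimizer": "AdamW",
    "learning_rate": 5e-5,
    "image_size": (512, 512),  # input and control images are both resized
    "gpus": "8x NVIDIA A100 80GB",
}

# The T2I fine-tuning rate is smaller than either C2I rate.
print(T2I_CONFIG["learning_rate"] < min(C2I_CONFIG["learning_rate"].values()))
```

Batch sizes and GPU hours for the T2I runs are deferred to Tab. 8 of the paper and are intentionally omitted here.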