A Practical Investigation of Spatially-Controlled Image Generation with Transformers

Authors: Guoxuan Xia, Harleen Hanspal, Petru-Daniel Tudosiu, Shifeng Zhang, Sarah Parisot

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We perform controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive (AR) models. First, we establish control token prefilling as a simple, general and performant baseline approach for transformers. We then investigate previously underexplored sampling time enhancements, showing that extending classifier-free guidance to control, as well as softmax truncation, have a strong impact on control-generation consistency.
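The two sampling-time enhancements quoted above can be illustrated concretely. The sketch below is an assumption-laden illustration, not the paper's code: the function names, the two guidance scales (gamma_y for the class condition, gamma_c for the control condition), and the top-k form of softmax truncation are hypothetical choices for the sake of example.

```python
import math

def cfg_extended(logit_uncond, logit_class, logit_full, gamma_y, gamma_c):
    """Classifier-free guidance extended to a control signal (illustrative).

    logit_uncond: unconditional prediction
    logit_class:  prediction conditioned on class y only
    logit_full:   prediction conditioned on class y and control c
    The two scales gamma_y / gamma_c weight each conditioning step separately.
    """
    return (logit_uncond
            + gamma_y * (logit_class - logit_uncond)
            + gamma_c * (logit_full - logit_class))

def truncated_softmax(logits, k):
    """One common form of softmax truncation: keep only the k largest logits,
    zero out the rest, and renormalise the remaining probabilities."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in keep)  # subtract max for numerical stability
    exps = [math.exp(logits[i] - m) if i in keep else 0.0
            for i in range(len(logits))]
    total = sum(exps)
    return [e / total for e in exps]
```

With gamma_c = 0 the first function reduces to standard class-conditional classifier-free guidance, which is what makes the "extension to control" a strict generalisation.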
Researcher Affiliation Collaboration Guoxuan Xia, Harleen Hanspal, Petru-Daniel Tudosiu, EMAIL Shifeng Zhang & Sarah Parisot EMAIL Work done at Huawei Noah's Ark Lab
Pseudocode No The paper describes methods using prose and mathematical equations (Eq. 1-14) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code: https://github.com/guoxoug/transformer-imagenet-ctrl.
Open Datasets Yes To this end, we perform controlled experiments on ImageNet (Deng et al., 2009) over two representative but contrasting generative modelling approaches... and we train and evaluate on class-conditioned ImageNet (Deng et al., 2009), a well-established benchmark for image generation.
Dataset Splits Yes For most evaluations we generate 10K samples for evaluation, conditioned on controls extracted from the first 10 images of each of the 1000 classes in the ImageNet validation dataset. In a few cases, to compare with the literature, we generate using controls from all 50K validation images. We use fixed random seeds.
Hardware Specification Yes Inference is performed on a single NVIDIA Tesla V100-32GB-SXM2.
Software Dependencies No The paper mentions using 'kornia' for Canny edge map extraction but does not provide specific version numbers for it or any other key software dependencies like Python, PyTorch, or CUDA.
Experiment Setup Yes We finetune for 10 epochs with control conditioning with batch size 256 (∼50K iterations) using the original optimisers and hyperparameters. Following the original papers we linearly increase guidance scale γy from zero over generation scales for VAR, whilst keeping it constant for SiT.
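The quoted schedule (guidance scale γy increased linearly from zero over VAR's generation scales) can be sketched as below. This is a minimal illustration under assumptions: the function name and the convention that the final scale reaches gamma_max are hypothetical, not taken from the paper's code.

```python
def linear_guidance_schedule(scale_idx, num_scales, gamma_max):
    """Guidance scale at generation scale `scale_idx` (0-indexed),
    increasing linearly from 0 at the first scale to gamma_max at the last.
    For a constant schedule (as quoted for SiT), one would instead
    return gamma_max at every step."""
    if num_scales < 2:
        return gamma_max
    return gamma_max * scale_idx / (num_scales - 1)
```

For example, with 10 generation scales and gamma_max = 4.0, the schedule runs 0.0, ~0.44, ..., 4.0 across the coarse-to-fine scales.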