A Practical Investigation of Spatially-Controlled Image Generation with Transformers
Authors: Guoxuan Xia, Harleen Hanspal, Petru-Daniel Tudosiu, Shifeng Zhang, Sarah Parisot
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive (AR) models. First, we establish control token prefilling as a simple, general and performant baseline approach for transformers. We then investigate previously underexplored sampling time enhancements, showing that extending classifier-free guidance to control, as well as softmax truncation, have a strong impact on control-generation consistency. |
| Researcher Affiliation | Collaboration | Guoxuan Xia, Harleen Hanspal, Petru-Daniel Tudosiu, EMAIL Shifeng Zhang & Sarah Parisot EMAIL Work done at Huawei Noah's Ark Lab |
| Pseudocode | No | The paper describes methods using prose and mathematical equations (Eq. 1-14) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/guoxoug/transformer-imagenet-ctrl. |
| Open Datasets | Yes | To this end, we perform controlled experiments on ImageNet (Deng et al., 2009) over two representative but contrasting generative modelling approaches... and we train and evaluate on class-conditioned ImageNet (Deng et al., 2009), a well-established benchmark for image generation. |
| Dataset Splits | Yes | For most evaluations we generate 10K samples for evaluation, conditioned on controls extracted from the first 10 images of each of the 1000 classes in the ImageNet validation dataset. In a few cases, to compare with the literature, we generate using controls from all 50K validation images. We use fixed random seeds. |
| Hardware Specification | Yes | Inference is performed on a single NVIDIA Tesla V100-32GB-SXM2. |
| Software Dependencies | No | The paper mentions using 'kornia' for Canny edge map extraction but does not provide specific version numbers for it or any other key software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We finetune for 10 epochs with control conditioning with batch size 256 (∼50K iterations) using the original optimisers and hyperparameters. Following the original papers we linearly increase guidance scale γy from zero over generation scales for VAR, whilst keeping it constant for SiT. |
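The "control token prefilling" baseline quoted in the Research Type row can be sketched very simply: tokenised spatial controls are prepended to the transformer's input sequence. The function name and the exact sequence layout below are illustrative assumptions, not taken from the released code.

```python
import numpy as np

def prefill_control_sequence(ctrl_tokens, cls_token, img_tokens):
    """Control-token prefilling: tokenised spatial controls (e.g. a depth
    or Canny edge map) are prepended to the transformer's input sequence,
    so every image token can attend to them without architectural changes.
    The [controls, class, image] layout here is an assumed illustration."""
    return np.concatenate([ctrl_tokens, [cls_token], img_tokens])
```

The appeal of this baseline, as the report notes, is its generality: it applies unchanged to both AR and diffusion/flow transformers.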
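The two sampling-time enhancements highlighted in the Research Type row (classifier-free guidance extended to control, and softmax truncation) can be sketched as follows. This is a minimal illustration: the multi-condition CFG weighting shown is one common formulation and is an assumption, since the paper's Eq. 1-14 are not reproduced in this report, and the top-k truncation is likewise a generic stand-in for the paper's exact truncation scheme.

```python
import numpy as np

def cfg_extended(eps_uncond, eps_class, eps_class_ctrl, gamma_y, gamma_c):
    """Classifier-free guidance extended to a control signal: combine
    unconditional, class-conditioned, and class+control-conditioned model
    outputs with separate scales gamma_y (class) and gamma_c (control).
    This weighting is an assumed common formulation."""
    return (eps_uncond
            + gamma_y * (eps_class - eps_uncond)
            + gamma_c * (eps_class_ctrl - eps_class))

def truncated_softmax(logits, k):
    """Softmax truncation for AR sampling: keep only the top-k logits
    before normalising, zeroing the probability of all other tokens."""
    logits = np.asarray(logits, dtype=np.float64)
    cutoff = np.sort(logits)[-k]           # k-th largest logit
    masked = np.where(logits >= cutoff, logits, -np.inf)
    exp = np.exp(masked - masked.max())    # subtract max for stability
    return exp / exp.sum()
```

Setting `gamma_c = 0` recovers standard class-only CFG, which is why the extension is a strict generalisation of the usual sampler.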
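The guidance schedules described in the Experiment Setup row can be expressed as a small helper: γy is linearly ramped from zero across generation scales for VAR, and held constant for SiT. The helper's name and signature are illustrative assumptions, not from the released code.

```python
def guidance_scale(step, num_steps, gamma_max, schedule="linear"):
    """Return the class-guidance scale gamma_y at a given generation step.

    'linear' ramps from 0 up to gamma_max across the generation scales
    (as described for VAR); 'constant' keeps gamma_max throughout (as
    described for SiT). Illustrative helper only."""
    if schedule == "linear":
        if num_steps <= 1:
            return gamma_max
        return gamma_max * step / (num_steps - 1)
    return gamma_max
```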