Local Conditional Controlling for Text-to-Image Diffusion Models

Authors: Yibo Zhao, Liang Peng, Yang Yang, Zekai Luo, Hengjia Li, Yao Chen, Zheng Yang, Xiaofei He, Wei Zhao, Qinglin Lu, Wei Liu, Boxi Wu

AAAI 2025

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental. "Extensive experiments demonstrate that our method can synthesize high-quality images aligned with the text prompt under local control conditions." (Section 4 Experiments; 4.1 Dataset and Evaluation; 4.2 Comparison with Baselines)
Researcher Affiliation: Collaboration. Yibo Zhao 1,2; Liang Peng 2; Yang Yang 1,2; Zekai Luo 1,2; Hengjia Li 1,2; Yao Chen 2,3; Zheng Yang 2; Xiaofei He 1,2; Wei Zhao 4; Qinglin Lu 5; Wei Liu 5; Boxi Wu 3*. Affiliations: 1 State Key Lab of CAD&CG, Zhejiang University; 2 Fabu Inc.; 3 School of Software Technology, Zhejiang University; 4 Xidian University; 5 Tencent Inc.
Pseudocode: No. The paper describes its methods in prose and mathematical equations (Eqs. 1-10) but contains no structured pseudocode or algorithm blocks.
Open Source Code: No. The paper contains no explicit statement about releasing source code and no link to a code repository.
Open Datasets: Yes. "We utilized the COCO (Lin et al. 2014) validation set with 80 object categories, selecting one random caption per image to create a dataset of 5k generated images."
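The dataset construction quoted above (one randomly chosen caption per COCO validation image, yielding 5k prompts) can be sketched as follows. This is a minimal illustration, not the authors' code: `build_prompt_set` and `captions_per_image` are hypothetical names, and in practice the caption lists would be loaded from the COCO 2014 validation annotation files.

```python
import random

def build_prompt_set(captions_per_image, num_images=5000, seed=0):
    """Pick one random caption per image to form the evaluation prompts.

    captions_per_image: dict mapping image_id -> list of caption strings
    (in practice loaded from the COCO validation annotations).
    """
    rng = random.Random(seed)  # fixed seed for a reproducible selection
    image_ids = sorted(captions_per_image)[:num_images]
    return {img_id: rng.choice(captions_per_image[img_id])
            for img_id in image_ids}
```

Each returned prompt then drives one image generation, giving the 5k-image evaluation set.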
Dataset Splits: No. The paper mentions using the COCO validation set and creating an "Attend-Condition dataset" but does not specify training/validation/test splits (percentages, counts, or an explicit splitting methodology) for the experimental evaluation.
Hardware Specification: No. The paper does not report the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies: No. The paper mentions several models and frameworks (e.g., Stable Diffusion, CLIP, BLIP-2) but does not specify the software libraries or version numbers needed to reproduce the experiments.
Experiment Setup: Yes. "Our initial objective in local control is to identify the most suitable object for generation within the control region at timestep t. The resulting object token indices are denoted as C^t_control. In our method, the sum of attention scores within the local control region is employed as the criterion: at denoising steps t > βT, we identify the object with the highest summed attention score within the local control region as C^t_control, where β is a hyperparameter acting on the total number of timesteps T. ... A β between 0.8 and 0.9 yields good results."
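The selection rule quoted above can be sketched as follows, assuming cross-attention maps are available as arrays. The names (`select_control_token`, `attn_maps`, `region_mask`) are illustrative placeholders, not taken from the authors' implementation.

```python
import numpy as np

def select_control_token(attn_maps, region_mask, object_token_indices):
    """Return the object token index with the largest summed
    cross-attention score inside the local control region."""
    # attn_maps: (num_text_tokens, H, W) cross-attention maps at timestep t
    # region_mask: (H, W) binary mask of the local control region
    scores = {idx: float((attn_maps[idx] * region_mask).sum())
              for idx in object_token_indices}
    return max(scores, key=scores.get)

def control_token_at_step(t, T, beta, attn_maps, region_mask, object_tokens):
    # The rule applies only at early denoising steps t > beta * T
    # (the paper reports beta between 0.8 and 0.9 works well).
    if t > beta * T:
        return select_control_token(attn_maps, region_mask, object_tokens)
    return None
```

Because diffusion sampling counts timesteps down from T to 0, the condition t > βT restricts the selection to the early, high-noise denoising steps.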