Prompt-Free Conditional Diffusion for Multi-object Image Augmentation

Authors: Haoyu Wang, Lei Zhang, Wei Wei, Chen Ding, Yanning Zhang

IJCAI 2025

Reproducibility Assessment: Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the superiority of the proposed method over several representative state-of-the-art baselines and showcase strong downstream task gain and out-of-domain generalization capabilities. We validate the proposed framework and comparison methods on the MS-COCO [Lin et al., 2014] dataset, a relatively complex object detection dataset containing 80 categories, with an average of 7.7 objects per image. We use train2017 containing 118K images to train the proposed method and generate images for downstream task evaluation, and use the COCO validation set val2017 consisting of 5K images for generation quality evaluation. We use mAP (mean Average Precision) and AP50 to evaluate the generated data, and for generation quality evaluation, we use the widely used Fréchet Inception Distance (FID) [Heusel et al., 2017] to evaluate the fidelity of the generated images. In addition, to evaluate the diversity of the generated images, we calculate the diversity score (DS) by comparing the LPIPS [Zhang et al., 2018] metric of paired images. Finally, to evaluate the object amounts of the generated images, we designed an instance quantity score (IQS) that detects the instance quantity of each category under multiple confidence settings using the pre-trained YOLOv8m [Jocher et al., 2023] and compares it with the original images.
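The diversity score (DS) quoted above averages an LPIPS distance over paired generated images. A minimal sketch of that pairwise aggregation, with a stand-in `perceptual_distance` function in place of a real LPIPS network (the function name and the mean-absolute-difference stand-in are assumptions, not the paper's implementation):

```python
from itertools import combinations

def perceptual_distance(img_a, img_b):
    # Stand-in for LPIPS: mean absolute element-wise difference between
    # two flattened images. A real DS computation would run an LPIPS
    # network on the image pair instead.
    return sum(abs(a - b) for a, b in zip(img_a, img_b)) / len(img_a)

def diversity_score(images):
    """Average pairwise distance over all image pairs (higher = more diverse)."""
    pairs = list(combinations(images, 2))
    return sum(perceptual_distance(a, b) for a, b in pairs) / len(pairs)
```

For three toy "images" `[0.0, 0.0]`, `[1.0, 1.0]`, and `[0.0, 1.0]`, the three pairwise distances are 1.0, 0.5, and 0.5, giving a DS of 2/3.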
Researcher Affiliation | Academia | Haoyu Wang¹, Lei Zhang¹, Wei Wei¹, Chen Ding² and Yanning Zhang¹; ¹Northwestern Polytechnical University, ²Xi'an University of Posts & Telecommunications
Pseudocode | Yes | Algorithm 1 (Counting Loss). Input: denoised image x_i, open-vocabulary object detector D_OV, number of categories N_c^i, text prompt S_i, class count list L_count^i, class index list L_index^i, counting loss step γ, counting loss threshold τ
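Algorithm 1's interface suggests a loss that compares per-class instance counts from the detector against target counts. A hypothetical Python sketch under that reading (here `detections` stands in for the output of the open-vocabulary detector D_OV, and the hinge on the threshold τ is an assumption about how the tolerance is applied):

```python
def counting_loss(detections, class_index_list, class_count_list, tau=0.0):
    """Penalize deviation between detected and target instance counts.

    detections: list of predicted class indices for one denoised image
                (stand-in for the detector D_OV's output).
    class_index_list / class_count_list: target classes and their counts
                (L_index^i / L_count^i in Algorithm 1).
    tau: tolerance threshold; deviations within tau incur no penalty.
    """
    loss = 0.0
    for cls, target in zip(class_index_list, class_count_list):
        detected = sum(1 for d in detections if d == cls)
        gap = abs(detected - target)
        loss += max(0.0, gap - tau)  # hinge: only penalize beyond tau
    return loss
```

For example, with detections `[0, 0, 1]`, target classes `[0, 1]`, and target counts `[2, 2]`, class 0 matches exactly while class 1 is one instance short, so the loss is 1.0 at `tau=0.0`.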
Open Source Code | No | The paper states "Code is available at here." without providing a concrete link or repository, which is insufficient for accessing the source code.
Open Datasets | Yes | We validate the proposed framework and comparison methods on the MS-COCO [Lin et al., 2014] dataset, a relatively complex object detection dataset containing 80 categories, with an average of 7.7 objects per image.
Dataset Splits | Yes | We use train2017 containing 118K images to train the proposed method and generate images for downstream task evaluation, and use the COCO validation set val2017 consisting of 5K images for generation quality evaluation.
Hardware Specification | Yes | We fine-tune the model using LoRA [Hu et al., 2021] at 512×512 resolution; we set the learning rate to 1e-4, total batch size to 32, and train on two RTX 3090 GPUs using the AdamW [Loshchilov and Hutter, 2019] optimizer with a constant scheduler.
Software Dependencies | No | The paper mentions models (Stable Diffusion XL, Grounding DINO), methods (LoRA), optimizers (AdamW), and schedulers (Euler), but does not provide specific version numbers for underlying software libraries or programming languages (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | We fine-tune the model using LoRA [Hu et al., 2021] at 512×512 resolution; we set the learning rate to 1e-4, total batch size to 32, and train on two RTX 3090 GPUs using the AdamW [Loshchilov and Hutter, 2019] optimizer with a constant scheduler. In the inference stage, we use the Euler scheduler with 50 steps for generation.
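The training and inference hyperparameters quoted above can be collected into a single configuration sketch (the dictionary key names are illustrative, not taken from the paper's code):

```python
# Training hyperparameters as reported in the paper.
# Key names are illustrative assumptions, not the authors' config schema.
TRAIN_CONFIG = {
    "finetune_method": "LoRA",       # Hu et al., 2021
    "resolution": (512, 512),
    "learning_rate": 1e-4,
    "total_batch_size": 32,
    "optimizer": "AdamW",            # Loshchilov and Hutter, 2019
    "lr_scheduler": "constant",
    "hardware": "2x RTX 3090",
}

# Inference-stage settings as reported in the paper.
INFER_CONFIG = {
    "scheduler": "Euler",
    "num_inference_steps": 50,
}
```

Collecting the reported values this way makes it easy to spot what is still missing for reproduction, e.g. library versions, which the Software Dependencies row flags as unreported.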