Event-Customized Image Generation

Authors: Zhen Wang, Yilei Jiang, Dong Zheng, Jun Xiao, Long Chen

ICML 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | Extensive experiments have demonstrated the effectiveness of FreeEvent. Moreover, as a pioneering effort in this direction, we also collected two evaluation benchmarks from existing datasets (i.e., SWiG (Pratt et al., 2020) and HICO-DET (Chao et al., 2015)) and the internet for event-customized image generation, dubbed SWiG-Event and Real-Event, respectively.
Researcher Affiliation | Academia | 1) Zhejiang University, Hangzhou, China; 2) The Hong Kong University of Science and Technology, Hong Kong, China. Work was done when Zhen Wang visited HKUST. Correspondence to: Long Chen <EMAIL>. All listed affiliations are academic institutions (universities).
Pseudocode | No | The paper describes the proposed method using descriptive text and architectural diagrams (Figures 2 and 3), but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the methodology described.
Open Datasets | Yes | Moreover, as a pioneering effort in this direction, we also collected two evaluation benchmarks from existing datasets (i.e., SWiG (Pratt et al., 2020) and HICO-DET (Chao et al., 2015)) and the internet for event-customized image generation, dubbed SWiG-Event and Real-Event, respectively.
Dataset Splits | No | For quantitative evaluation, we present SWiG-Event, a benchmark derived from the SWiG (Pratt et al., 2020) dataset, which comprises 5,000 samples with various events and entities, i.e., 50 kinds of different actions, poses, and interactions, where each kind of event has 100 reference images, and each reference image contains 1 to 4 entities with labeled bounding boxes and nouns. The paper describes the structure of the evaluation benchmarks but does not specify train/validation/test splits, as the proposed FreeEvent method is training-free and thus does not require such splits for model training.
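The benchmark structure described in this row can be captured as a simple record schema. The sketch below is purely illustrative: the class names, field names, and example values are assumptions, not taken from the SWiG-Event release.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """One labeled entity in a reference image (hypothetical schema)."""
    noun: str     # entity noun label, e.g. "person"
    bbox: tuple   # (x_min, y_min, x_max, y_max) bounding box in pixels

@dataclass
class EventSample:
    """One hypothetical SWiG-Event reference-image record."""
    image_path: str
    event: str                                     # one of the 50 event kinds
    entities: list = field(default_factory=list)   # 1 to 4 entities per image

# Example record: one reference image for the (hypothetical) event "kicking"
sample = EventSample(
    image_path="swig_event/kicking/0001.jpg",
    event="kicking",
    entities=[Entity("person", (10, 20, 120, 300)),
              Entity("ball", (130, 250, 170, 290))],
)
assert 1 <= len(sample.entities) <= 4  # the paper states 1-4 entities per image
```

Under this schema, the full benchmark would be 50 event kinds x 100 reference images, i.e. 5,000 such records.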
Hardware Specification | Yes | Images are generated at a resolution of 512x512 on an NVIDIA A100 GPU.
Software Dependencies | Yes | We use Stable Diffusion v2-1-base as the base model for all methods.
Experiment Setup | Yes | The denoising process uses 50 steps. For the entity-switching path, cross-attention guidance is applied to all blocks and layers containing the cross-attention module during the first 10 steps, and cross-attention regulation is applied during all 50 steps. For the event-transferring path, spatial feature injection is performed at {decoder block 1: [layer 1]} during all 50 steps, and self-attention injection at {decoder block 1: [layer 1, 2], decoder block 2: [layer 0, 1, 2], decoder block 3: [layer 0, 1, 2]} during the first 25 steps. The classifier-free guidance scale is set to 15.0.
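The per-step schedule quoted above can be written down as a small configuration sketch. The dictionary layout and helper function below are assumptions for illustration, not the authors' code; only the step counts, block/layer indices, and guidance scale come from the reported setup.

```python
# Illustrative schedule for the guidance/injection controls described above.
# All identifiers are hypothetical; the numbers follow the paper's setup.
TOTAL_STEPS = 50
CFG_SCALE = 15.0  # classifier-free guidance scale

SCHEDULE = {
    # control name: (active denoising steps, target blocks/layers)
    "cross_attention_guidance":   (range(0, 10), "all cross-attention layers"),
    "cross_attention_regulation": (range(0, 50), "all cross-attention layers"),
    "spatial_feature_injection":  (range(0, 50), {"decoder_block_1": [1]}),
    "self_attention_injection":   (range(0, 25), {"decoder_block_1": [1, 2],
                                                  "decoder_block_2": [0, 1, 2],
                                                  "decoder_block_3": [0, 1, 2]}),
}

def is_active(control: str, step: int) -> bool:
    """Return True if the given control is applied at this denoising step."""
    active_steps, _targets = SCHEDULE[control]
    return step in active_steps

# Cross-attention guidance runs only for the first 10 of the 50 steps.
assert is_active("cross_attention_guidance", 9)
assert not is_active("cross_attention_guidance", 10)
# Self-attention injection stops after the first 25 steps.
assert is_active("self_attention_injection", 24)
assert not is_active("self_attention_injection", 25)
```

A denoising loop would consult `is_active(name, t)` at each step `t` to decide which attention controls to apply, with the two paths (entity switching and event transferring) reading different entries of the schedule.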