Event-Customized Image Generation
Authors: Zhen Wang, Yilei Jiang, Dong Zheng, Jun Xiao, Long Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments have demonstrated the effectiveness of FreeEvent. Moreover, as a pioneering effort in this direction, we also collected two evaluation benchmarks from existing datasets (i.e., SWiG (Pratt et al., 2020) and HICO-DET (Chao et al., 2015)) and the internet for event-customized image generation, dubbed SWiG-Event and Real-Event, respectively. |
| Researcher Affiliation | Academia | 1Zhejiang University, Hangzhou, China 2The Hong Kong University of Science and Technology, Hong Kong, China. Work was done when Zhen Wang visited HKUST. Correspondence to: Long Chen <EMAIL>. All listed affiliations are academic institutions (universities). |
| Pseudocode | No | The paper describes the proposed method using descriptive text and architectural diagrams (Figures 2 and 3), but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | Moreover, as a pioneering effort in this direction, we also collected two evaluation benchmarks from existing datasets (i.e., SWiG (Pratt et al., 2020) and HICO-DET (Chao et al., 2015)) and the internet for event-customized image generation, dubbed SWiG-Event and Real-Event, respectively. |
| Dataset Splits | No | For quantitative evaluation, we present SWiG-Event, a benchmark derived from the SWiG (Pratt et al., 2020) dataset, which comprises 5,000 samples with various events and entities, i.e., 50 kinds of different actions, poses, and interactions, where each kind of event has 100 reference images, and each reference image contains 1 to 4 entities with labeled bounding boxes and nouns. The paper describes the structure of the evaluation benchmarks but does not specify train/validation/test splits, as the proposed FreeEvent method is training-free and thus does not require such splits for model training. |
| Hardware Specification | Yes | Images are generated at a resolution of 512x512 on an NVIDIA A100 GPU. |
| Software Dependencies | Yes | We use Stable Diffusion v2-1-base as the base model for all methods. |
| Experiment Setup | Yes | The denoising process was set to 50 steps. For the entity switching path, for all blocks and layers containing the cross-attention module, we apply the cross-attention guidance during the first 10 steps, and apply the cross-attention regulation during all 50 steps. For the event transferring path, we perform spatial feature injection for {decoder block 1: [layer 1]} during all 50 steps, and perform self-attention injection for {decoder block 1: [layer 1, 2], decoder block 2: [layer 0, 1, 2], decoder block 3: [layer 0, 1, 2]} during the first 25 steps. We set the classifier-free guidance scale to 15.0. |
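The quoted setup amounts to a step-gated schedule over the 50 denoising steps: cross-attention guidance for the first 10 steps, cross-attention regulation and spatial feature injection for all 50, and self-attention injection for the first 25. A minimal sketch of that gating logic is below; the function and dictionary names are hypothetical (the paper releases no code), and this captures only the schedule, not the attention operations themselves.

```python
# Illustrative sketch of the step-gated control schedule described in the
# experiment setup. Names are hypothetical, not from the authors' code.

NUM_STEPS = 50        # denoising steps
GUIDANCE_SCALE = 15.0  # classifier-free guidance scale

# Each control is active for denoising steps [0, end_step).
SCHEDULE = {
    "cross_attention_guidance": 10,    # entity switching path, first 10 steps
    "cross_attention_regulation": 50,  # entity switching path, all 50 steps
    "spatial_feature_injection": 50,   # event transferring path, all 50 steps
    "self_attention_injection": 25,    # event transferring path, first 25 steps
}

def active_controls(step: int) -> set[str]:
    """Return the controls that should be applied at a given denoising step."""
    return {name for name, end in SCHEDULE.items() if step < end}
```

For example, at step 30 only the cross-attention regulation and spatial feature injection would still be active, matching the paper's description that guidance and self-attention injection stop after steps 10 and 25, respectively.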