TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation

Authors: Gihyun Kwon, Jong Chul Ye

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results demonstrate that TweedieMix can generate high-quality multi-concept generation results in both the image and video domains. More results can be found in the experiment section. Section 5: EXPERIMENTAL RESULTS. Table 1: Quantitative Evaluation of Multi-Concept Image Generation. Figure 5: Qualitative Evaluation of Multi-Concept Image Generation. Table 2: Ablation Study on Image Generation (quantitative evaluation of the ablation study).
Researcher Affiliation | Collaboration | Gihyun Kwon, KRAFTON (EMAIL). Jong Chul Ye, Kim Jaechul Graduate School of AI, KAIST (EMAIL).
Pseudocode | No | The paper describes its methods in prose and mathematical equations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | "Results and source code are in our project page." https://github.com/KwonGihyun/TweedieMix
Open Datasets | Yes | For the evaluation dataset, we utilized the dataset proposed in the prior work, drawing from various data sources for both quantitative and qualitative analyses. For the quantitative evaluation, we selected 32 distinct concepts from the Custom Concept 101 dataset (Kumari et al., 2023), organized into 10 unique combinations.
Dataset Splits | No | The paper mentions selecting 32 distinct concepts from the Custom Concept 101 dataset for quantitative evaluation and expanding the concept pool for qualitative analysis. It also states "All the dataset contains 5–8 images per each concept." However, it does not provide specific training, validation, or test splits (e.g., percentages or counts for different sets) for the datasets used in its experiments, nor does it refer to standard predefined splits for its evaluation setup.
Hardware Specification | Yes | In terms of sampling time, it takes approximately 30 seconds using a single NVIDIA RTX 3090 GPU. This process took approximately 50 seconds on a single RTX 3090 GPU.
Software Dependencies | No | The paper mentions using "Stable Diffusion 2.1 or higher" as the backbone model and refers to specific models like the "langsam (Medeiros, 2023) package, which combines Grounding DINO (Liu et al., 2023b) and Segment-Anything models (Kirillov et al., 2023)" and "I2VGen-XL (Zhang et al., 2023b)". However, it does not provide specific version numbers for general software dependencies like programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries (e.g., CUDA) that are typically needed for reproducibility.
Experiment Setup | Yes | Regarding sampling hyperparameters, we set the reference timestep t_con for content-aware sampling to 0.8T, and we found that values between 0.8T and 0.7T did not significantly affect output quality. The total timestep is set to T = 50, and we used an image resolution of 768x768. For resampling, we used P = 10... For the video model, we used the recently proposed image-to-video model, I2VGen-XL (Zhang et al., 2023b). For video sampling, we set T = 50. The total number of frames was 16, with a resolution of 512x512. For the lowest-resolution blocks, we set η = 1, and for the first upsampling block, we set η = 0.3.
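The reported hyperparameters can be collected into a single configuration for anyone attempting a reproduction. The sketch below is illustrative only: the dictionary keys and the helper function are hypothetical names, not identifiers from the authors' released code; only the numeric values come from the paper.

```python
# Hypothetical configuration mirroring the sampling hyperparameters
# reported in the paper (key names are illustrative, not from the
# authors' repository).
IMAGE_CONFIG = {
    "total_timesteps": 50,     # T = 50
    "t_con_fraction": 0.8,     # content-aware reference timestep t_con = 0.8T
    "resolution": (768, 768),  # image resolution
    "resample_steps": 10,      # P = 10
}

VIDEO_CONFIG = {
    "backbone": "I2VGen-XL",   # image-to-video backbone (Zhang et al., 2023b)
    "total_timesteps": 50,     # T = 50
    "num_frames": 16,
    "resolution": (512, 512),
    # eta per decoder block: 1.0 at the lowest-resolution blocks,
    # 0.3 at the first upsampling block.
    "eta_lowest_res": 1.0,
    "eta_first_upsample": 0.3,
}

def content_aware_threshold(cfg: dict) -> int:
    """Integer timestep corresponding to t_con = fraction * T."""
    return int(cfg["t_con_fraction"] * cfg["total_timesteps"])
```

With these values, `content_aware_threshold(IMAGE_CONFIG)` gives 40, i.e. t_con = 0.8 * 50; the paper notes that fractions between 0.7 and 0.8 behave similarly.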