TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation
Authors: Gihyun Kwon, Jong Chul Ye
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results demonstrate that TweedieMix can generate high-quality multi-concept generation results on both image and video domains. More results can be found in the experiment section. Section 5: EXPERIMENTAL RESULTS. Table 1: Quantitative Evaluation of Multi-Concept Image Generation. Figure 5: Qualitative Evaluation of Multi-Concept Image Generation. Table 2: Ablation Study on Image Generation. Quantitative evaluation on ablation study. |
| Researcher Affiliation | Collaboration | Gihyun Kwon, KRAFTON (EMAIL). Jong Chul Ye, Kim Jaechul Graduate School of AI, KAIST (EMAIL). |
| Pseudocode | No | The paper describes methods and formulas in prose and mathematical equations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Results and source code are in our project page: https://github.com/KwonGihyun/TweedieMix |
| Open Datasets | Yes | For the evaluation dataset, we utilized the dataset proposed in the prior work, drawing from various data sources for both quantitative and qualitative analyses. For the quantitative evaluation, we selected 32 distinct concepts from the Custom Concept 101 dataset (Kumari et al., 2023), organized into 10 unique combinations. |
| Dataset Splits | No | The paper mentions selecting 32 distinct concepts from the Custom Concept 101 dataset for quantitative evaluation and expanding the concept pool for qualitative analysis. It also states "All the dataset contains 5–8 images per each concept." However, it does not provide specific training, validation, or test splits (e.g., percentages or counts for different sets) for the datasets used in its experiments, nor does it refer to standard predefined splits for its evaluation setup. |
| Hardware Specification | Yes | In terms of sampling time, it takes approximately 30 seconds using a single NVIDIA RTX 3090 GPU. This process took approximately 50 seconds on a single RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using "Stable Diffusion 2.1 or higher" as the backbone model and refers to specific models like "langsam (Medeiros, 2023) package, which combines Grounding DINO (Liu et al., 2023b) and Segment-Anything models (Kirillov et al., 2023)" and "I2VGen-XL (Zhang et al., 2023b)". However, it does not provide specific version numbers for general software dependencies like programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries (e.g., CUDA) that are typically needed for reproducibility. |
| Experiment Setup | Yes | Regarding sampling hyperparameters, we set the reference timestep tcon for content-aware sampling to 0.8T, and we found that values between 0.8T and 0.7T did not significantly affect output quality. The total timestep is set to T=50, and we used an image resolution of 768x768. For resampling, we used P = 10... For the video model, we used the recently proposed image-to-video model, I2VGen-XL (Zhang et al., 2023b). For video sampling, we set T=50. The total number of frames was 16, with a resolution of 512x512. For the lowest resolution blocks, we set η = 1, and for the first upsampling block, we set η = 0.3. |
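The reported sampling hyperparameters can be collected into a minimal configuration sketch. This is not the authors' code; all names below are illustrative assumptions, only the numeric values come from the paper:

```python
# Hypothetical config sketch of the sampling hyperparameters reported in
# the paper; variable and key names are illustrative, not from the repo.

image_sampling = {
    "total_timesteps": 50,          # T = 50
    "t_con_ratio": 0.8,             # content-aware reference timestep t_con = 0.8T
    "resample_steps": 10,           # P = 10
    "resolution": (768, 768),
}

video_sampling = {
    "backbone": "I2VGen-XL",        # image-to-video model (Zhang et al., 2023b)
    "total_timesteps": 50,          # T = 50
    "num_frames": 16,
    "resolution": (512, 512),
    "eta_lowest_res_blocks": 1.0,   # η = 1 for the lowest-resolution blocks
    "eta_first_upsample_block": 0.3,  # η = 0.3 for the first upsampling block
}

# The reference timestep in absolute steps: t_con = 0.8 * T = 40.
t_con = int(image_sampling["t_con_ratio"] * image_sampling["total_timesteps"])
print(t_con)  # → 40
```

Per the paper, any t_con ratio in [0.7, 0.8] gives similar output quality, so the 0.8 value above is one point in that reported range rather than a tuned constant.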