Multimodal Promptable Token Merging for Diffusion Models
Authors: Cheng-Yao Hong, Tyng-Luh Liu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that MPTM significantly reduces computational costs without compromising essential information in generative image tasks. When integrated into diffusion-based detection architectures, MPTM outperforms existing state-of-the-art methods by 2.3% in object detection tasks. Additionally, when applied to multimodal diffusion models, MPTM maintains high-quality output while achieving a 2.9-fold increase in throughput, highlighting its versatility. The paper includes a dedicated section titled "4 Experiments" with quantitative results in tables (Tables 2, 3, 4, 5) and qualitative visualizations (Figure 4). |
| Researcher Affiliation | Academia | Institute of Information Science, Academia Sinica, Taiwan. Both authors are affiliated with Academia Sinica, a public research institution, and their email addresses are on the academic iis.sinica.edu.tw ('.edu.tw') domain. |
| Pseudocode | No | The paper describes the proposed method using textual explanations and mathematical formulas (e.g., equations 1, 2, 6-12) and provides architectural diagrams (Figure 3), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the described methodology, nor does it provide a direct link to a code repository in the main text or the provided abstract/introduction/conclusion. |
| Open Datasets | Yes | We perform text-to-image generation on the ImageNet-1k dataset as ToMeSD (Bolya and Hoffman 2023). ... for promptable object detection on the Large Vocabulary Instance Segmentation (LVIS) (Gupta, Dollár, and Girshick 2019) and COCO (Lin et al. 2014) datasets. |
| Dataset Splits | Yes | We perform text-to-image generation on the ImageNet-1k dataset as ToMeSD (Bolya and Hoffman 2023). ... for promptable object detection on the Large Vocabulary Instance Segmentation (LVIS) (Gupta, Dollár, and Girshick 2019) and COCO (Lin et al. 2014) datasets. ImageNet-1k, LVIS, and COCO are all standard benchmark datasets with well-defined, commonly used splits. |
| Hardware Specification | No | We thank National Center for High-performance Computing for providing computing resources. This statement acknowledges the use of computing resources but does not provide specific details about the hardware used (e.g., specific GPU or CPU models, memory, etc.). |
| Software Dependencies | No | The paper mentions using Stable Diffusion version 1.5 and PLMS for image generation, and CLIP for prompt encoding. However, it does not provide specific version numbers for these or any other software libraries, frameworks, or programming languages used for implementation. |
| Experiment Setup | Yes | The stable diffusion version 1.5 (SD v1.5) (Rombach et al. 2022), involving 50 diffusion steps via PLMS, is used to generate two samples per category on ImageNet-1k, resulting in a total of 2,000 images of resolution 512×512. The classifier-free guidance scale is set at 7.5, and the prompt we used for this task is "The image of [category name]". ... The object detection is carried out using ResNet-50, ResNet-101 (He et al. 2016), and Swin Transformer (Liu et al. 2021) as the backbones for encoding images. |
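The generation setup quoted in the Experiment Setup row could be sketched as below. This is a minimal, hypothetical reconstruction: the config keys, the checkpoint label, and the `class_<i>` category names are illustrative assumptions, not taken from the paper; only the numeric settings (50 PLMS steps, guidance 7.5, 512×512, two samples per ImageNet-1k category) and the prompt template come from the quoted text.

```python
# Hypothetical sketch of the reported setup: SD v1.5, 50 PLMS steps,
# guidance scale 7.5, two 512x512 samples for each ImageNet-1k category.
NUM_CATEGORIES = 1000        # ImageNet-1k classes
SAMPLES_PER_CATEGORY = 2     # two samples per category, per the paper

GEN_CONFIG = {
    "model": "stable-diffusion-v1-5",  # assumed checkpoint label
    "scheduler": "PLMS",
    "num_inference_steps": 50,
    "guidance_scale": 7.5,
    "height": 512,
    "width": 512,
}

def build_prompts(category_names):
    """Expand each category name into the paper's prompt template,
    repeated once per generated sample."""
    return [
        f"The image of {name}"
        for name in category_names
        for _ in range(SAMPLES_PER_CATEGORY)
    ]

# Placeholder category names; a real run would use the ImageNet-1k labels.
categories = [f"class_{i}" for i in range(NUM_CATEGORIES)]
prompts = build_prompts(categories)
print(len(prompts))  # 2000, matching the paper's total image count
```

Each prompt would then be passed to the diffusion pipeline with `GEN_CONFIG`; the point of the sketch is simply that 1,000 categories × 2 samples yields the 2,000 images reported.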