Multimodal Promptable Token Merging for Diffusion Models

Authors: Cheng-Yao Hong, Tyng-Luh Liu

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. Extensive experiments demonstrate that MPTM significantly reduces computational cost without compromising essential information in generative image tasks. When integrated into diffusion-based detection architectures, MPTM outperforms existing state-of-the-art methods by 2.3% on object detection. When applied to multimodal diffusion models, MPTM maintains high-quality output while achieving a 2.9-fold increase in throughput, highlighting its versatility. The paper includes a dedicated section titled "4 Experiments" with quantitative results in tables (Tables 2–5) and qualitative visualizations (Figure 4).
Researcher Affiliation: Academia. "Institute of Information Science, Academia Sinica, Taiwan [EMAIL]." Both authors are affiliated with Academia Sinica, a public research institution, and their email addresses use the academic '.edu.tw' domain (iis.sinica.edu.tw).
Pseudocode: No. The paper describes the proposed method using textual explanations and mathematical formulas (e.g., Equations 1, 2, and 6–12) and provides architectural diagrams (Figure 3), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No. The paper contains no explicit statement about releasing the source code for the described methodology, nor does it provide a direct link to a code repository in the main text or in the provided abstract, introduction, or conclusion.
Open Datasets: Yes. "We perform text-to-image generation on the ImageNet-1k dataset as ToMeSD (Bolya and Hoffman 2023). ... for promptable object detection on the Large Vocabulary Instance Segmentation (LVIS) (Gupta, Dollár, and Girshick 2019) and COCO (Lin et al. 2014) datasets."
Dataset Splits: Yes. "We perform text-to-image generation on the ImageNet-1k dataset as ToMeSD (Bolya and Hoffman 2023). ... for promptable object detection on the Large Vocabulary Instance Segmentation (LVIS) (Gupta, Dollár, and Girshick 2019) and COCO (Lin et al. 2014) datasets." ImageNet-1k, LVIS, and COCO are all standard benchmark datasets with well-defined, commonly used splits.
Hardware Specification: No. "We thank National Center for High-performance Computing for providing computing resources." This statement acknowledges the use of computing resources but does not provide specific details about the hardware used (e.g., GPU or CPU models, memory).
Software Dependencies: No. The paper mentions using Stable Diffusion version 1.5 with PLMS for image generation and CLIP for prompt encoding. However, it does not provide version numbers for these or any other software libraries, frameworks, or programming languages used in the implementation.
Experiment Setup: Yes. "The stable diffusion version 1.5 (SD v1.5) (Rombach et al. 2022), involving 50 diffusion steps via PLMS, is used to generate two samples per category on ImageNet-1k, resulting in a total of 2,000 images of resolution 512×512. The classifier-free guidance scale is set at 7.5, and the prompt we used for this task is 'The image of [category name]'. ... The object detection is carried out using ResNet-50, ResNet-101 (He et al. 2016), and Swin Transformer (Liu et al. 2021) as the backbones for encoding images."
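The generation setup quoted above can be sketched as a small configuration, assuming Hugging Face diffusers as the tooling (the paper does not name its framework); the model ID, the scheduler choice, and the `make_prompt` helper are illustrative assumptions, not details taken from the paper:

```python
# Hypothetical sketch of the reported setup: SD v1.5, 50 PLMS steps,
# classifier-free guidance 7.5, 512x512, prompt "The image of [category name]".

def make_prompt(category: str) -> str:
    # Prompt template reported in the paper's experiment section
    return f"The image of {category}"

GENERATION_CONFIG = {
    "num_inference_steps": 50,  # 50 diffusion steps via PLMS
    "guidance_scale": 7.5,      # classifier-free guidance scale
    "height": 512,              # output resolution 512x512
    "width": 512,
}

# With diffusers (assumed tooling; PLMS is implemented as PNDMScheduler):
# from diffusers import StableDiffusionPipeline, PNDMScheduler
# pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)
# image = pipe(make_prompt("goldfish"), **GENERATION_CONFIG).images[0]
```

Generating two samples for each of the 1,000 ImageNet-1k categories with this configuration yields the 2,000 images the paper reports.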