Multimodal Promptable Token Merging for Diffusion Models

Authors: Cheng-Yao Hong, Tyng-Luh Liu

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. Extensive experiments demonstrate that MPTM significantly reduces computational cost without compromising essential information in generative image tasks. When integrated into diffusion-based detection architectures, MPTM outperforms existing state-of-the-art methods by 2.3% on object detection. When applied to multimodal diffusion models, MPTM maintains high-quality output while achieving a 2.9-fold increase in throughput, highlighting its versatility. The paper includes a dedicated section titled "4 Experiments" with quantitative results in tables (Tables 2–5) and qualitative visualizations (Figure 4).
Researcher Affiliation: Academia. "Institute of Information Science, Academia Sinica, Taiwan [EMAIL]." Both authors are affiliated with Academia Sinica, a public research institution, and their email addresses use the academic '.edu.tw' domain (iis.sinica.edu.tw).
Pseudocode: No. The paper describes the proposed method using textual explanations and mathematical formulas (e.g., Equations 1, 2, and 6–12) and provides architectural diagrams (Figure 3), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No. The paper contains no explicit statement about releasing the source code for the described methodology, nor does it provide a direct link to a code repository in the main text or in the provided abstract, introduction, or conclusion.
Open Datasets: Yes. "We perform text-to-image generation on the ImageNet-1k dataset as ToMeSD (Bolya and Hoffman 2023). ... for promptable object detection on the Large Vocabulary Instance Segmentation (LVIS) (Gupta, Dollár, and Girshick 2019) and COCO (Lin et al. 2014) datasets."
Dataset Splits: Yes. "We perform text-to-image generation on the ImageNet-1k dataset as ToMeSD (Bolya and Hoffman 2023). ... for promptable object detection on the Large Vocabulary Instance Segmentation (LVIS) (Gupta, Dollár, and Girshick 2019) and COCO (Lin et al. 2014) datasets." ImageNet-1k, LVIS, and COCO are all standard benchmark datasets with well-defined, commonly used splits.
Hardware Specification: No. "We thank National Center for High-performance Computing for providing computing resources." This statement acknowledges the use of computing resources but does not provide specific details about the hardware used (e.g., GPU or CPU models, memory).
Software Dependencies: No. The paper mentions using Stable Diffusion version 1.5 with PLMS for image generation and CLIP for prompt encoding. However, it does not provide version numbers for these or any other software libraries, frameworks, or programming languages used in the implementation.
Experiment Setup: Yes. "The stable diffusion version 1.5 (SD v1.5) (Rombach et al. 2022), involving 50 diffusion steps via PLMS, is used to generate two samples per category on ImageNet-1k, resulting in a total of 2,000 images of resolution 512×512. The classifier-free guidance scale is set at 7.5, and the prompt we used for this task is 'The image of [category name]'. ... The object detection is carried out using ResNet-50, ResNet-101 (He et al. 2016), and Swin Transformer (Liu et al. 2021) as the backbones for encoding images."
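The generation setup quoted above can be sketched as a small configuration, assuming Hugging Face diffusers as the tooling (the paper does not name its framework); the model ID, the scheduler choice, and the `make_prompt` helper are illustrative assumptions, not details taken from the paper:

```python
# Hypothetical sketch of the reported setup: SD v1.5, 50 PLMS steps,
# classifier-free guidance 7.5, 512x512, prompt "The image of [category name]".

def make_prompt(category: str) -> str:
    # Prompt template reported in the paper's experiment section
    return f"The image of {category}"

GENERATION_CONFIG = {
    "num_inference_steps": 50,  # 50 diffusion steps via PLMS
    "guidance_scale": 7.5,      # classifier-free guidance scale
    "height": 512,              # output resolution 512x512
    "width": 512,
}

# With diffusers (assumed tooling; PLMS is implemented as PNDMScheduler):
# from diffusers import StableDiffusionPipeline, PNDMScheduler
# pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)
# image = pipe(make_prompt("goldfish"), **GENERATION_CONFIG).images[0]
```

Generating two samples for each of the 1,000 ImageNet-1k categories with this configuration yields the 2,000 images the paper reports.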