ToMA: Token Merge with Attention for Diffusion Models
Authors: Wenbo Lu, Shaoyi Zheng, Yuxuan Xia, Shengjie Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ToMA on two of the most widely used diffusion models: the UNet-based SDXL-base and the DiT-based Flux.1-dev, to generate 1024×1024 images, using the Diffusers framework. Importantly, ToMA is architecture-agnostic and can be readily extended to other diffusion models (e.g., SD2, SD3.5). Prompts are drawn from the GEMRec dataset (Guo et al., 2024) and ImageNet-1K class names (Deng et al., 2009). Metrics: To assess image quality, we use CLIP-T, DINO, and FID (Radford et al., 2021; Caron et al., 2021; Heusel et al., 2017). CLIP-T measures semantic alignment between images and prompts via cosine similarity between image-text embeddings. DINO measures perceptual consistency by comparing visual features between the original generated image and its counterpart produced with the merge method applied. FID (Fréchet Inception Distance) quantifies distributional similarity to real images based on Inception-V3 feature statistics. For FID, CLIP-T, and DINO in the main experiments shown in this section, we generate 3,000 images and compute scores against ground truths from ImageNet-1K. For the ablation experiments listed in Appendix F, we generate images from 50 prompts with three random seeds each and report the average. FID is omitted on GEMRec due to the lack of paired images. Inference latency is reported as the median wall-clock time over 100 runs. |
| Researcher Affiliation | Academia | 1Department of Computer Science, New York University. Correspondence to: Wenbo Lu <EMAIL>, Shaoyi Zheng <EMAIL>, Yuxuan Xia <EMAIL>, Shengjie Wang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Greedy Algorithm. Input: ground set V, submodular function f : 2^V → R, and budget k. Output: selected subset A of size at most k. Initialize A ← ∅; for i = 1 to k: select v* = arg max_{v ∈ V\A} f(v | A); update A ← A ∪ {v*}. Algorithm 2: Greedy Algorithm for Token Selection. Algorithm 3: ToMA with Local Regions. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing their source code, nor a link to a code repository for the methodology described. It mentions using third-party frameworks like 'Diffusers' and 'xformers' but does not offer its own implementation code. |
| Open Datasets | Yes | Prompts are drawn from the GEMRec dataset (Guo et al., 2024) and ImageNet-1K class names (Deng et al., 2009). |
| Dataset Splits | No | The paper describes generating images for evaluation (e.g., "we generate 3,000 images and compute scores against ground truths from ImageNet-1K" and "we generate images from 50 prompts with three random seeds each"), but it does not specify how the prompts or ground-truth images are split into training, validation, or test sets in a way that would allow direct reproduction of data partitioning. |
| Hardware Specification | Yes | ToMA achieves at least a 1.24× practical speedup when paired with FlashAttention-2, with state-of-the-art results across different diffusion models (e.g., SDXL-base, Flux.1-dev) and GPU architectures (NVIDIA RTX 6000, V100, RTX 8000). |
| Software Dependencies | No | The paper mentions using the 'Diffusers' framework, 'FlashAttention-2', and 'xformers', but does not specify version numbers for these software dependencies, which would be crucial for reproduction. |
| Experiment Setup | Yes | Setup: We evaluate ToMA on two of the most widely used diffusion models: the UNet-based SDXL-base and the DiT-based Flux.1-dev, to generate 1024×1024 images, using the Diffusers framework. Prompts are drawn from the GEMRec dataset (Guo et al., 2024) and ImageNet-1K class names (Deng et al., 2009). Inference latency is reported as the median wall-clock time over 100 runs. ToDo uses a fixed merge ratio of 75%, corresponding to a 4-to-1 token downsampling scheme. We reuse the destination for 10 denoising steps and reuse merge weights for 5 steps, with each block of a given type sharing one set. In Flux.1-dev there is no reuse across denoising timesteps, only within blocks of the same kind. |
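The greedy routine excerpted as Algorithm 1 above can be sketched in a few lines of Python. This is an illustrative rendering of the generic greedy subset-selection loop, not the paper's released implementation; the function and parameter names (`greedy_select`, `marginal_gain`) are assumptions.

```python
def greedy_select(ground_set, marginal_gain, k):
    """Greedy maximization of a submodular function under a budget of k items.

    ground_set: iterable of candidate items (e.g., token indices).
    marginal_gain: callable (item, selected) -> float, the gain f(item | A).
    Returns the selected subset A with |A| <= k.
    """
    selected = set()
    for _ in range(k):
        candidates = [v for v in ground_set if v not in selected]
        if not candidates:
            break
        # Pick the item with the largest marginal gain given the current set.
        best = max(candidates, key=lambda v: marginal_gain(v, selected))
        selected.add(best)
    return selected
```

For a submodular objective such as set coverage, this greedy loop is the classic (1 - 1/e)-approximation; Algorithms 2 and 3 in the paper specialize the same loop to token selection.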
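The CLIP-T metric described above reduces to a cosine similarity between CLIP image and text embeddings. A minimal sketch of that computation, assuming the embeddings have already been extracted (e.g., with a CLIP model); the function name `clip_t_score` is illustrative:

```python
import numpy as np

def clip_t_score(image_emb, text_emb):
    """Cosine similarity between one image embedding and one text embedding."""
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    # Normalize both vectors, then take their dot product.
    return float(image_emb @ text_emb /
                 (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
```

In practice the reported CLIP-T would be this score averaged over all generated image-prompt pairs.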
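The 75% merge ratio attributed to ToDo corresponds to a 4-to-1 reduction in token count. One simple way to realize such a scheme is to average each 2×2 neighborhood of the spatial token grid; this sketch is an illustration of the ratio arithmetic only, not the paper's merging method:

```python
import numpy as np

def downsample_tokens_4to1(tokens, h, w):
    """Merge tokens 4-to-1 by averaging each 2x2 patch of the (h, w) token grid.

    tokens: array of shape (h*w, d); h and w must be even.
    Returns shape (h//2 * w//2, d), i.e. a 75% reduction in token count.
    """
    d = tokens.shape[-1]
    grid = tokens.reshape(h, w, d)
    # Group into 2x2 blocks and average within each block.
    pooled = grid.reshape(h // 2, 2, w // 2, 2, d).mean(axis=(1, 3))
    return pooled.reshape(-1, d)
```

At 1024×1024 resolution the latent token grid is large, which is why a fixed 4-to-1 downsampling yields the quadratic attention savings the paper compares against.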