ToMA: Token Merge with Attention for Diffusion Models
Authors: Wenbo Lu, Shaoyi Zheng, Yuxuan Xia, Shengjie Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ToMA on two of the most widely used diffusion models: the UNet-based SDXL-base and the DiT-based Flux.1-dev, to generate 1024×1024 images, using the Diffusers framework. Importantly, ToMA is architecture-agnostic and can be readily extended to other diffusion models (e.g., SD2, SD3.5). Prompts are drawn from the GEMRec dataset (Guo et al., 2024) and ImageNet-1K class names (Deng et al., 2009). Metrics: To assess image quality, we use CLIP-T, DINO, and FID (Radford et al., 2021; Caron et al., 2021; Heusel et al., 2017). CLIP-T measures semantic alignment between images and prompts via cosine similarity between image-text embeddings. DINO measures perceptual consistency by comparing visual features between the original generated image and its counterpart produced with the merge method applied. FID (Fréchet Inception Distance) quantifies distributional similarity to real images based on Inception-V3 feature statistics. For FID, CLIP-T, and DINO in the main experiments shown in this section, we generate 3,000 images and compute scores against ground truths from ImageNet-1K. For the ablation experiments listed in Appendix F, we generate images from 50 prompts with three random seeds each and report the average. FID is omitted on GEMRec due to the lack of paired images. Inference latency is reported as the median wall-clock time over 100 runs. |
| Researcher Affiliation | Academia | 1Department of Computer Science, New York University. Correspondence to: Wenbo Lu <EMAIL>, Shaoyi Zheng <EMAIL>, Yuxuan Xia <EMAIL>, Shengjie Wang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Greedy Algorithm. Input: ground set V, submodular function f : 2^V → R, and budget k. Output: selected subset A of size at most k. Initialize A ← ∅; for i = 1 to k: select v* = arg max_{v ∈ V\A} f(v | A); update A ← A ∪ {v*}. Algorithm 2: Greedy Algorithm for Token Selection. Algorithm 3: ToMA with Local Regions. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing their source code, nor a link to a code repository for the methodology described. It mentions using third-party frameworks like 'Diffusers' and 'xformers' but does not offer its own implementation code. |
| Open Datasets | Yes | Prompts are drawn from the GEMRec dataset (Guo et al., 2024) and ImageNet-1K class names (Deng et al., 2009). |
| Dataset Splits | No | The paper describes generating images for evaluation (e.g., "we generate 3,000 images and compute scores against ground truths from ImageNet-1K" and "we generate images from 50 prompts with three random seeds each"), but it does not specify how the prompts or ground-truth images are split into training, validation, or test sets in a way that would allow direct reproduction of data partitioning. |
| Hardware Specification | Yes | ToMA achieves at least a 1.24× practical speedup when paired with FlashAttention-2, with state-of-the-art results across different diffusion models (e.g., SDXL-base, Flux.1-dev) and GPU architectures (NVIDIA RTX 6000, V100, RTX 8000). |
| Software Dependencies | No | The paper mentions using the 'Diffusers' framework, 'FlashAttention-2', and 'xformers', but does not specify version numbers for these software dependencies, which would be crucial for reproduction. |
| Experiment Setup | Yes | Setup: We evaluate ToMA on two of the most widely used diffusion models: the UNet-based SDXL-base and the DiT-based Flux.1-dev, to generate 1024×1024 images, using the Diffusers framework. Prompts are drawn from the GEMRec dataset (Guo et al., 2024) and ImageNet-1K class names (Deng et al., 2009). Inference latency is reported as the median wall-clock time over 100 runs. ToDo uses a fixed merge ratio of 75%, corresponding to a 4-to-1 token downsampling scheme. We reuse the destination for 10 denoising steps and reuse merge weights for 5 steps, with each block of a given type sharing one set. In Flux.1-dev there is no reuse across denoising timesteps, only within blocks of the same kind. |
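The greedy routine excerpted as Algorithm 1 above can be sketched in a few lines of Python. This is an illustrative rendering of the generic greedy subset-selection loop, not the paper's released implementation; the function and parameter names (`greedy_select`, `marginal_gain`) are assumptions.

```python
def greedy_select(ground_set, marginal_gain, k):
    """Greedy maximization of a submodular function under a budget of k items.

    ground_set: iterable of candidate items (e.g., token indices).
    marginal_gain: callable (item, selected) -> float, the gain f(item | A).
    Returns the selected subset A with |A| <= k.
    """
    selected = set()
    for _ in range(k):
        candidates = [v for v in ground_set if v not in selected]
        if not candidates:
            break
        # Pick the item with the largest marginal gain given the current set.
        best = max(candidates, key=lambda v: marginal_gain(v, selected))
        selected.add(best)
    return selected
```

For a submodular objective such as set coverage, this greedy loop is the classic (1 - 1/e)-approximation; Algorithms 2 and 3 in the paper specialize the same loop to token selection.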
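The CLIP-T metric described above reduces to a cosine similarity between CLIP image and text embeddings. A minimal sketch of that computation, assuming the embeddings have already been extracted (e.g., with a CLIP model); the function name `clip_t_score` is illustrative:

```python
import numpy as np

def clip_t_score(image_emb, text_emb):
    """Cosine similarity between one image embedding and one text embedding."""
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    # Normalize both vectors, then take their dot product.
    return float(image_emb @ text_emb /
                 (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
```

In practice the reported CLIP-T would be this score averaged over all generated image-prompt pairs.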
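The 75% merge ratio attributed to ToDo corresponds to a 4-to-1 reduction in token count. One simple way to realize such a scheme is to average each 2×2 neighborhood of the spatial token grid; this sketch is an illustration of the ratio arithmetic only, not the paper's merging method:

```python
import numpy as np

def downsample_tokens_4to1(tokens, h, w):
    """Merge tokens 4-to-1 by averaging each 2x2 patch of the (h, w) token grid.

    tokens: array of shape (h*w, d); h and w must be even.
    Returns shape (h//2 * w//2, d), i.e. a 75% reduction in token count.
    """
    d = tokens.shape[-1]
    grid = tokens.reshape(h, w, d)
    # Group into 2x2 blocks and average within each block.
    pooled = grid.reshape(h // 2, 2, w // 2, 2, d).mean(axis=(1, 3))
    return pooled.reshape(-1, d)
```

At 1024×1024 resolution the latent token grid is large, which is why a fixed 4-to-1 downsampling yields the quadratic attention savings the paper compares against.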