Accelerating Diffusion Transformers with Token-wise Feature Caching
Authors: Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, Linfeng Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on PixArt-α, OpenSora, DiT and FLUX demonstrate our effectiveness in both image and video generation with no requirements for training. For instance, 2.36× and 1.93× acceleration are achieved on OpenSora and PixArt-α respectively, with almost no drop in generation quality. ... Abundant experiments on PixArt-α, OpenSora, and DiT have been conducted, which demonstrate that ToCa achieves a high acceleration ratio while maintaining nearly lossless generation quality. |
| Researcher Affiliation | Academia | Chang Zou¹·², Xuyang Liu³, Ting Liu⁴, Siteng Huang⁵, Linfeng Zhang¹ — ¹Shanghai Jiao Tong University, ²University of Electronic Science & Technology of China, ³Sichuan University, ⁴National University of Defense Technology, ⁵Zhejiang University |
| Pseudocode | Yes | Algorithm 1 ToCa. Input: current timestep t, current layer id l. 1: if current timestep t is a fresh step then 2: Fully compute F_l(x). 3: C_l(x) := F_l(x); # Update the cache. 4: else 5: S(x_i) = Σ_{j=1}^{4} λ_j s_j; # Compute the cache score for each token. 6: I_Compute := TopK(S(x_i), R%); # Fetch the indices of computed tokens. 7: for all tokens x_i do 8: if i ∈ I_Compute then 9: Compute F_l(x_i) through the neural layer. 10: C_l(x_i) := F_l(x_i); # Update the cache. 11: end if 12: end for 13: end if 14: return F_l(x). # Return features for both cached and computed tokens for the next layer. |
| Open Source Code | Yes | Code: https://github.com/Shenyi-Z/ToCa ... Our codes have been released for further exploration in this domain. |
| Open Datasets | Yes | For text-to-image generation, we utilize 30,000 captions randomly selected from COCO-2017 (Lin et al., 2014) to generate an equivalent number of images. ... For class-conditional image generation, we uniformly sample from 1,000 classes in ImageNet (Deng et al., 2009) to produce 50,000 images at a resolution of 256 × 256, evaluating performance using FID-50k (Heusel et al., 2017). Additionally, we employ sFID, Precision, and Recall as supplementary metrics. ... We leverage the VBench framework (Huang et al., 2024), generating 5 videos for each of the 950 benchmark prompts under different random seeds, resulting in a total of 4,750 videos. |
| Dataset Splits | No | The paper uses subsets or generated data for evaluation metrics (e.g., 30,000 captions from COCO-2017 to generate images, 50,000 images sampled from ImageNet classes for evaluation, 5 videos for each of 950 VBench prompts). These describe the *evaluation setup* rather than explicit training/validation/test splits of a specific dataset for model training or general use. |
| Hardware Specification | Yes | We conduct experiments on three commonly-used DiT-based models across different generation tasks, including PixArt-α (Chen et al., 2024a) for text-to-image generation, OpenSora (Zheng et al., 2024) for text-to-video generation, and DiT-XL/2 (Peebles & Xie, 2023) for class-conditional image generation with NVIDIA A800 80GB GPUs. ... All of our experiments were conducted on 6 A800 GPUs, each with 80GB of memory, running CUDA version 12.1. ... The CPUs used across all experiments were 84 vCPUs from an Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz. |
| Software Dependencies | Yes | All of our experiments were conducted on 6 A800 GPUs, each with 80GB of memory, running CUDA version 12.1. The DiT model was executed in Python 3.12 with PyTorch version 2.4.0, while PixArt-α and OpenSora were run in Python 3.9. The PyTorch version for PixArt-α was 2.4.0, and for OpenSora it was 2.2.2. |
| Experiment Setup | Yes | For each model, we configure different average forced activation cycles N and average caching ratios R for ToCa as follows: PixArt-α: N = 3 and R = 70%; OpenSora: N = 3 for temporal attention, spatial attention, and MLP, and N = 6 for cross-attention, with R = 85% exclusively for MLP; DiT: N = 4 and R = 93%. ... Each model utilizes its default sampling method: DPM-Solver++ (Lu et al., 2022b) with 20 steps for PixArt-α, rflow (Liu et al., 2023) with 30 steps for OpenSora, and DDPM (Ho et al., 2020) with 250 steps for DiT-XL/2. ... For PixArt-α: We set the average forced activation cycle of ToCa to N = 2, supplemented with a dynamic adjustment parameter wt = 0.1. The parameter λt = 0.4 adjusts R at different time steps, and the average caching ratio is R = 70%. The parameter rl = 0.3 adjusts R at different depth layers. The module preference weight rtype = 1.0 shifts part of the computation from cross-attention layers to MLP layers. |
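The quoted Algorithm 1 can be expressed as a short, self-contained Python sketch. This is an illustrative reconstruction, not the authors' implementation (which is in the linked repository): the function name `toca_layer` is hypothetical, and the paper's four sub-scores s_1..s_4 are replaced here by a single feature-norm placeholder combined with weights λ_j, since the actual score definitions are not quoted above.

```python
import numpy as np

def toca_layer(x, layer_fn, cache, t, layer_id, N=3, R=0.7,
               score_weights=(0.25, 0.25, 0.25, 0.25)):
    """One ToCa-style forward pass through a single layer (sketch).

    x        : (num_tokens, dim) input token features.
    layer_fn : the layer's full computation, e.g. attention or MLP.
    cache    : dict mapping layer_id -> cached (num_tokens, dim) outputs.
    On a "fresh" step (every N timesteps, or when no cache exists yet)
    all tokens are fully computed; otherwise only the top R% of tokens
    by cache score are recomputed, and the rest reuse cached features.
    """
    if t % N == 0 or layer_id not in cache:
        out = layer_fn(x)                       # fully compute F_l(x)
        cache[layer_id] = out.copy()            # C_l(x) := F_l(x)
        return out

    # S(x_i) = sum_j lambda_j * s_j ; placeholder sub-scores: feature norm.
    scores = sum(w * np.linalg.norm(x, axis=-1) for w in score_weights)

    k = max(1, int(round(R * x.shape[0])))      # number of tokens to recompute
    idx = np.argsort(scores)[-k:]               # I_Compute := TopK(S, R%)

    out = cache[layer_id].copy()                # cached tokens reuse C_l(x_i)
    out[idx] = layer_fn(x[idx])                 # recompute selected tokens
    cache[layer_id] = out.copy()                # update the cache
    return out                                  # features for all tokens
```

With PixArt-α's reported settings (N = 3, R = 70%), roughly 70% of tokens pass through the layer on cached steps while the remainder are served from the cache, which is where the reported speedup comes from.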