Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models
Authors: Alireza Ganjdanesh, Reza Shirkavand, Shangqian Gao, Heng Huang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate APTP's effectiveness by pruning Stable Diffusion (SD) V2.1 using CC3M and COCO as target datasets. APTP outperforms the single-model pruning baselines in terms of FID, CLIP, and CMMD scores. Our analysis of the clusters learned by APTP reveals they are semantically meaningful. We also show that APTP can automatically discover previously empirically found challenging prompts for SD, e.g., prompts for generating text images, assigning them to higher capacity codes. Our code is available here. |
| Researcher Affiliation | Academia | (1) Department of Computer Science, University of Maryland, College Park; (2) Department of Computer Science, Florida State University. Authors contributed equally. Correspondence to EMAIL |
| Pseudocode | No | The paper describes methods using mathematical formulations and descriptive text, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks are present. |
| Open Source Code | Yes | Our code is available here. |
| Open Datasets | Yes | We use Conceptual Captions 3M (CC3M) (Sharma et al., 2018) and MS-COCO (Lin et al., 2014) as our target datasets and prune Stable Diffusion V2.1 (Rombach et al., 2022) using APTP in our experiments. |
| Dataset Splits | Yes | We evaluate all models with FID (Heusel et al., 2017), CLIP (Hessel et al., 2021), and CMMD (Jayasumana et al., 2023) scores using 14k samples in the validation set of CC3M and 30k samples from MS-COCO's validation split. For quantitative evaluation of models pruned on CC3M, we use its validation dataset of approximately 14k samples. For COCO, we sample 30k captions of unique images from its 2014 validation dataset. |
| Hardware Specification | Yes | We measure models' MACs/Latency with the input resolution of 768 on an A100 GPU. The effective pruning batch size is 1024, achieved by training on 16 NVIDIA A100 GPUs with a local batch size of 64. |
| Software Dependencies | No | The paper mentions software components like Sentence Transformer and AdamW, but does not provide specific version numbers for these or other key software libraries and frameworks used for the experiments. |
| Experiment Setup | Yes | We train at a fixed resolution of 256 × 256 across all settings. During pruning, we first train the architecture predictor for 500 iterations as a warm-up phase. During this warm-up phase, we directly use its predicted architectures for pruning. Then, we start architecture codes and train the architecture predictor jointly with the codes for an additional 2500 iterations. We use the AdamW (Loshchilov & Hutter, 2019) optimizer and a constant learning rate of 0.0002 for both modules, with a 100-iteration linear warm-up. The effective pruning batch size is 1024, achieved by training on 16 NVIDIA A100 GPUs with a local batch size of 64. The temperature of the Gumbel-Sigmoid reparametrization (Eq. 9) is set to γ = 0.4. We set the regularization strength of the optimal transport objective (Eq. 5) to ϵ = 0.05. We use 3 iterations of the Sinkhorn-Knopp algorithm (Cuturi, 2013) to solve the optimal transport problem (Caron et al., 2020). We set the contrastive loss temperature τ to 0.03. The total pruning loss is the weighted average of DDPM loss, distillation loss, resource loss, and contrastive loss (see Eq. 15) with weights λdistill = 0.2, λres = 2.0, and λcont = 100.0. After the pruning phase, we fine-tune the experts with the prompts assigned to them for 30,000 iterations using the AdamW optimizer, a fixed learning rate of 0.00001, and a batch size of 128. Upon experiments, we observed that higher weights of the DDPM loss result in unstable fine-tuning and slow convergence. As a result, we set the DDPM loss weight in the fine-tuning loss (Eq. 30) αDDPM to 0.0001. We set αdistill = 1.0. For sample generation, we use the classifier-free guidance (Ho & Salimans, 2022) technique with the scale of 7.5 and 25 steps of the PNDM sampler (Liu et al., 2022). |
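
The Experiment Setup row reports 3 Sinkhorn-Knopp iterations with regularization strength ϵ = 0.05 for the optimal transport step. The sketch below illustrates what such a step can look like, assuming the alternating normalization commonly used with equal-mass marginals; it is not taken from the APTP code. The function name `sinkhorn_knopp`, the `scores` input (a hypothetical prompt-to-code similarity matrix), and the toy values are all illustrative assumptions.

```python
import math


def sinkhorn_knopp(scores, epsilon=0.05, n_iters=3):
    """Turn a B x K score matrix into soft assignments with balanced codes.

    scores  : list of B rows, each holding K similarity values (hypothetical
              prompt-to-architecture-code similarities).
    epsilon : entropic regularization strength (0.05 in the reported setup).
    n_iters : number of normalization rounds (3 in the reported setup).
    """
    n_samples = len(scores)
    n_codes = len(scores[0])

    # Subtract the global maximum before exponentiating for numerical safety.
    # The constant factor cancels in the normalizations below. Scores spread
    # much wider than epsilon could still underflow to zero in this sketch.
    max_score = max(max(row) for row in scores)
    q = [[math.exp((s - max_score) / epsilon) for s in row] for row in scores]

    # Normalize so the whole matrix sums to one.
    total = sum(sum(row) for row in q)
    q = [[v / total for v in row] for row in q]

    for _ in range(n_iters):
        # Column step: every code receives the same total mass, 1 / K.
        col_sums = [sum(q[i][k] for i in range(n_samples)) for k in range(n_codes)]
        q = [[q[i][k] / (col_sums[k] * n_codes) for k in range(n_codes)]
             for i in range(n_samples)]
        # Row step: every sample carries the same total mass, 1 / B.
        row_sums = [sum(row) for row in q]
        q = [[v / (row_sums[i] * n_samples) for v in row]
             for i, row in enumerate(q)]

    # Rescale so each row is a probability distribution over codes.
    return [[v * n_samples for v in row] for row in q]


# Toy example: 4 prompts, 3 codes (values are made up for illustration).
toy_scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.7, 0.2],
    [0.0, 0.1, 0.9],
]
assignments = sinkhorn_knopp(toy_scores)
```

Each row of `assignments` sums to one, so it can be read as a soft assignment of one prompt over the codes. Because the loop ends on the row step, the per-code totals are only approximately balanced after a small number of iterations.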