Precise Parameter Localization for Textual Generation in Diffusion Models

Authors: Łukasz Staniszewski, Bartosz Cywiński, Franziska Boenisch, Kamil Deja, Adam Dziedzic

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Surprisingly, we demonstrate through attention activation patching that less than 1% of diffusion model parameters, all contained in attention layers, influence the generation of textual content within images. Building on this observation, we improve textual generation efficiency and performance by targeting the cross- and joint-attention layers of diffusion models. We introduce several applications that benefit from localizing the layers responsible for textual content generation. We first show that LoRA-based fine-tuning of only the localized layers further enhances the general text-generation capabilities of large diffusion models while preserving the quality and diversity of their generations. Then, we demonstrate how the localized layers can be used to edit textual content in generated images. Finally, we extend this idea to the practical use case of preventing the generation of toxic text in a cost-free manner. In contrast to prior work, our localization approach is broadly applicable across diverse diffusion model architectures, including U-Net-based (e.g., SDXL and DeepFloyd IF) and transformer-based (e.g., Stable Diffusion 3) models, utilizing diverse text encoders (from CLIP to large language models like T5). Project page available at https://t2i-text-loc.github.io/. Section 3 is titled "EXPERIMENTAL SETUP" and includes details on "Benchmark" and "Metrics" such as OCR F1 Score, Levenshtein distance, CLIP-T Score, MSE, SSIM, and PSNR.
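The text-centric metrics named above (OCR F1 Score, Levenshtein distance) are only mentioned in this excerpt, not defined. A minimal sketch of how they are commonly computed; the authors' exact tokenization and normalization are not specified here, so the word-level matching below is an assumption:

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ocr_f1(predicted_words: list[str], target_words: list[str]) -> float:
    """Word-level F1 between OCR output and the expected image text."""
    pred, tgt = Counter(predicted_words), Counter(target_words)
    tp = sum((pred & tgt).values())  # words matched between prediction and target
    if tp == 0:
        return 0.0
    precision = tp / sum(pred.values())
    recall = tp / sum(tgt.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `levenshtein("sign", "sing")` is 2 (two substitutions), and a prediction that recovers one of two target words gets an F1 of 2/3.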
Researcher Affiliation | Academia | Łukasz Staniszewski & Bartosz Cywiński (Warsaw University of Technology), Franziska Boenisch (CISPA Helmholtz Center for Information Security), Kamil Deja (Warsaw University of Technology & IDEAS NCBR), Adam Dziedzic (CISPA Helmholtz Center for Information Security)
Pseudocode | Yes | Appendix I ("Pseudocode for Layer Localization"): We present in Algorithm 1 our method for creating a subset of diffusion model layers that control the content of visual text generated in images. Algorithm 1: Finding the subset of layers L_ours responsible for textual content generation.
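The body of Algorithm 1 is not reproduced in this excerpt. Based on the description (attention activation patching that identifies which layers influence generated text), a plausible greedy sketch follows; `patch_and_generate` and `ocr_score` are hypothetical stand-ins for the diffusion pipeline and the OCR-based metric, and the selection criterion is an assumption:

```python
def localize_text_layers(attention_layers, patch_and_generate, ocr_score,
                         prompt_src, prompt_tgt, threshold=0.5):
    """Hypothetical sketch of layer localization: keep the attention layers
    whose activation patching (injecting activations computed for prompt_tgt
    into a prompt_src generation) makes the image text match prompt_tgt.

    patch_and_generate(layers, prompt_src, prompt_tgt) -> generated image
    ocr_score(image, prompt) -> similarity of OCR'd text to the prompt's text
    """
    selected = []
    for layer in attention_layers:
        image = patch_and_generate([layer], prompt_src, prompt_tgt)
        # A layer controls textual content if patching it alone moves the
        # OCR reading toward the target prompt's text.
        if ocr_score(image, prompt_tgt) >= threshold:
            selected.append(layer)
    return selected
```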
Open Source Code | No | Project page available at https://t2i-text-loc.github.io/.
Open Datasets | Yes | For training, we utilize a randomly chosen subset of 74,285 images from the MARIO-LAION 10M dataset (Chen et al., 2023). So that the training captions contain text that is directly presented on the corresponding training image, we construct them according to the template "An image with text saying '<text>'", where <text> consists of the OCR labels corresponding to the image. Simple Bench consists of 400 prompts following the template "A sign that says '<keyword>'.", while Creative Bench includes 400 more complex prompts adapted from GlyphDraw (Ma et al., 2023), such as "Flowers in a beautiful garden with the word '<keyword>' written." The keywords used in the benchmarks come from a pool of single-word candidates from Wikipedia and are categorized into four buckets based on their frequency: Bucket_1k^top, Bucket_10k^1k, Bucket_100k^10k, and Bucket_plus^100k. Both benchmarks contain the same set of keywords, which serve as the text that should be generated in the images. ... In the experiments, each of the source prompts p_S (we use 400 in total) contains a harmful word from LDNOOBW (2020).
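The benchmark prompts and training captions above are purely template-driven, so their construction is mechanical. A small sketch, with the quoting convention around the inserted word assumed (the excerpt does not show the exact punctuation):

```python
SIMPLE_TEMPLATE = "A sign that says '{keyword}'."
TRAIN_TEMPLATE = "An image with text saying '{text}'"

def build_simple_bench(keywords):
    """One Simple Bench prompt per keyword (400 keywords in the paper)."""
    return [SIMPLE_TEMPLATE.format(keyword=k) for k in keywords]

def build_train_caption(ocr_labels):
    """Training caption built from the OCR labels of a MARIO-LAION image."""
    return TRAIN_TEMPLATE.format(text=" ".join(ocr_labels))
```

Creative Bench prompts would be built the same way from their 400 GlyphDraw-style templates, substituting the same keyword pool.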
Dataset Splits | Yes | In this work, we use 100 prompts from each benchmark, with words from Bucket_1k^top, as a validation set, and the remaining 300 prompts as a test set. The prompts from these benchmarks serve as the source prompts p_S.
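The split described above is deterministic given the keyword buckets. A sketch, assuming each prompt is a (text, keyword) pair; how the 100 validation prompts are chosen among the top-bucket keywords is not stated in the excerpt, so first-come order is an assumption:

```python
def split_benchmark(prompts, top_bucket_keywords, n_val=100):
    """Validation = up to n_val prompts whose keyword is in the top-1k
    frequency bucket; all remaining prompts form the test set."""
    val, test = [], []
    for text, keyword in prompts:
        if keyword in top_bucket_keywords and len(val) < n_val:
            val.append((text, keyword))
        else:
            test.append((text, keyword))
    return val, test
```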
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments. There is no mention of specific GPU models, CPU models, or other computing resources such as TPUs or cloud instance types.
Software Dependencies | No | To detect text in generated images, we use the EasyOCR model. We choose a non-multi-modal method for this task to ensure that OCR-based metrics are computed purely from the text present in images. ... As a text detection model, we use DBNet (Liao et al., 2020). ... For the text returned from OCR, we calculate the toxicity score using a RoBERTa-based classifier (Liu et al., 2022). ... To that end, we use the DeepFace library (Serengil & Ozpinar, 2021), which implements methods for detecting seven basic emotions from facial expressions.
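The toxic-text evaluation described above composes an OCR model with a toxicity classifier. A stub sketch of that pipeline; `read_text` and `toxicity_score` are assumed callables standing in for EasyOCR and the RoBERTa-based classifier (their real APIs differ), and the 0.5 threshold is an assumption:

```python
def toxic_text_rate(images, read_text, toxicity_score, threshold=0.5):
    """Fraction of generated images whose OCR-extracted text is scored toxic.

    read_text(image) -> list of recognized words
    toxicity_score(text) -> score in [0, 1]
    """
    if not images:
        return 0.0
    flagged = 0
    for image in images:
        text = " ".join(read_text(image))
        # Images with no recognized text cannot be flagged as toxic.
        if text and toxicity_score(text) >= threshold:
            flagged += 1
    return flagged / len(images)
```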
Experiment Setup | Yes | For training, we utilize a randomly chosen subset of 74,285 images from the MARIO-LAION 10M dataset (Chen et al., 2023). So that the training captions contain text that is directly presented on the corresponding training image, we construct them according to the template "An image with text saying '<text>'", where <text> consists of the OCR labels corresponding to the image. We compare the performance of applying LoRA to the localized layers against the baseline adaptation approach, for which we directly follow Hu et al. (2022) and apply LoRA to all cross-attention layers. We optimize both models until convergence and evaluate the quality of model generations after successive epochs on our test set introduced in Section 3. ... In Figure 8, we plot the recall and precision metrics across training steps. Notably, even with a substantially larger dataset in the Full model 200k configuration, the model exhibits a collapse similar to what is observed when training on smaller subsets. Moreover, both recall and precision remain largely unchanged across the different setups, demonstrating the robustness of our approach of fine-tuning only specific layers. Additionally, in Figure 9, we plot the OCR F1 Score and CLIP-T metrics, highlighting that fine-tuning the localized layers, even with as few as 20k samples, yields better performance than the Full model setup trained with 200k samples. We train each setup for 12k steps with a batch size of 512 and a learning rate of 1e-6. ... To further refine the identification of text-generation capabilities in DMs, we investigate from which point in the diffusion denoising process the key and value matrices should be patched to achieve the highest text-editing performance. We present the results of this analysis in Figure 6.
We observe that when we start patching from later timesteps t, the visual attributes of the modified image are better preserved and the quality of the generated text improves, increasing its similarity to the text from the target prompt p_B. This trend aligns with the work by Hertz et al. (2023), who show that the overall structure of an image is generated only in the initial steps of the diffusion denoising process. Thus, to reduce the change in visual attributes, we apply our patching method to the localized attention layers starting from timestep t_s = 46 for SDXL, t_s = 26 for SD3, and t_s = 48 for DeepFloyd IF. Attention activations from timestep T down to t_s + 1 remain unchanged, while we patch all activations from timestep t_s down to 0.
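Since diffusion timesteps count down from T to 0, the schedule above leaves the first few (structure-forming) steps untouched and patches everything from t_s onward. A small sketch of that schedule; the total number of inference steps T per model is not stated in this excerpt, so it is left as a parameter:

```python
def patch_schedule(T: int, t_s: int) -> dict[int, bool]:
    """Map each denoising timestep t in T..0 to whether the localized
    attention activations are patched at that step: steps T..t_s+1 stay
    unchanged (False), steps t_s..0 are patched (True)."""
    return {t: t <= t_s for t in range(T, -1, -1)}

# Reported patching start points per model (from the analysis above).
T_S = {"SDXL": 46, "SD3": 26, "DeepFloyd IF": 48}
```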